<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Byte Mining</title>
	<atom:link href="http://www.bytemining.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.bytemining.com</link>
	<description>My thoughts on data mining, machine learning, programming languages, open-source software and general nerdery.</description>
	<lastBuildDate>Tue, 15 May 2012 18:00:00 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>SIAM Data Mining 2012 Conference</title>
		<link>http://www.bytemining.com/2012/05/siam-data-mining-2012-conference/</link>
		<comments>http://www.bytemining.com/2012/05/siam-data-mining-2012-conference/#comments</comments>
		<pubDate>Tue, 15 May 2012 18:00:00 +0000</pubDate>
		<dc:creator>Ryan</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Natural Language Processing]]></category>

		<guid isPermaLink="false">http://www.bytemining.com/?p=1076</guid>
		<description><![CDATA[





Note: This would have been up a lot sooner but I have been dealing with a bug on and off for pretty much the past month!
<p></p>

From April 26-28 I had the pleasure to attend the SIAM Data Mining conference in Anaheim on the Disneyland Resort grounds. Aside from KDD2011, most of my recent conferences had been more &#8220;big data&#8221; and &#8220;data science&#8221; oriented, and I wanted to step away from the hype and just listen to talks that had more substance.
<p></p>

Attending a conference on Disneyland property was quite a bizarre experience. I wanted to get everything I could out of the conference, but the weather was so nice that I also wanted to get as much out of Disneyland as I could. Seeing adults wearing Mickey ears and carrying Mickey-shaped balloons, and seeing girls dressed up as their favorite Disney princesses, screams &#8220;fun&#8221; rather than &#8220;business&#8221;, but I managed to make time for both.

<p></p>
The first two days started with a plenary talk from industry or research labs. After a coffee break, there were the usual breakout sessions followed by lunch. During my free 90 minutes, I ran over to Disneyland and California Adventure both days to eat lunch. I managed to [...]]]></description>
			<content:encoded><![CDATA[<div>
<table>
<tbody>
<tr>
<td><img class="lfloat" src="http://www.bytemining.com/wp-content/uploads/2012/05/SDM12logo.jpg" alt="" width="400" height="453" /></td>
<td>
<div><strong>Note</strong>: This would have been up a lot sooner but I have been dealing with a bug on and off for pretty much the past month!</div>
<p></p>
<div></div>
<div>From April 26-28 I had the pleasure to attend the <a href="http://www.siam.org/meetings/sdm12/">SIAM Data Mining conference in Anaheim on the Disneyland Resort</a> grounds. Aside from <a href="http://www.sigkdd.org/kdd2011/">KDD2011</a>, most of my recent conferences had been more &#8220;big data&#8221; and &#8220;data science&#8221; oriented, and I wanted to step away from the hype and just listen to talks that had more substance.</div>
<p></p>
<div></div>
<div>Attending a conference on Disneyland property was quite a bizarre experience. I wanted to get everything I could out of the conference, but the weather was so nice that I also wanted to get as much out of Disneyland as I could. Seeing adults wearing Mickey ears and carrying Mickey-shaped balloons, and seeing girls dressed up as their favorite Disney princesses, screams &#8220;fun&#8221; rather than &#8220;business&#8221;, but I managed to make time for both.</div>
<div></div>
<p></p>
<div>The first two days started with a plenary talk from industry or research labs. After a coffee break, there were the usual breakout sessions followed by lunch. During my free 90 minutes, I ran over to Disneyland and California Adventure both days to eat lunch. I managed to run there, wait in line, guide myself through crowds, wait in line, get my food, eat it, and run back to the conference in 90 minutes on a weekend. After lunch on the first two days was another plenary session followed by breakout sessions. The evening of the first two days was reserved for poster sessions. Saturday hosted half-day and full-day workshops.</div>
<div></div>
<p></p>
<div>Below is my summary of the conference. Of course, such a summary is very high level; my description may miss things, or may not be entirely correct if I misunderstood the speaker.</div>
</td>
</tr>
</tbody>
</table>
</div>
<div></div>
<div><strong>Plenary Talks</strong></div>
<div><strong><br />
</strong></div>
<div>Bharat Rao from SIEMENS provided the first plenary talk bright and early the first day of the conference. I only got to see the first half as I could not wake up. His talk was about privacy-preserving data mining in medicine using matrix factorization. Although privacy has become an important issue in data mining, I do not totally buy that it is entirely necessary. The idea is that observations should not be personally identifiable. I personally do not agree that such privacy measures are necessary when only a computer system is using the data, and not an individual person. Besides, with such massive amounts of data, someone digging through gigs and gigs of personally identifiable data to find one person&#8217;s data does not seem like a viable threat. My thoughts are similar to those on the Netflix grand challenge dataset lawsuit.</div>
<p></p>
<div></div>
<div>The second plenary talk came from <a href="http://nosh.northwestern.edu/">Noshir Contractor</a>. The main point of his work seemed to be how to build effective teams using graphs and data about each of the candidates for such a team. This did not excite me in itself, but the data his team used, and some of the things they learned from it, did. The first part of the talk discussed research into NSF grants and the types of collaboration that are more likely to lead to the awarding of such grants. His group found that women were more likely to be collaborators on awarded proposals and that multidisciplinary teams were more likely to be funded. Some analogous work involved the <a href="http://dmitriwilliams.com/Farming.pdf">detection of &#8220;gold farmers&#8221; on the MMORPG game Everquest 2</a>. Gold farming involves gathering virtual goods and selling them for real cash. Interestingly, Contractor&#8217;s group found that the graph signatures present in gold farming are remarkably similar to those present with drug trafficking. There were a few other interesting tidbits that the group found. They found that a great number of players only play with friends and are somewhat disconnected from the rest of the game graph. Also, male-male and female-male graph links were very common, but female-female links were uncommon. Contractor hypothesized that the male-male relationships were obvious (men are more likely to play computer games) and that women often play the game with men because it was the only way for them to get time with their significant others.</div>
<p></p>
<div></div>
<div>The Friday morning talk on <a href="http://en.wikipedia.org/wiki/Inductive_transfer">transfer learning</a> came from <a href="http://www.cse.ust.hk/~qyang/">Qiang Yang from Hong Kong University of Science and Technology</a>. Transfer learning in this context discussed how to adapt models developed in one domain to data from another domain. Transfer learning seems to be picking up steam in Machine Learning, but anybody with training in Statistics can tell you that it really is just <a href="http://en.wikipedia.org/wiki/Latent_variable">latent variable analysis</a>. Of course, transfer learning applies more to learning classifiers than building descriptive models of data. The speaker&#8217;s proposed method is called <a href="http://www.cse.ust.hk/~qyang/Docs/2009/TCA.pdf">Transfer Component Analysis (TCA)</a> which is similar to, of course, <a href="http://en.wikipedia.org/wiki/Principal_component_analysis">Principal Component Analysis (PCA)</a>. Yang found that semi-supervised TCA was useful for <a href="http://en.wikipedia.org/wiki/Sentiment_analysis">sentiment analysis</a> in a transfer learning context. A common use of transfer learning is mapping a text classifier to an image classifier where we have few labeled instances in the image domain. We can then use unlabeled source data (text) in a semi-supervised way to create a better classifier in the image domain.</div>
<p></p>
<div></div>
<div>The last plenary talk came from <a href="http://research.microsoft.com/en-us/um/people/sdumais/">Susan Dumais from Microsoft Research</a> who discussed <a href="http://research.microsoft.com/en-us/um/people/sdumais/SIAM2012-Keynote-Dumais_Share.pdf">temporal dynamics and information retrieval</a>. The talk basically discussed how to mine important concepts over time from data streams. One part of her research was discovering the staying power of certain words. Susan has noticed four distinct word behaviors based on how the density of the word&#8217;s usage changes over time: fast, hybrid, medium, and slow. Susan&#8217;s research also studies how often people <em>revisit </em>certain webpages and why. Presumably revisits are an alternative measure of influence to in-links and out-links used in PageRank (remember, Microsoft has its own anti-Google search engine). Studying temporal behavior of web visits and keyword usage is important because current methods consider only a snapshot of the web with very little evolution. Susan stated that a great page is defined as a mixture of bags of words that are formed based on page changes. Such research is important because query relevance changes over time. For example, a query of <em>US Open </em>refers to golf at certain times of the year and tennis at others. The query <em>March Madness </em>should probably return ticket prices <em>before </em>the event, scores <em>during </em>the event, and Wikipedia or sports articles recapping the event <em>after </em>the event.</div>
<p></p>
<div></div>
<div><strong>Social Media</strong></div>
<div><strong><br />
</strong></div>
<div>Social media has a session at pretty much every academic conference these days. The speakers in this session used social media data to test their hypotheses, and such talks are always interesting. One talk discussed <a href="http://engineering.asu.edu/sites/default/files/shared/ASUCISE-2011-005.pdf">a feature selection technique for social processes</a> using data from Twitter. The method in the paper uses user-post relations (favorites, retweets, replies) and user-user relations (following etc.). The second talk used <a href="http://www.cc.gatech.edu/~lingliu/papers/2012/TingWang-SDM2012.pdf">heat diffusion models to model the diffusion, cascading and propagation of ideas</a>. The researchers were also interested in discovering or predicting the &#8220;tipping point&#8221; (or burst of activity, in their words) of a social phenomenon. Another talk discussed <a href="http://www.cs.uiuc.edu/~hanj/pdf/sdm12_mgupta.pdf">credibility in a social network and how credible and incredible information spreads</a>. This work particularly discussed rumors and fake events such as the untimely death of Justin Bieber. Some of the questions investigated were: how can we filter these fake events out of the timeline? How do such rumors spread? The final talk in this session was a bit of an odd duck: <a href="http://arxiv.org/pdf/1102.3340.pdf">how to build a team using social network analysis</a>. The purpose of that work was to balance skillsets in a team and enhance collaborative compatibility.</div>
<div></div>
<p></p>
<div><strong>Pattern Mining</strong></div>
<div><strong><br />
</strong></div>
<div>The Thursday afternoon session I attended had a very generic name considering all of data mining is about finding patterns. Really, it should have been called &#8220;association rule mining.&#8221; Unfortunately, this session was fairly dry and was my least favorite of the conference. The one talk that really stood out to me discussed how to <a href="http://win.ua.ac.be/~adrem/bibrem/pubs/cule12marbles.pdf">mine association rules out of long temporal events</a>. Such association rules consisted of &#8220;episodes&#8221; which were partial orders on the graph of the event. The types of association rules considered were basically motifs &#8212; subsequences of interesting events that occurred within a long event.</div>
<div></div>
<p></p>
<div><strong>Kernels and Classification</strong></div>
<div><strong><br />
</strong></div>
<div>The first two talks in this session discussed <a href="http://en.wikipedia.org/wiki/Multi-label_classification">multi-label classification</a>, which is distinct from multi-class classification. In multi-class classification, we have multiple classes and each instance can belong to one, and only one class. In multi-label classification, each instance can belong to one or more classes/labels. Multi-label classification exploits correlation information among labels whereas independent classifiers do not. The first talk discussed how to use <a href="http://siam.omnibooksonline.com/2012datamining/data/papers/132.pdf">multi-label classification when there are multiple objectives</a>. For example, when buying a cell phone, we may want to minimize price, and maximize battery life. The second talk discussed <a href="http://users.ics.aalto.fi/gonen/files/gonen_sdm12_paper.pdf">dimension reduction for multi-label classification and coupling feature selection with modeling</a>. Another talk attempted to study the <a href="http://www.di.unipi.it/~ruggieri/Papers/sdm2012.pdf">theoretical principles behind pruning and grafting in decision trees</a>. The <a href="http://www.rulequest.com/Personal/">C4.5</a> software does pruning and grafting, but its theoretical properties are not well understood. The last talk discussed <a href="http://people.ee.duke.edu/~lcarin/kpmf_sdm_final.pdf">augmenting matrix factorization with graph information and other metadata</a> prior to building a model. For example, for a movie recommendation problem, one factor would be a movie and another factor would be a user. These factors can be combined into a Bayesian model that can be scaled up better than other existing methods.</div>
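<div>To make the multi-class versus multi-label distinction concrete, here is a minimal sketch in plain Python. It is only an illustration under my own assumptions, not code from any of the papers: the label names, the toy instances and the helper functions are all hypothetical. A multi-class target activates exactly one class, a multi-label target can activate several, and the co-occurrence counts are one simple form of the label correlation information that independent per-label classifiers never see.</div>

```python
# Hypothetical label set, chosen only for illustration.
LABELS = ["sports", "politics", "technology"]


def multiclass_target(label):
    """One-hot vector: exactly one class is active per instance."""
    if label not in LABELS:
        raise ValueError("unknown label: " + str(label))
    return [1 if name == label else 0 for name in LABELS]


def multilabel_target(labels):
    """Indicator vector: one or more labels may be active per instance."""
    labels = set(labels)
    if not labels or labels - set(LABELS):
        raise ValueError("need at least one known label")
    return [1 if name in labels else 0 for name in LABELS]


def label_cooccurrence(targets):
    """Count how often each pair of distinct labels is active together."""
    n = len(LABELS)
    counts = [[0] * n for _ in range(n)]
    for target in targets:
        for i in range(n):
            for j in range(n):
                if i != j and target[i] and target[j]:
                    counts[i][j] += 1
    return counts


if __name__ == "__main__":
    # Toy multi-label data set (assumed, not taken from the talks).
    toy = [
        multilabel_target(["sports", "technology"]),
        multilabel_target(["sports", "politics", "technology"]),
        multilabel_target(["politics"]),
    ]
    print(multiclass_target("politics"))
    print(label_cooccurrence(toy))
```

<div>In this toy data, &#8220;sports&#8221; and &#8220;technology&#8221; tend to appear together, and that is the kind of signal a multi-label method can exploit when predicting one label given evidence for the other.</div>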
<div></div>
<p></p>
<div><strong>Transfer Learning</strong></div>
<div><strong><br />
</strong></div>
<div>As I mentioned earlier, the goal of transfer learning is to map a model used in one domain to another similar domain. The classic example is classifying images using models trained on text data and some labeled images &#8212; both domains are reduced to a common set of concepts. The talks in this session mainly talked about advances in latent variable analysis. I kept finding myself confused and wondering, &#8220;why is this considered groundbreaking?&#8221; The work presented in this session basically used existing models for transfer learning. The first few talks discussed using <a href="http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation">Latent Dirichlet Allocation (LDA)</a> to map data into concepts, and then the third talk discussed <a href="http://books.nips.cc/papers/files/nips16/NIPS2003_AA03.pdf">Hierarchical Latent Dirichlet Allocation (hLDA)</a> which could be used for taxonomies and hierarchies of concepts. Although Transfer Learning is very useful, I did not find it to be all that groundbreaking. Of course, using text and images as the source and target domains is not incredibly interesting. I think Transfer Learning could be revolutionary if it could be applied to two very different domains.</div>
<div></div>
<p></p>
<div><strong>Full Day Workshop: Text Mining</strong></div>
<div><strong><br />
</strong></div>
<div>Of course, if there is a text mining talk, I will attend it. The workshop was led by David W. Berry from University of Tennessee, Knoxville. The keynote speaker was <a href="http://www.hpl.hp.com/personal/Malu_Castellanos/">Malu Castellanos from Hewlett-Packard Labs</a>. Malu&#8217;s talk was amazing. She discussed a live customer intelligence system that is used for intent and sentiment analysis on various channels. Working with text is not easy. She began with a discussion of the many challenges in sentiment analysis including deceitful adjectives (<em>despicable </em>is negative, but <em>Despicable Me</em> is a proper noun that is not negative), dependency relations (<em>wicked</em> as slang for &#8220;good&#8221; vs. <em>wicked witch</em>), comparisons (<em>x </em>is better than <em>y</em>), spam, sarcasm, coreferences (use of the word <em>it</em>), special expressions and emoticons (LOL, <img src='http://www.bytemining.com/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' /> ), and context dependencies (<em>predictable movie </em>is negative whereas <em>predictable weather </em>may be positive). What was particularly illuminating about Malu&#8217;s talk was that she was fairly candid about how complex HP&#8217;s sentiment analysis system is. The system does not use one model for sentiment. Different models are used to handle different kinds of tweets and, based on their classifications, these tweets are ushered off to other models for further classification. For example, comparative statements are treated distinctly by the system. There may be a naive Bayes step that classifies the text as comparative or not, and then sends the tweet for further processing. She mentioned something about using special processing such as <a href="http://en.wikipedia.org/wiki/Linear_programming">linear programming</a> and <a href="http://en.wikipedia.org/wiki/Generalized_additive_model">generalized additive models (GAM)</a> to take words such as BUT, AND etc. into account. GAMs seem rare to encounter in text mining. Some other features of the system include sentiment intensity (<em>really good</em> vs. <em>good</em>) and clustering similar words by using temporal histograms (<em>tomorrow </em>and <em>2morrow </em>have similar usage patterns).</div>
<p></p>
<div></div>
<div>The first talk was from <a href="http://research.cs.queensu.ca/~skill/">David Skillicorn</a>, who recently published a book about mining large datasets. He discussed how to pick documents out of a corpus that are the most interesting. The second talk was given by a brave undergraduate student on <a href="http://trec.nist.gov/pubs/trec20/papers/Ursinus.legal.update.pdf">query expansion</a>. He did a very good job, but what was strange about this talk was that it used&#8230; <a href="http://en.wikipedia.org/wiki/Latent_semantic_indexing">Latent Semantic Indexing</a> (&#8230;from 1990&#8230;) rather than one of the more useful and iterative models such as LDA. This brings me to my first personal &#8220;weird moment&#8221; about this workshop. There was very little discussion about modern (post 2000) topic models. This is very strange to me. Just a few months earlier, topic models were all the rage at KDD 2011. After the lunch break, there were talks about incremental online clustering of documents and discovery of patent trolls. The final sessions of the afternoon discussed extraction of hierarchies for increasing performance of multi-labeled classifiers and automatically evaluating text summarizers. Only one of the presentations in this workshop seemed to be attached to a paper.</div>
<div></div>
<div>I do not want to be critical because I am sure a lot of work goes into planning such events. I just found this workshop to be a bit weird. A lot of the methods used in the papers were quite old fashioned for text mining (LSI, regression) and the applications were also quite old-school (patents and legal documents just scream the old-fashioned use of information retrieval&#8230; library cataloging). It also seemed like a disproportionate number of the speakers had a prior relationship with the workshop chair. I am also not used to a workshop with so few associated papers.</div>
<div>
<p class="p1"><strong>Concluding Thoughts</strong></p>
<p class="p1">This was a data mining conference so of course I enjoyed it. I must say though that the vibe was very different from some of the other conferences I have attended like KDD and <a href="http://ijcai.org/">IJCAI</a>. Most of the speakers came from overseas, and as someone with hearing loss, it was very difficult to understand many of the speakers. It also seemed like there were very few people just attending the conference. It seemed like the majority of the people at the conference were presenting, or had a poster etc. and that is different from what I am used to. Because of that, I felt like the usual community feel was a bit missing. Additionally, there was no mention of Hadoop or R, and I found that a bit concerning since every other conference I have been to has speakers that are proud to contribute to those open-source projects. And then there was the weird text mining workshop (could have just been an off-year). Could it be because SIAM is a mathematics group? Not sure. All in all, I still had a great time and learned a lot as always.</p>
<p class="p1"><em>Of course, my attendance would not have been possible without sponsorship and support from my company, <a href="http://www.gumgum.com">GumGum</a>. I attended this conference as part of my position as Data Scientist.</em></p>
<p class="p1"><strong>Disneyland</strong></p>
<p class="p1">Of course, the elephant in every room of the conference was the fact that Disneyland was only a 5-10 minute walk away. I got a 2-day park hopper pass and spent my lunch hours and evenings at both Disneyland and California Adventure. It really is the Happiest Place on Earth. Just being there I forget about stress and the things that worry me. I had a great time walking around and watching all the kids have fun. At Disneyland I only went on a few rides: Space Mountain, Pirates of the Caribbean, Haunted Mansion, It&#8217;s a Small World and the Disneyland Railroad (not really a ride). I also got to ride the Monorail for the first time. At California Adventure I only did the California Screamin&#8217; roller coaster and Soarin&#8217; Over California, which features my hometown (the part with the orange orchards). Unfortunately, I missed Tom Sawyer Island again. I will have to go there first next time!</p>
<table>
<tbody>
<tr>
<td><img src="http://www.bytemining.com/wp-content/uploads/2012/05/526086_10101382148210726_2507506_66859310_1965326370_n1.jpg" alt="" width="360" height="270" /></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2012/05/545812_10101382145920316_2507506_66859290_787607657_n.jpg" alt="" width="360" height="270" /></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2012/05/581239_10101382146229696_2507506_66859292_1227525982_n.jpg" alt="" width="270" height="360" /></td>
</tr>
<tr>
<td><em><span style="font-size: small;">The view of California Adventure from my hotel room!</span></em></td>
<td><span style="font-size: small;"><em>A room just for kids.</em></span></td>
<td><span style="font-size: small;"><em>Surfer Goofy at the lobby entrance.</em></span></td>
</tr>
</tbody>
</table>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.bytemining.com/2012/05/siam-data-mining-2012-conference/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>My Interview about the Statistics Major</title>
		<link>http://www.bytemining.com/2012/03/my-interview-about-the-statistics-major/</link>
		<comments>http://www.bytemining.com/2012/03/my-interview-about-the-statistics-major/#comments</comments>
		<pubDate>Fri, 16 Mar 2012 20:23:25 +0000</pubDate>
		<dc:creator>Ryan</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.bytemining.com/?p=1070</guid>
		<description><![CDATA[<p>Recently, I participated in an email interview about what being a Statistics major entailed, how I got interested in the field and the future of Statistics. I figured this might be of interest to those that are contemplating majoring in Statistics, or considering a career in Data Science.</p>

Q1: Why did you decide to pursue a major in statistics in college?
<p>A: &#8220;When I was a kid, I really enjoyed looking at graphs, plots and maps. My parents and I could not make out what was behind the interest. At the same time, I was also heavily interested in education. My mother was a teacher and the first set of statistics I ever encountered were standardized test scores. I strived to understand what the scores attempted to say about me, and why such scores and tests are so trustworthy. When the stakes increased with the AP and SAT exams, I began reading articles published by the Educational Testing Service and learned a ton about how these tests are constructed to minimize bias, and how scores are comparable across forms. It fascinated me how much science goes into these tests, but at the end of the day they are still just one factor [...]]]></description>
			<content:encoded><![CDATA[<p>Recently, I participated in an email interview about what being a Statistics major entailed, how I got interested in the field and the future of Statistics. I figured this might be of interest to those that are contemplating majoring in Statistics, or considering a career in Data Science.</p>
<div></div>
<div><strong>Q1:</strong> Why did you decide to pursue a major in statistics in college?</div>
<p><strong>A: </strong>&#8220;When I was a kid, I really enjoyed looking at graphs, plots and maps. My parents and I could not make out what was behind the interest. At the same time, I was also heavily interested in education. My mother was a teacher and the first set of statistics I ever encountered were standardized test scores. I strived to understand what the scores attempted to say about me, and why such scores and tests are so trustworthy. When the stakes increased with the AP and SAT exams, I began reading articles published by the Educational Testing Service and learned a ton about how these tests are constructed to minimize bias, and how scores are comparable across forms. It fascinated me how much science goes into these tests, but at the end of the day they are still just one factor in the whole picture of a student. This niche interest led me to statistics, psychometrics in particular, and although I no longer study psychometrics, I found what I learned to be incredibly valuable.&#8221;</p>
<p>&nbsp;</p>
<p><strong>Q2: </strong>&#8220;I noticed you have bachelor&#8217;s, master&#8217;s, and doctoral degrees in statistics. How did your graduate study build on what you learned in your undergraduate program?&#8221;</p>
<p><strong>A: </strong>&#8220;For me, the undergraduate and graduate programs were night and day. The undergraduate program focused more on modeling and data analysis. The graduate program focused more on thinking about data and how to develop a scientific &#8220;common sense&#8221; about how to work with, express and make automated decisions based on data. The graduate program was much more mathematically and computationally intensive than the undergraduate major. My graduate study actually built more on my mathematics major in college because many of the concepts in graduate statistics require knowledge of linear algebra, numerical analysis and real analysis. Fortunately, our statistics major requires upper division math courses.&#8221;</p>
<p>&nbsp;</p>
<p><strong>Q3: </strong>What was the most interesting part of majoring in statistics? What did you find most challenging?</p>
<p><strong>A: </strong>&#8220;The most interesting part of majoring in statistics was seeing how many fields can grow and transform based on insights from data and statistics. In my case, I found it most interesting seeing how it integrates and interacts with computer science. Every time someone surfs through Facebook, enters a Google search, or looks at an item on Amazon, data about what they are doing are collected and algorithms process these data to enrich the experience by, say, recommending books and offering special deals on Amazon, recommending friends and showing relevant stories on Facebook, and the most groundbreaking of all: returning relevant search results.</p>
<p>The most challenging part for me was the mathematical theory. Although I loved math, I sometimes had trouble connecting the theory to the application, and statistics is such an applied field. I look at it as a rite of passage and once I saw enough theory relevant to my interests, the learning process became easier.&#8221;</p>
<p>&nbsp;</p>
<p><strong>Q4: </strong>How do you apply what you learned in your statistics education in your current line of work?</p>
<p><strong>A: </strong>&#8220;It is ironic, but it is the more basic concepts of statistics and probability that I use every day rather than the complicated models I learned. Concepts such as independence, confidence, power, accuracy etc. are important building blocks for building my own models, or for choosing an existing one from those that I learned in school.</p>
<p>I always start with some exploratory analysis such as computing some statistics and making plots that show relationships clearly. Then I set explicit guidelines for the input and output of the model I want to build and note any critical assumptions that are violated or that must be met. I then try several different methods and models and validate their results using common metrics taught in undergraduate statistics before settling on a final model configuration.&#8221;</p>
<p>&nbsp;</p>
<p><strong>Q5: </strong>What skills did you learn in the statistics major that you find useful for work and everyday life?</p>
<p><strong>A: </strong>&#8220;The training in mathematics I received as part of the statistics major taught me how to think logically, and this is very important in my work in computer science. I think patience was another very important skill I learned. I love what I do, and sometimes I take for granted that others have the same mathematical training that I do because I am so entrenched in it. Through my experience teaching as well as consulting as a student, I gained a better sense of the challenges and difficulties many people face when thinking about and interpreting statistics and how to better communicate results and ideas.&#8221;</p>
<p>&nbsp;</p>
<p><strong>Q6: </strong>Any advice for students who are considering majoring in statistics?</p>
<p><strong>A: </strong>&#8220;My advice for students majoring in statistics is to choose an additional major or minor that uses statistics and is of interest to the student. I do not consider statistics to be a &#8220;standalone&#8221; major. When interviewing for a job, employers want to know why an interviewee is passionate about their company. For example, if interviewing for a finance company, the company wants to hear about passion for finance, and see education or experience in such fields. Another way to accomplish this instead of double majoring is to do some internships, projects or research in a field of interest.&#8221;</p>
<p>&nbsp;</p>
<p><strong>Conclusion: </strong>Finally, could you tell me a little about yourself for an intro bio we will include before the Q&amp;A interview? For instance, what university(ies) did you attend, what degree(s) have you earned, what is your current job title, where do you work and for how long&nbsp;(you can be general here, or include a link to your professional website or blog if you have one), and what are your career goals?</p>
<p><strong>A: </strong>&#8220;I attended University of California, Los Angeles (UCLA) for my B.S. (Statistics, Mathematics of Computation), two M.S. (Statistics and Computer Science) and Ph.D. (Statistics). I currently work for an Internet advertising startup in Santa Monica, CA as Chief Data Scientist/Research Engineer, and have been working in the field for three years. Whenever I get a free moment, I write about statistics, data mining and computer science topics on my blog at <a class="moz-txt-link-freetext" href="http://www.bytemining.com/">http://www.bytemining.com</a>. I plan on dedicating the rest of my life to working with and communicating about data and turning online phenomena into knowledge that can be used to progress technology and change the world!&#8221;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bytemining.com/2012/03/my-interview-about-the-statistics-major/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>&#8220;Hold Only That Pair of 2s?&#8221; Studying a Video Poker Hand with R</title>
		<link>http://www.bytemining.com/2012/01/hold-only-that-pair-of-2s-studying-a-video-poker-hand-with-r/</link>
		<comments>http://www.bytemining.com/2012/01/hold-only-that-pair-of-2s-studying-a-video-poker-hand-with-r/#comments</comments>
		<pubDate>Sun, 08 Jan 2012 09:32:00 +0000</pubDate>
		<dc:creator>Ryan</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[Statistics and Statistical Computing]]></category>

		<guid isPermaLink="false">http://www.bytemining.com/?p=1013</guid>
		<description><![CDATA[<p>Whenever I tell people in my family that I study Statistics, one of the first questions I get from laypeople is &#8220;do you count cards?&#8221; A blank look comes over their face when I say &#8220;no.&#8221;</p>
<p>Look, if I am at a casino, I am well aware that the odds are against me, so why even try to think that I can use statistics to make money in this way? Although I love numbers and math, the stuff flows through my brain all day long (and night long), every day. If the goal is to enjoy and have fun, I do not want to sit there crunching probability formulas in my head (yes that&#8217;s fun, but it is also work). So that leaves me at the video Poker machines enjoying the free drinks. Another positive about video Poker is that $20 can sometimes last a few hours.&#160;So it should be no surprise that I do not agree with using Poker to teach probability. &#160;Poker is an extremely superficial way to introduce such a powerful tool and gives the impression that probability is a way to make a quick buck, rather than as an important tool in science and society. The only [...]]]></description>
			<content:encoded><![CDATA[<!-- Start Shareaholic LikeButtonSetTop Automatic --><!-- End Shareaholic LikeButtonSetTop Automatic --><p>Whenever I tell people in my family that I study Statistics, one of the first questions I get from laypeople is &#8220;do you count cards?&#8221; A blank look comes over their face when I say &#8220;no.&#8221;</p>
<p>Look, if I am at a casino, I am well aware that the odds are against me, so why even try to think that I can use statistics to make money in this way? Although I love numbers and math, the stuff flows through my brain all day long (and night long), every day. If the goal is to enjoy and have fun, I do not want to sit there crunching probability formulas in my head (yes that&#8217;s fun, but it is also work). So that leaves me at the video Poker machines enjoying the free drinks. Another positive about video Poker is that $20 can sometimes last a few hours.&nbsp;So it should be no surprise that I do not agree with using Poker to teach probability. &nbsp;Poker is an extremely superficial way to introduce such a powerful tool and gives the impression that probability is a way to make a quick buck, rather than as an important tool in science and society. The only time that I have used Poker in teaching (besides when required), is to cover the hypergeometric distribution and sampling without replacement.</p>
<p>Since I took Intro Probability Theory, I have always wondered what to do in the following situation. Say a pair of cruddy low cards appear on the first draw. The game only awards money for pairs of jacks or better. If all I have in the hand is a pair of low cards and no&nbsp;face cards, my decision is easy: hold the pair of low cards. But what if there is at least one face card showing (no other pairs)? Pictorially this looks like</p>
<div><img style="vertical-align: middle;" src="http://www.bytemining.com/wp-content/uploads/2012/01/200px-Playing_card_club_2.svg_.png" alt="" width="200" height="250" /><img style="vertical-align: middle;" src="http://www.bytemining.com/wp-content/uploads/2012/01/200px-Playing_card_club_5.svg_.png" alt="" width="200" height="250" /><img style="vertical-align: middle;" src="http://www.bytemining.com/wp-content/uploads/2012/01/200px-Playing_card_spade_J.svg_.png" alt="" width="200" height="250" /><img style="vertical-align: middle;" src="http://www.bytemining.com/wp-content/uploads/2012/01/200px-Playing_card_diamond_2.svg_.png" alt="" width="200" height="250" /><img style="vertical-align: middle;" src="http://www.bytemining.com/wp-content/uploads/2012/01/200px-Playing_card_diamond_10.svg_.png" alt="" width="200" height="250" /></div>
<p>The conundrum:</p>
<ol>
<li>Hold the two low cards and deal, hoping for a three of a kind, or</li>
<li>Hold the two low cards AND one of the face cards, hoping for a three of a kind, OR a pair of Jacks or Better.</li>
</ol>
<p>Under each of these decisions, which yields the highest probability of winning <em>something</em> and which one yields the highest payout? This problem can be solved exactly by using combinatorics, conditional probability and expectation, but since a video poker game is basically a simulator (though likely biased), I wrote my own simulation. <strong>For the answer, scroll to the end!</strong></p>
<p><strong>Data Structure</strong></p>
<p>In most card games, we would want to store the state of the game: the outstanding cards in the deck(s), and the hand(s) of each player. In standard video poker, there is one deck, and one player, so only the player&#8217;s hand needs to be recorded because every card in the deck is either in the hand, or it is not. One obvious way to represent the hand is as an array of denomination/suit tuples. Unfortunately, this data structure requires other data structures to store the possible suits, and possible denominations. It is also more tedious to detect certain kinds of wins. For this simulation, I use a 13 x 4 matrix where each row is a different denomination, and each column is one of the four suits. This matrix allows us to easily see which cards can still be dealt. Additionally, this matrix, as well as vector-based languages such as R, make&nbsp;it easy to detect wins. Such a matrix looks like the following for the hand <strong>2</strong><strong style="font-family: sans-serif; font-size: 13px; line-height: 19px; background-color: #ffffff;"><span class="spades">&spades;</span>&nbsp;<span class="clubs">5&clubs;</span>&nbsp;<span class="hearts" style="color: red;">8&hearts;</span>&nbsp;<span class="clubs">8&clubs;</span>&nbsp;<span class="diamonds" style="color: red;">A&diams;</span></strong></p>
<div><img style="display: block; margin-left: auto; margin-right: auto;" src="http://www.bytemining.com/wp-content/uploads/2012/01/matrix.png" alt="" width="220" height="303" />&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;</div>
<div><img style="vertical-align: baseline; display: block; margin-left: auto; margin-right: auto;" src="http://www.bytemining.com/wp-content/uploads/2012/01/matrix2.png" alt="" width="153" height="41" /></div>
<div>where <em>Cij&nbsp;</em>denotes a card,<em>&nbsp;i </em>is the denomination <img src='http://s.wordpress.com/latex.php?latex=i%20%5Cin%20%5C%7B%202%2C%20%5Cldots%2C%2010%5C%7D%20%5Ccup%20%5C%7BJ%2C%20Q%2C%20K%2C%20A%5C%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='i \in \{ 2, \ldots, 10\} \cup \{J, Q, K, A\}' title='i \in \{ 2, \ldots, 10\} \cup \{J, Q, K, A\}' class='latex' /> and <em>j </em>is the suit <img src='http://s.wordpress.com/latex.php?latex=j%20%5Cin%20%5C%7B%5Cheartsuit%2C%20%5Cdiamondsuit%2C%20%5Cspadesuit%2C%20%5Cclubsuit%20%5C%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='j \in \{\heartsuit, \diamondsuit, \spadesuit, \clubsuit \}' title='j \in \{\heartsuit, \diamondsuit, \spadesuit, \clubsuit \}' class='latex' /> and <em>H</em>&nbsp;is the player&#8217;s hand in question.</div>
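<p>As a rough illustration of this data structure, here is a minimal Python sketch (the simulation itself is in R; the names <code>DENOMINATIONS</code>, <code>SUITS</code>, <code>hand_matrix</code>, <code>row_sums</code> and <code>col_sums</code> are hypothetical and chosen only for this example). It builds the 13 x 4 indicator matrix for the hand above and computes the row and column sums that the win checks rely on.</p>

```python
# Hypothetical names for illustration; rows are denominations 2..A, columns are suits.
DENOMINATIONS = ["2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "A"]
SUITS = ["hearts", "diamonds", "spades", "clubs"]


def hand_matrix(cards):
    """Build a 13 x 4 matrix with a 1 for every (denomination, suit) card in the hand."""
    matrix = [[0] * len(SUITS) for _ in DENOMINATIONS]
    for denomination, suit in cards:
        matrix[DENOMINATIONS.index(denomination)][SUITS.index(suit)] = 1
    return matrix


def row_sums(matrix):
    """Number of cards held of each denomination."""
    return [sum(row) for row in matrix]


def col_sums(matrix):
    """Number of cards held of each suit."""
    return [sum(col) for col in zip(*matrix)]


hand = hand_matrix([("2", "spades"), ("5", "clubs"), ("8", "hearts"),
                    ("8", "clubs"), ("A", "diamonds")])
print(row_sums(hand))  # the 2 in the row for the 8s marks the pair
print(col_sums(hand))
```

<p>With this layout, pairs and three-of-a-kinds show up in the row sums and flushes in the column sums, so no separate lookup structures for suits or denominations are needed beyond the two label lists.</p>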
<div>
<p><strong>Detecting Wins</strong></p></div>
<p>Poker wins are not disjoint. A three of a kind involving Jacks is also a pair of Jacks or better, etc. When checking wins, I start with the lowest paying win, and move up to Royal Flush, only keeping track of the highest win. Thus, this algorithm detects a four-of-a-kind involving Queens as Jacks or Better, two pairs of Queens, and a three-of-a-kind of Queens, but only counts it as the highest win, the four-of-a-kind.</p>
<ol>
<li><em>Pair of Jacks or Better</em>: a pair of Jacks, Queens, Kings or Aces. In <strong>A</strong>, this is simply the condition that at least one row in rows 10 through 13 has a row sum greater than 1.</li>
<li><em>Two pair</em>: two pairs of anything. In <strong>A</strong>, this is the condition that at least two rows have a sum greater than 1.</li>
<li><em>Three of a kind</em>: three of any card. In <strong>A</strong>, this is the condition that at least one row has a sum of at least 3.</li>
<li><em>Straight</em>: all 5 cards can be permuted such that they form an ascending sequence: A, 2, 3, 4, 5, 6, 7, 8, 9, 10, J, Q, K, A. This case is interesting and will be discussed in a bit.</li>
<li><em>Flush</em>: all 5 cards are of the same suit. In <strong>A</strong>, this is the condition that at least one column has a sum of at least 5.</li>
<li><em>Full House</em>: one three-of-a-kind, and a pair of anything. In <strong>A</strong>, this is the condition that a row has sum 3, and another row has sum 2.</li>
<li><em>Four of a Kind</em>: 4 of any card. In <strong>A</strong>, this is the condition that a row has sum 4.</li>
<li><em>Straight Flush</em>: the 5 cards can be permuted to form an ascending sequence and are all of the same suit. In <strong>A</strong>, this is simply the condition that we have a straight and a flush in the same hand.</li>
<li><em>Royal Flush</em>: a straight flush with the Ace as the high card. In <strong>A</strong>, this is simply the condition that we have a straight flush AND the sum of row 13 is 1.</li>
</ol>
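<p>The conditions in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the simulation: <code>highest_win</code> and <code>matrix_from</code> are hypothetical names, rows are indexed from 0 (so rows 10 through 13 in the text are indices 9 through 12), the straight check is done iteratively, the ace is allowed to play low in A-2-3-4-5, and the royal flush test additionally requires the 10 so that an ace-low straight flush is not miscounted.</p>

```python
def highest_win(matrix):
    """Return the best win in a 13 x 4 hand matrix (rows 2..A, columns suits), or None."""
    rows = [sum(row) for row in matrix]        # cards held per denomination
    cols = [sum(col) for col in zip(*matrix)]  # cards held per suit

    # Five consecutive denominations, plus the ace-low run A-2-3-4-5 (assumed rule).
    runs = [list(range(start, start + 5)) for start in range(9)]
    runs.append([12, 0, 1, 2, 3])
    straight = any(all(rows[i] == 1 for i in run) for run in runs)
    flush = max(cols) == 5
    pairs = sum(1 for count in rows if count >= 2)

    best = None
    # Walk from the lowest-paying win to the highest; later hits overwrite earlier ones.
    if max(rows[9:]) >= 2:
        best = "jacks or better"
    if pairs >= 2:
        best = "two pair"
    if max(rows) >= 3:
        best = "three of a kind"
    if straight:
        best = "straight"
    if flush:
        best = "flush"
    if 3 in rows and 2 in rows:
        best = "full house"
    if max(rows) == 4:
        best = "four of a kind"
    if straight and flush:
        best = "straight flush"
    if straight and flush and rows[12] == 1 and rows[8] == 1:
        best = "royal flush"
    return best


def matrix_from(cards):
    """Hypothetical helper: build the matrix from (row, column) index pairs."""
    matrix = [[0] * 4 for _ in range(13)]
    for row, col in cards:
        matrix[row][col] = 1
    return matrix


# Four Queens plus a 2: passes the Jacks or Better and three-of-a-kind checks on the way up.
print(highest_win(matrix_from([(10, 0), (10, 1), (10, 2), (10, 3), (0, 0)])))
```

<p>Because each check simply overwrites the previous result, the order of the checks is what encodes the payout ranking.</p>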
<div>Of course, this &#8220;short circuit logic&#8221; only works for a game containing 5 cards. Also, note that under my scenario (a pair of low cards is dealt first and held), it is never possible to have a straight, flush, royal flush, or straight flush as the highest win. It is also not possible to have Jacks or Better as the highest win because we already hold one pair (the low cards); if we happen to draw a pair of Jacks or Better, the highest win becomes two pair.</div>
<div><em>Detecting the Straight:&nbsp;</em>In <strong>A</strong>, we have a straight when five successive rows have sum equal to 1. We can check this iteratively, but there is a better way. Note that if all of the row sums are 0 or 1, we can treat the vector of row sums as a binary number and convert it to its integer representation. Each binary number has 13 bits. If we let the 2 card correspond to the zeroth power of 2, then straights lead to the following binary and integer representations:</div>
<div></div>
<div><img style="display: block; margin-left: auto; margin-right: auto;" src="http://www.bytemining.com/wp-content/uploads/2012/01/matrix3.png" alt="" width="614" height="125" /></div>
<div></div>
<p>
<strong>Bug alert:</strong> It just occurred to me that there are many more wrap-around straights such as <em>Q, K, A, 2, 3</em>. This will be fixed this evening.<br />
</p>
<div>From basic computer science and number theory, every natural number can be written as the sum of distinct powers of 2, and the representation of such an integer is unique. Furthermore, the sum of <em>n </em>successive powers of 2 is divisible by <img src='http://s.wordpress.com/latex.php?latex=2%5En%20-%201&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='2^n - 1' title='2^n - 1' class='latex' />. After some experimentation I came up with the following rule: if all of the row sums are 0/1 and the integer representation of this binary vector is divisible by <img src='http://s.wordpress.com/latex.php?latex=%5Cfrac%7B2%5E5-1%7D%7B2%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\frac{2^5-1}{2}' title='\frac{2^5-1}{2}' class='latex' />, then <strong>A </strong>is a straight. The only straight that does not fit this pattern is the wrap-around straight: J, Q, K, A, 2 which can be checked manually.</div>
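<p>Here is a small Python sketch of the binary trick, under a few assumptions of my own: the 2 card maps to bit 0 and the ace to bit 12, the divisor is taken to be 2^5 - 1 = 31, and only the ace-low run A-2-3-4-5 is treated as a special case (other wrap-around straights are not handled). One caveat worth flagging: divisibility by 31 alone is not quite enough, because the hand 2, 4, 5, 6, 8 encodes to 93 = 3 x 31 without being a straight. A run of five adjacent bits is always 31 times a power of two, so the sketch checks the quotient as well. The names <code>straight_code</code>, <code>is_straight</code> and <code>rows_for</code> are hypothetical.</p>

```python
POWERS_OF_TWO = {2 ** k for k in range(9)}  # lowest card of a run can be 2 through 10


def straight_code(rows):
    """Read the 13 row sums as bits: the 2 card is 2**0, the ace is 2**12."""
    return sum(2 ** k for k, count in enumerate(rows) if count == 1)


def is_straight(rows):
    if sum(rows) != 5 or max(rows) != 1:
        return False                      # a straight needs five distinct denominations
    code = straight_code(rows)
    if code == 2 ** 12 + 0b1111:          # A-2-3-4-5 with the ace low (assumed rule)
        return True
    quotient, remainder = divmod(code, 2 ** 5 - 1)
    # Five adjacent set bits give 31 * 2**k, so the quotient must be a power of two.
    return remainder == 0 and quotient in POWERS_OF_TWO


def rows_for(*ranks):
    """Hypothetical helper: row sums for one card of each given rank index (0 is the 2)."""
    return [1 if k in ranks else 0 for k in range(13)]


print(is_straight(rows_for(0, 1, 2, 3, 4)))  # 2-3-4-5-6
print(is_straight(rows_for(0, 2, 3, 4, 6)))  # 2-4-5-6-8, code 93 = 3 * 31
```

<p>The extra quotient test costs one set lookup and avoids iterating over the rows, which keeps the spirit of the shortcut.</p>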
<div></div>
<div><strong>The Algorithm</strong></div>
<div>
<ol>
<li>Randomly generate a hand containing a pair of low cards (2-10) and at least one face card.</li>
<li>Hold the pair of low cards. Under strategy 2, hold one (and only one) of the face cards.</li>
<li>Discard the unheld cards from the deck and draw 2 or 3 cards at random from the same deck.</li>
<li>Check for wins.</li>
<li>Increment a win counter if the hand wins.</li>
<li>Repeat steps 1-5 tons of times, recording the percentage of hands that yielded a win, of the <em>n </em>games/hands played.</li>
</ol>
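<p>The steps above can be sketched in Python as follows. This is a simplified illustration, not the R code used for the results: the names <code>DECK</code>, <code>HIGH</code>, <code>deal_target_hand</code>, <code>is_win</code>, <code>play_hand</code> and <code>win_rate</code> are hypothetical, the ace is assumed to count as a &#8220;face card&#8221; (a Jacks or Better card), the starting hand is assumed to contain no pair other than the low pair, and the win check is reduced to the wins that are reachable once a pair is held.</p>

```python
import random
from collections import Counter

DECK = [(rank, suit) for rank in range(13) for suit in range(4)]  # rank 0 is the 2, rank 12 the ace
HIGH = {9, 10, 11, 12}  # J, Q, K, A (assumption: the ace counts as a "face card" here)


def deal_target_hand(rng):
    """Step 1: redeal until the hand is one low pair plus three odd cards, at least one high."""
    while True:
        hand = rng.sample(DECK, 5)
        counts = Counter(rank for rank, _ in hand)
        pair_ranks = [rank for rank, n in counts.items() if n == 2]
        if len(counts) == 4 and pair_ranks[0] not in HIGH and any(r in HIGH for r in counts):
            return hand


def is_win(cards):
    """Step 4, reduced to wins reachable with a held pair (no straights or flushes)."""
    counts = Counter(rank for rank, _ in cards)
    pair_ranks = [rank for rank, n in counts.items() if n >= 2]
    return (max(counts.values()) >= 3 or len(pair_ranks) >= 2
            or any(rank in HIGH for rank in pair_ranks))


def play_hand(rng, hold_high):
    hand = deal_target_hand(rng)
    counts = Counter(rank for rank, _ in hand)
    held = [card for card in hand if counts[card[0]] == 2]            # step 2: the low pair
    if hold_high:
        held.append(next(card for card in hand if card[0] in HIGH))   # strategy 2: one high card
    stub = [card for card in DECK if card not in hand]                # step 3: discards stay out
    return is_win(held + rng.sample(stub, 5 - len(held)))


def win_rate(n_hands, hold_high, seed=0):
    """Steps 5 and 6: fraction of n_hands that end in a win."""
    rng = random.Random(seed)
    return sum(play_hand(rng, hold_high) for _ in range(n_hands)) / n_hands


print("hold pair only:       ", win_rate(2000, hold_high=False))
print("hold pair + high card:", win_rate(2000, hold_high=True))
```

<p>Rejection sampling in <code>deal_target_hand</code> is wasteful but keeps the sketch short; constructing the starting hand directly would be faster for large runs.</p>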
<p><strong>Results: Hold the Pair of Low Cards <em>Only</em></strong></p>
<p>My usual strategy is to always hold the low pair and take one face card along for the ride. That way, I hopefully match one of the two denominations I hold. My parents on the other hand, always told me to hold the low pair only, because that gives one more card (degree of freedom) for a win. It turns out they were right. Each game consisted of 1,000 hands. A percentage of these hands yields a win. This percentage is a random variable, so I ran this simulation to play 1,000 games. The table below shows the distribution of the win percentages.</p>
<p style="text-align: left;">
<div align="center"><img src="http://www.bytemining.com/wp-content/uploads/2012/01/pokertable.png" alt="" width="805" height="145" /></div>
<p><em>Note that under strategy 1 (hold low pair only), <span style="text-decoration: underline;">all</span>&nbsp;wins are more likely than under strategy 2! </em>Of course, the estimate in the last column is an average; the mean in this case. The plot below shows the distribution of win percentages for both strategies.&nbsp;</p>
<p><img style="display: block; margin-left: auto; margin-right: auto;" src="http://www.bytemining.com/wp-content/uploads/2012/01/pokersim1.png" alt="" width="600" height="400" /></p>
<p><strong>The Code</strong></p>
<p>The code for my simulation is below. Note that it can easily be modified for your own target hands of interest. In my simulation, certain functions were never used because certain winning hands were not possible.&nbsp;</p>
<p><script src="https://gist.github.com/1608866.js"> </script></p>
<p><strong>DISCLAIMER: </strong>I did this for fun, and it is possible that there are bugs or problems with my code, algorithm or simulation. The results seem correct because, empirically, I seem to do about the same using either strategy, and from a gambling perspective, an 8% discrepancy is not likely to set off bells in the head.</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.bytemining.com/2012/01/hold-only-that-pair-of-2s-studying-a-video-poker-hand-with-r/feed/</wfw:commentRss>
		<slash:comments>13</slash:comments>
		</item>
		<item>
		<title>Merry Christmas 2011 From Byte Mining!</title>
		<link>http://www.bytemining.com/2011/12/merry-christmas-2011-from-byte-mining/</link>
		<comments>http://www.bytemining.com/2011/12/merry-christmas-2011-from-byte-mining/#comments</comments>
		<pubDate>Sat, 24 Dec 2011 19:28:44 +0000</pubDate>
		<dc:creator>Ryan</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.bytemining.com/?p=1009</guid>
		<description><![CDATA[</p>
To all of my readers and followers, I wish you a very Merry Christmas and a very joyous and safe Happy New Year! This year, I am thankful for the community that has sprung up around Data Science and open-source data collection and processing. This blog is almost two years old, and like with Twitter, I have been able to communicate with many data scientists, enthusiasts and some of the most prolific contributors to the data science software community. I am thankful for all of the wonderful people I have met and have yet to meet, and for your comments and reading.  


]]></description>
			<content:encoded><![CDATA[<!-- Start Shareaholic LikeButtonSetTop Automatic --><!-- End Shareaholic LikeButtonSetTop Automatic --><div align="center"><img src="http://www.bytemining.com/wp-content/uploads/2010/12/merry_christmas.jpg" alt="" /></p>
<div align="left">To all of my readers and followers, I wish you a very Merry Christmas and a very joyous and safe Happy New Year! This year, I am thankful for the community that has sprung up around Data Science and open-source data collection and processing. This blog is almost two years old, and like with Twitter, I have been able to communicate with many data scientists, enthusiasts and some of the most prolific contributors to the data science software community. I am thankful for all of the wonderful people I have met and have yet to meet, and for your comments and reading. <img src='http://www.bytemining.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> 
</div>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.bytemining.com/2011/12/merry-christmas-2011-from-byte-mining/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Parsing Wikipedia Articles: Wikipedia Extractor and Cloud9</title>
		<link>http://www.bytemining.com/2011/11/parsing-wikipedia-articles-wikipedia-extractor-and-cloud9/</link>
		<comments>http://www.bytemining.com/2011/11/parsing-wikipedia-articles-wikipedia-extractor-and-cloud9/#comments</comments>
		<pubDate>Mon, 28 Nov 2011 19:00:00 +0000</pubDate>
		<dc:creator>Ryan</dc:creator>
				<category><![CDATA[Amazon EC2]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Natural Language Processing]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Web Mining]]></category>

		<guid isPermaLink="false">http://www.bytemining.com/?p=947</guid>
		<description><![CDATA[<p>



Lately I have been doing a lot of work with the Wikipedia XML dump as a corpus. Wikipedia provides a wealth of information to researchers in easy-to-access formats including XML, SQL and HTML dumps for all language properties. Some of the data freely available from the Wikimedia Foundation include

article content and template pages
article content with revision history (huge files)
article content including user pages and talk pages
redirect graph
page-to-page link lists: redirects, categories, image links, page links, interwiki etc.
image metadata
site statistics




<p>The above resources are available not only for Wikipedia, but for other Wikimedia Foundation projects such as Wiktionary, Wikibooks and Wikiquote.</p>
<p>As Wikipedia readers will notice, the articles are very well formatted and this formatting is generated by a somewhat unusual markup format defined by the MediaWiki project. As Dirk Riehle stated:</p>
<p>There was no grammar, no defined processing rules, and no defined output like a DOM tree based on a well defined document object model. This is to say, the content of Wikipedia is stored in a format that is not an open standard. The format is defined by 5000 lines of php code (the parse function of MediaWiki). That code may be open source, but it is incomprehensible to most. That&#8217;s why [...]]]></description>
			<content:encoded><![CDATA[<!-- Start Shareaholic LikeButtonSetTop Automatic --><!-- End Shareaholic LikeButtonSetTop Automatic --><p><table>
<tr>
<td>
<img src="http://www.bytemining.com/wp-content/uploads/2011/11/Wikipedia-logo.png" alt="" width="100" height="100" /></td>
<td>Lately I have been doing a lot of work with the <a href="http://dumps.wikimedia.org/enwiki/">Wikipedia XML dump</a> as a corpus. Wikipedia provides a wealth of information to researchers in easy-to-access formats including XML, SQL and HTML dumps for all language properties. Some of the data freely available from the Wikimedia Foundation include
<ul>
<li>article content and template pages</li>
<li>article content with revision history (huge files)</li>
<li>article content including user pages and talk pages</li>
<li>redirect graph</li>
<li>page-to-page link lists: redirects, categories, image links, page links, interwiki etc.</li>
<li>image metadata</li>
<li>site statistics</li>
</ul>
</td>
</tr>
</table>
<p>The above resources are available not only for Wikipedia, but for other <a href="http://www.wikimedia.org">Wikimedia Foundation</a> projects such as <a href="http://www.wiktionary.org">Wiktionary</a>, <a href="http://www.wikibooks.org">Wikibooks</a> and <a href="http://www.wikiquotes.org">Wikiquote</a>.</p>
<p>As Wikipedia readers will notice, the articles are very well formatted and this formatting is generated by a somewhat unusual markup format defined by the <a href="http://www.mediawiki.org">MediaWiki</a> project. As <a href="http://dirkriehle.com/2011/05/01/the-parser-that-cracked-the-mediawiki-code/">Dirk Riehle</a> stated:</p>
<blockquote><p>There was no grammar, no defined processing rules, and no defined output like a DOM tree based on a well defined document object model. This is to say, the content of Wikipedia is stored in a format that is not an open standard. The format is defined by 5000 lines of php code (the parse function of MediaWiki). That code may be open source, but it is incomprehensible to most. That&rsquo;s why there are 30+ failed attempts at writing alternative parsers.</p>
</blockquote>
<p>For example, below is an excerpt of Wiki-syntax for a page on data mining.</p>
<pre class="brush: plain; title: ; notranslate">
'''Data mining''' (the analysis step of the '''knowledge discovery in databases''' process,&lt;ref name=&quot;Fayyad&quot;&gt; or KDD),
a relatively young and interdisciplinary field of [[computer science]]&lt;ref name=&quot;acm&quot; /&gt;
{{cite web|url=http://www.sigkdd.org/curriculum.php |title=Data Mining Curriculum |
publisher=[[Association for Computing Machinery|ACM]] [[SIGKDD]] |date=2006-04-30 |accessdate=2011-10-28}}
&lt;/ref&gt;&lt;ref name=brittanica&gt;{{cite web | last = Clifton | first = Christopher | title = Encyclopedia Britannica: Definition
of Data Mining | year = 2010 | url = http://www.britannica.com/EBchecked/topic/1056150/data-mining |
accessdate = 2010-12-09}}&lt;/ref&gt; is the process of discovering new patterns from large [[data set]]s
involving methods at the intersection of [[artificial intelligence]], [[machine learning]], [[statistics]] and
[[database system]]s.&lt;ref name=&quot;acm&quot;&gt; The goal of data mining is to extract knowledge from a data set in a
human-understandable structure&lt;ref name=&quot;acm&quot; /&gt; and involves database and [[data management]],
[[Data Pre-processing|data preprocessing]], [[statistical model|model]] and [[Statistical inference|inference]]
considerations, interestingness metrics, [[Computational complexity theory|complexity]] considerations, post-processing
of found structure, [[Data visualization|visualization]] and [[Online algorithm|online updating]].&lt;ref name=&quot;acm&quot; /&gt;
</pre>
<p>I was epically worried that I would spend weeks writing my own parser and never complete the project I am working on at work. To my surprise, I found a fairly good parser. Since I am working on <a href="http://en.wikipedia.org/wiki/Named-entity_recognition">named entity extraction</a> and <a href="http://en.wikipedia.org/wiki/Collocation"><em>n</em>-gram extraction</a>, I wanted to extract only the plain text. If we take the above junk and extract only the plain text, we would get&nbsp;</p>
<pre class="brush: plain; title: ; notranslate">
Data mining (the analysis step of the knowledge discovery in databases process, or KDD), a relatively young
and interdisciplinary field of computer science is the process of discovering new patterns from large data sets
involving methods at the intersection of artificial intelligence, machine learning, statistics and database systems.
The goal of data mining is to extract knowledge from a data set in a human-understandable structure and involves
database and data management, data preprocessing, model and inference considerations, interestingness
metrics, complexity considerations, post-processing of found structure, visualization and online updating.
</pre>
<p>and from this we can remove punctuation (except sentence terminators .?!), convert to lower case and perform other pre-processing text mining steps. There are many, many Wikipedia parsers of various qualities. Some do not work at all, some work only on certain articles, some have been abandoned as incomplete and some are slow as molasses.</p>
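<p>As a minimal sketch of that pre-processing step, using only Python&#8217;s standard <code>re</code> module: the function names <code>preprocess</code> and <code>ngrams</code> are hypothetical, and the choice to replace punctuation with a space (so that &#8220;human-understandable&#8221; becomes two tokens) is an assumption rather than a requirement.</p>

```python
import re


def preprocess(text):
    """Lower-case the text and drop punctuation other than the sentence terminators . ? !"""
    text = text.lower()
    text = re.sub(r"[^\w\s.?!]", " ", text)   # any other punctuation becomes a space
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


clean = preprocess("Data mining (the analysis step of the knowledge discovery "
                   "in databases process, or KDD), a relatively young field.")
print(clean)
print(ngrams(clean.split(), 2)[:3])
```

<p>Keeping the sentence terminators makes it possible to split on sentence boundaries later, for example to avoid counting <em>n</em>-grams that span two sentences.</p>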
<p>I was delighted to stumble upon <a href="http://medialab.di.unipi.it/wiki/Wikipedia_Extractor">Wikipedia Extractor</a>, a Python library developed by <a href="http://www.cli.di.unipi.it/~fuschett/">Antonio Fuschetto</a>, <a href="http://www.cli.di.unipi.it/">Multimedia Laboratory, Dipartimento di Informatica, Universit&agrave; di Pisa</a>, that extracts plain text from the Wikipedia XML dump file. The script is heavily object-oriented, and it is very easy to modify and extend for other purposes. For me, it is the easiest parser to use and yields the best-quality output, although there are other options.</p>
<p><strong>Pros</strong></p>
<ul>
<li>Very easy to run; it&#8217;s just a Python script.</li>
<li>Yields high quality output; no stray wikisyntax garbage.</li>
<li>Highly object-oriented; easy to extend and embed in text mining projects.</li>
<li>Object-oriented style makes it easier to parallelize with lightweight processes (written by the user).</li>
<li>Allows specifying the maximum size of each produced file (good for sending to S3).</li>
<li>It is written in Python.</li>
</ul>
<p><strong>Cons</strong></p>
<ul>
<li>Far too slow. Python profilers show major overhead involved in regex search and replace, and string replacement.</li>
<li>Is not perfect, but one of the best I have seen. For some reason, Wikilinks are converted to HTML links. Correcting this required modifying the source code.</li>
<li>Retooling the package to work with Hadoop Streaming is not too difficult, but requires some work and grokery that should be easier.</li>
</ul>
<p>Wikipedia Extractor is good for offline analysis, but users will probably want something that runs faster. Wikipedia Extractor parsed the entire Wikipedia dump in approximately 13 hours, on one core, which is quite painful. Add in further parsing and the processing time becomes unbearable even on multiple cores. A Hadoop Streaming job using Wikipedia Extractor, burdened by too much file I/O between Elastic MapReduce and S3, required 10 hours to complete on 15 c1.medium instances.&nbsp;</p>
<p><a href="http://blog.kenweiner.com/">Ken Weiner</a>&nbsp;(<a href="http://twitter.com/kweiner">@kweiner</a>) recently re-introduced me to the <a href="http://lintool.github.com/Cloud9/">Cloud9</a>&nbsp;package by <a href="http://www.umiacs.umd.edu/~jimmylin/">Jimmy Lin</a> (<a href="http://twitter.com/lintool">@lintool</a>) of Twitter, which fills in some of these gaps. I avoided it at first because Java is not the first language I like to turn to. Cloud9 is written in Java and designed with Hadoop MapReduce in mind. There is a method within the package that explicitly extracts the body text of each Wikipedia article. This method calls the <a href="http://code.google.com/p/gwtwiki/">Bliki</a> Wikipedia parsing library. One common problem with these Wikipedia parsers is that they often leave stray syntax in the output. Jimmy seems to wrap Bliki with his own code to do a better job of extracting high-quality, text-only output. Cloud9 also has counters and functions that detect non-article content such as redirects, disambiguation pages, and more.</p>
<p>Developers can introduce their own analysis, text mining and NLP code to process the article text in the mapper or reducer code. An example job distributed with Cloud9 which simply counts the number of pages in the corpus took approximately 15 minutes to run on 8 cores on an EC2 instance. A job that did more substantial processing required 3 hours to complete, and once the corpus was refactored as <a href="http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/SequenceFile.html">sequence files</a>, the same job took approximately 90 minutes to run.</p>
<p><strong>Conclusion</strong></p>
<p>I am looking forward to playing with Cloud9 some more&#8230; I will take 90 minutes over 10 hours any day! Wikipedia Extractor is an impressive Python package that does a very good job of extracting plain text from Wikipedia articles and for that I am grateful. Unfortunately, it is far too slow to be used on a pay-per-use system such as <a href="http://aws.amazon.com">AWS</a> or for quick processing. Cloud9 is a Java package designed with scalability and MapReduce in mind, allowing much quicker and more wallet friendly processing.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bytemining.com/2011/11/parsing-wikipedia-articles-wikipedia-extractor-and-cloud9/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>LexisNexis Open-Sources its Hadoop Alternative</title>
		<link>http://www.bytemining.com/2011/09/lexisnexis-open-sources-its-hadoop-alternative/</link>
		<comments>http://www.bytemining.com/2011/09/lexisnexis-open-sources-its-hadoop-alternative/#comments</comments>
		<pubDate>Sat, 10 Sep 2011 03:33:25 +0000</pubDate>
		<dc:creator>Ryan</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Databases and Datastores]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Statistics and Statistical Computing]]></category>

		<guid isPermaLink="false">http://www.bytemining.com/?p=938</guid>
		<description><![CDATA[<p>A month ago, I wrote about alternatives to the Hadoop MapReduce platform and HPCC was included in that article. For more information, see here.</p>
<p>LexisNexis has open-sourced its alternative to Hadoop, called High Performance Computing Cluster. The code is available on GitHub. For years the code was restricted to LexisNexis Risk Solutions. The system contains two major components:</p>

Thor (Thor Data Refinery Cluster) is the data processing framework. It &#8220;crunches, analyzes and indexes huge amounts of data a la Hadoop.&#8221;
Roxie (Roxie Rapid Data Delivery Cluster) is more like a data warehouse and is designed with quick querying in mind for frontends.

<p>The language that drives the whole process is the Enterprise Control Language (ECL), which is said to be faster and more efficient than Hadoop&#8217;s version of MapReduce. A picture is a much better way to show how the system works. Below is a diagram from the Gigaom article from which most of this information originates.</p>
<p></p>
<p>To me, Roxie seems much more exciting because it seems to complement (or replace) several technologies currently in the space. I do not know all the details, but it seems to potentially encapsulate technologies such as HBase, Hive, RabbitMQ and MemcacheDB, technologies that are commonly used to query and [...]]]></description>
			<content:encoded><![CDATA[<!-- Start Shareaholic LikeButtonSetTop Automatic --><!-- End Shareaholic LikeButtonSetTop Automatic --><p><img class="lfloatbox" src="http://www.bytemining.com/wp-content/uploads/2011/09/update.jpg" alt="" width="180px" /><em>A month ago, I wrote about <a href="http://www.bytemining.com/2011/08/hadoop-fatigue-alternatives-to-hadoop/">alternatives to the Hadoop MapReduce platform</a> and HPCC was included in that article. For more information, <a href="http://www.bytemining.com/2011/08/hadoop-fatigue-alternatives-to-hadoop/">see here</a>.</em></p>
<p><a href="http://www.lexisnexis.com">LexisNexis</a> has open-sourced its alternative to <a href="http://hadoop.apache.org/">Hadoop</a>, called <a href="http://gigaom.com/cloud/lexisnexis-open-sources-its-hadoop-killer/">High Performance Computing Cluster</a>. The code is available on <a href="https://github.com/hpcc-systems">GitHub</a>. For years the code was restricted to LexisNexis Risk Solutions. The system contains two major components:</p>
<ul>
<li><strong>Thor </strong>(Thor Data Refinery Cluster) is the data processing framework. It &#8220;crunches, analyzes and indexes huge amounts of data a la Hadoop.&#8221;</li>
<li><strong>Roxie </strong>(Roxie Rapid Data Delivery Cluster) is more like a data warehouse and is designed with quick querying in mind for frontends.</li>
</ul>
<p>The language that drives the whole process is the Enterprise Control Language (ECL), which is said to be faster and more efficient than Hadoop&#8217;s version of MapReduce. A picture is a much better way to show how the system works. Below is a diagram <a href="http://gigaom.com/cloud/lexisnexis-open-sources-code-for-hadoop-alternative/">from the Gigaom article from which most of this information originates</a>.</p>
<p><img style="display: block; margin-left: auto; margin-right: auto;" src="http://www.bytemining.com/wp-content/uploads/2011/09/img_hpcc_arch.jpg" alt="" width="680" height="358" /></p>
<p>To me, Roxie seems much more exciting because it seems to complement (or replace) several technologies currently in the space. I do not know all the details, but it seems to potentially encapsulate technologies such as <a href="http://hbase.apache.org/">HBase</a>, <a href="http://hive.apache.org/">Hive</a>, <a href="http://www.rabbitmq.com/">RabbitMQ</a> and <a href="http://memcached.org/">MemcacheDB</a>, technologies that are commonly used to query and speed data to a web frontend.</p>
<p>My opinion on HPCC is mixed. Although Hadoop has already taken off in usage, LexisNexis is a very strong institution and could potentially convince some corporate users to use their system instead &#8212; those that do not want to use <a href="http://research.microsoft.com/en-us/projects/dryad/">Microsoft&#8217;s Dryad project</a>. I do not see HPCC being a Hadoop killer, just as I do not see <a href="http://www.spark-project.org/">Spark</a> or any other alternative being a Hadoop killer. However, if HPCC does become a strong alternative, I sense this could be trouble for some of the newer players in the Hadoop field such as Hortonworks and MapR. I do not have much of an interest in studying business and competition, but <a href="http://www.bytemining.com/2011/08/hadoop-fatigue-alternatives-to-hadoop/">Hadoop Summit 2011</a> showed that the Hadoop space has become crowded, and small breakthroughs such as another company developing a similar project are enough to add volatility and uncertainty for all involved.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bytemining.com/2011/09/lexisnexis-open-sources-its-hadoop-alternative/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>SIGKDD 2011 Conference &#8212; Days 2/3/4 Summary</title>
		<link>http://www.bytemining.com/2011/08/sigkdd-2011-conference-days-234-summary-3/</link>
		<comments>http://www.bytemining.com/2011/08/sigkdd-2011-conference-days-234-summary-3/#comments</comments>
		<pubDate>Sat, 27 Aug 2011 18:00:00 +0000</pubDate>
		<dc:creator>Ryan</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Databases and Datastores]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Natural Language Processing]]></category>
		<category><![CDATA[Programming Languages]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[Statistics and Statistical Computing]]></category>
		<category><![CDATA[Web Mining]]></category>

		<guid isPermaLink="false">http://www.bytemining.com/?p=917</guid>
		<description><![CDATA[<p></p>
<p>&#60;&#60; My review of Day 1.</p>
<p>I am summarizing all of the days together since each talk was short, and I was too exhausted to write a post after each day. Due to the broken-up schedule of the KDD sessions, I group everything together instead of switching back and forth among a dozen different topics. By far the most enjoyable and interesting aspects of the conference were the breakout sessions.</p>
<p>Keynotes</p>
<p>KDD 2011 featured several keynote speeches that were spread out among three days and throughout the day. This year&#8217;s conference had a few big names.</p>

Stephen Boyd, Convex Optimization: From Embedded Real-Time to Large-Scale Distributed. The first keynote, by Stephen Boyd, discussed convex optimization. The goal of convex optimization is to minimize some objective function given linear constraints. The caveat is that the objective function and all of the constraints must be convex (&#8220;non-negative curvature&#8221; as Boyd said). The goal of convex optimization is to turn the problem into a linear programming problem. We should care about convex optimization because it comes from some beautiful and complete theory like duality and optimality conditions. I must say that whenever I am chastising statisticians, I often say that all they care about is &#8220;beautiful theory&#8221; [...]]]></description>
			<content:encoded><![CDATA[<!-- Start Shareaholic LikeButtonSetTop Automatic --><!-- End Shareaholic LikeButtonSetTop Automatic --><p><img src="http://www.bytemining.com/wp-content/uploads/2011/08/KDD_Banner_10_Jan.jpg" alt="" width="750" height="85" /></p>
<p><a href="http://www.bytemining.com/2011/08/sigkdd-2011-conference-day-1-graph-mining-and-david-bleitopic-models/">&lt;&lt; My review of Day 1.</a></p>
<p>I am summarizing all of the days together since each talk was short, and I was too exhausted to write a post after each day. Due to the broken-up schedule of the KDD sessions, I group everything together instead of switching back and forth among a dozen different topics. By far the most enjoyable and interesting aspects of the conference were the breakout sessions.</p>
<p><strong>Keynotes</strong></p>
<p>KDD 2011 featured several keynote speeches that were spread out among three days and throughout the day. This year&#8217;s conference had a few big names.</p>
<div><img class="lfloatbox" src="http://www.bytemining.com/wp-content/uploads/2011/08/stephen_boyd.jpg" alt="" /><br />
<em><a href="http://www.stanford.edu/~boyd/">Stephen Boyd</a>, Convex Optimization: From Embedded Real-Time to Large-Scale Distributed. </em>The first keynote, by Stephen Boyd, discussed <a href="http://en.wikipedia.org/wiki/Convex_optimization">convex optimization</a>. The goal of convex optimization is to minimize some objective function given linear constraints. The caveat is that the objective function and all of the constraints must be convex (&#8220;non-negative curvature&#8221; as Boyd said). The goal of convex optimization is to turn the problem into a linear programming problem. We should care about convex optimization because it comes from some beautiful and complete theory like duality and optimality conditions. I must say that whenever I am chastising statisticians, I often say that all they care about is &#8220;beautiful theory&#8221;, so his comment was humorous to me. Convex optimization is a very intuitive way to think about regression and techniques such as the lasso. Convex optimization has tons of use cases including parameter estimation (<a href="http://en.wikipedia.org/wiki/Maximum_likelihood_estimator">MLE</a>, <a href="http://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation">MAP</a>, <a href="http://en.wikipedia.org/wiki/Least_squares">least-squares</a>, <a href="http://en.wikipedia.org/wiki/Lasso_%28statistics%29#LASSO_method">lasso</a>, logistic SVM and modern <a href="http://pages.cs.wisc.edu/~gfung/GeneralL1/">L1 optimization</a>). Boyd showed an example of convex optimization for <a href="http://www2.cs.uregina.ca/~hamilton/courses/330/notes/io/node6.html">disk head scheduling</a>.</p>
<p>For more information about convex optimization, see the website for <a href="http://www.stanford.edu/~boyd/cvxbook/"><em>Convex Optimization </em>by Boyd and Vandenberghe</a>. The book is available for free, along with lecture slides and other materials. Even better, the second author is from UCLA! I did not realize that.</div>
<p><br clear="all" /></p>
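<p>As a small illustration of why the lasso fits so naturally into the convex optimization view, here is a sketch of the one-coefficient case, where the convex objective has a closed-form minimizer given by soft-thresholding. This is my own toy example, not something from Boyd&#8217;s talk. The function names <code>soft_threshold</code> and <code>lasso_1d</code> and the sample data are assumptions made for illustration.</p>

```python
import math


def soft_threshold(z, lam):
    """Shrink z toward zero by lam; anything within lam of zero becomes 0."""
    return math.copysign(max(abs(z) - lam, 0.0), z)


def lasso_1d(xs, ys, lam):
    """Minimize 0.5 * sum((y - b * x) ** 2) + lam * abs(b) over one coefficient b.

    The objective is convex in b, and setting its subgradient to zero
    gives b = soft_threshold(sum(x * y), lam) / sum(x * x).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return soft_threshold(sxy, lam) / sxx


# Toy data generated by y = 2 * x with no noise.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

b_ols = lasso_1d(xs, ys, 0.0)     # no penalty: ordinary least squares
b_lasso = lasso_1d(xs, ys, 7.0)   # moderate penalty: coefficient shrinks
b_zero = lasso_1d(xs, ys, 50.0)   # large penalty: coefficient is exactly zero
print(b_ols, b_lasso, b_zero)
```

<p>The L1 penalty is what pushes coefficients to exactly zero once the penalty is large enough, which is the sparsity property that makes the lasso attractive.</p>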
<div><img class="rfloatbox" src="http://www.bytemining.com/wp-content/uploads/2011/08/norvig.jpg" alt="" /><br />
<a href="http://norvig.com/"><em>Peter Norvig</em></a>, <em>Internet Scale Data Analysis</em>. It is always great to hear from Peter Norvig. At the very least, you may have seen his name on your introductory Artificial Intelligence textbook <a href="http://aima.cs.berkeley.edu/"><em>Artificial Intelligence: A Modern Approach</em></a>. Norvig is also well known as the Director of Research at Google. He also spoke at SciPyCon 2009 and was wearing a similarly flashy shirt. Norvig discussed how to get around long latencies in a large-scale system. Interestingly, his talk began with a discussion about Google&#8217;s interest in its carbon footprint because of course all of Google&#8217;s massive systems require a lot of power. The carbon output of 2500 queries is approximately equal to the carbon output of a beer. Norvig noted that most of Google&#8217;s most successful engineers are well-versed in distributed systems, and this should come as no surprise. He then introduced MapReduce and showed an example of how Google uses MapReduce to process map tiles for Google Maps. Norvig concluded by mentioning a variety of large systems used by Google including <a href="http://labs.google.com/papers/bigtable.html">BigTable</a> (a column-oriented store) and <a href="http://googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html">Pregel</a> for graph processing. Pregel is vertex based, and thus programs &#8220;think like a vertex&#8221; where each vertex responds to actions transmitted over edges.</div>
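<p>To give a feel for what &#8220;think like a vertex&#8221; means, here is a rough single-machine sketch in Python. It is not Pregel&#8217;s real API (which is not public); the function name <code>pregel_max</code>, the adjacency-list format and the sample graph are all assumptions for illustration. Each vertex only looks at the messages it received in the previous superstep, and it only sends new messages when its own value changes.</p>

```python
def pregel_max(graph, values):
    """Spread the largest vertex value to every vertex of a connected graph.

    graph maps each vertex to a list of its neighbors (undirected edges are
    listed in both directions); values maps each vertex to a number.
    """
    values = dict(values)  # work on a copy so the caller's dict is untouched

    # Superstep 0: every vertex sends its own value along its edges.
    inbox = {v: [] for v in graph}
    for v in graph:
        for neighbor in graph[v]:
            inbox[neighbor].append(values[v])

    # Later supersteps: a vertex wakes up only if it received messages, and
    # it sends messages only if one of them raised its value.
    while any(inbox.values()):
        outbox = {v: [] for v in graph}
        for v, messages in inbox.items():
            if not messages:
                continue  # no messages: this vertex stays halted
            new_value = max(messages + [values[v]])
            if new_value != values[v]:
                values[v] = new_value
                for neighbor in graph[v]:
                    outbox[neighbor].append(new_value)
        inbox = outbox  # messages are delivered at the next superstep

    return values


# Hypothetical path graph: a - b - c - d
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
start = {"a": 3, "b": 6, "c": 2, "d": 1}
result = pregel_max(graph, start)
print(result)
```

<p>The computation stops on its own once no vertex has anything left to say, which is the same halting idea a vertex-centric system relies on at scale.</p>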
<p><br clear="all"/><br />
(There was a keynote by a fellow named <a href="http://www.cbse.ucsc.edu/people/haussler">David Haussler</a> about cancer genomics. After an exhausting first two days, I skipped this talk as I needed to sleep&#8230;and I was not incredibly interested in the topic.)</p>
<div><img class="lfloatbox" src="http://www.bytemining.com/wp-content/uploads/2011/08/Judea_Pearl.jpg" alt="" width="233px" /><em><a href="http://bayes.cs.ucla.edu/jp_home.html">Judea Pearl</a>, The Mathematics of Causal Inference. </em>Go Bruins! Judea Pearl is a professor at the <a href="http://www.cs.ucla.edu">UCLA Department of Computer Science</a> and teaches a course on his field, Causality, each spring. His talk was essentially the same talk he gives at UCLA at the beginning of the quarter. I attempted to take his course in 2009, but quite frankly, I don&#8217;t get it and my mind cannot bend into that realm. I remember sitting in his class and wondering &#8220;what is wrong with me?&#8221; I love listening to Dr. Pearl speak, if only because of his sense of humor. Despite his age and the fact that he is slowing down, he had the crowd in hysterics as he struggled with the presentation technology and made intelligent jokes at every chance.</p>
<p>Pearl believes that humans do not communicate with probability, but causality (I do not agree with this entirely). I appreciated that he mentioned that it takes work to overcome the difference in thinking between probability and causality. In statistics, we use some data and a joint distribution <em>P</em> to make inferences about some quantity or variable. In causality, there is an intentional intervention that changes the joint distribution <em>P </em>into another joint distribution <em>P&#8217;</em>. Causality requires new language and mathematics (I do not see it). In order to use causality, one must introduce some untestable hypotheses. Pearl mentioned that some non-standard mathematical methods include counterfactuals and structural equation modeling. I do not know how I feel about any of this. <a href="http://bayes.cs.ucla.edu/BOOK-2K/">For more information about Pearl&#8217;s Causality, check out his book</a>.</div>
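<p>Here is a toy sketch, in Python, of the distinction between conditioning on the original joint distribution <em>P</em> and conditioning after an intervention has changed it into <em>P&#8217;</em>. The model is entirely made up for illustration and is not from Pearl&#8217;s talk: a hidden factor <code>Z</code> influences both <code>X</code> and <code>Y</code>, while <code>X</code> has no effect on <code>Y</code> at all. All names and probabilities are assumptions.</p>

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical model: Z -> X and Z -> Y, with no arrow from X to Y.
P_Z = {1: F(1, 2), 0: F(1, 2)}
P_X1_GIVEN_Z = {1: F(9, 10), 0: F(1, 10)}
P_Y1_GIVEN_Z = {1: F(4, 5), 0: F(1, 5)}


def bernoulli(p_one, value):
    """Probability that a binary variable takes `value`, given P(value = 1)."""
    return p_one if value == 1 else 1 - p_one


def joint(do_x=None):
    """Return {(z, x, y): probability}.

    With do_x=None this is the observational distribution P. Otherwise X is
    forced to do_x regardless of Z, which gives the intervened distribution.
    """
    table = {}
    for z, x, y in product((0, 1), repeat=3):
        if do_x is None:
            p_x = bernoulli(P_X1_GIVEN_Z[z], x)
        else:
            p_x = F(1) if x == do_x else F(0)
        table[(z, x, y)] = P_Z[z] * p_x * bernoulli(P_Y1_GIVEN_Z[z], y)
    return table


def prob_y1_given_x1(table):
    """P(Y = 1 | X = 1) computed from a joint probability table."""
    numerator = sum(p for (z, x, y), p in table.items() if x == 1 and y == 1)
    denominator = sum(p for (z, x, y), p in table.items() if x == 1)
    return numerator / denominator


observed = prob_y1_given_x1(joint())            # seeing X = 1
intervened = prob_y1_given_x1(joint(do_x=1))    # setting X = 1
print(observed, intervened)
```

<p>In this toy model, seeing <code>X = 1</code> makes <code>Y = 1</code> look likely only because both depend on <code>Z</code>. Forcing <code>X = 1</code> breaks that link, so the two numbers disagree even though they are computed from the same ingredients.</p>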
<p><br clear="all" /><br />
<strong>Data Mining Competitions</strong></p>
<p>One interesting event during KDD 2011 was the panel <em>Lessons Learned from Contests in Data Mining. </em>This panel featured Jeremy Howard (<a href="http://www.kaggle.com">Kaggle</a>), Yehuda Koren (<a href="http://www.yahoo.com">Yahoo</a>!), Tie-Yan Liu (<a href="http://www.microsoft.com">Microsoft</a>), and Claudia Perlich (<a href="http://media6degrees.com/">Media6Degrees</a>). Both Kaggle and Yahoo <em>run </em>data mining competitions: Kaggle has its own series of competitions and Yahoo is a major sponsor of the <a href="http://www.sigkdd.org/kddcup/">KDD Cup</a> competition. Perlich has participated in and won many data mining competitions. Liu provided a different insight into data mining competitions as an industry observer.</p>
<p>Jeremy Howard gave some insight into the history of data mining competitions. He credited KDD 97 with the formation of the first data mining competition. He announced to the crowd that companies spend 100 billion dollars every year on data mining products and services (not including in-house costs such as employment) and that there are approximately 2 million Data Scientists. The estimate of the number of Data Scientists was based on the number of times R was downloaded, and is an estimate based on David Smith&#8217;s (Revolution Computing) blog post. I love R, and every Data Scientist should use it, but there are several problems with this estimate. First, not everyone that uses R is a Data Scientist; a large portion of R users are statisticians (&#8220;beautiful theory&#8221;), teachers, miscellaneous students etc. Second, not all Data Scientists use R. Some are even more creative and write their own tools or use little-adopted software packages. There are also a lot of Data Scientists that use Python instead of R. Howard also announced that over the next year, Kaggle will be starting thousands of &#8220;invitation only&#8221; competitions. Personally, I do not care for this type of exclusion even though their intentions are good.</p>
<p>Yehuda Koren introduced the crowd to Yahoo&#8217;s involvement in data mining competitions. Yahoo is a major force behind the KDD Cup and the <a href="http://www.heritagehealthprize.com/c/hhp">Heritage Health Prize competition</a>. Yahoo also won a progress award in the Netflix challenge. Koren then described how data mining competitions help the community. Competitions raise awareness and attract research to a field, end up involving the release of a cool dataset to the community, encourage contribution and education, and provide publicity for participants and winners. Contestants are attracted to competitions for various reasons including fun, competitiveness, fame, the desire to learn more, peer pressure and of course the monetary reward. As with every competition, data mining competitions have rules, and Koren stated that rules are very difficult to enforce. I believe that data mining is vague as it is, so competitions would be just as vague. It is important that the rules discourage participation as little as possible while maximizing fairness and innovation. Some such &#8220;rules&#8221; cover discouraging huge ensembles (which probably overfit anyway), submission frequency, team duplication and team size (the KDD Cup winning team had 25 members). Some obvious keys to success in data mining competitions are ensembles, hard work, team size, innovation vs. fancy models, quick coding and patience.</p>
<p>I felt that Tie-Yan Liu from Microsoft sort of served as the Simon Cowell of the panel, and I feel that his role was necessary. He provided industry insight that served as a bit of a reality check as to what data mining competitions accomplish and do not accomplish. Liu questioned whether the problems being solved in data mining competitions are really important problems. Part of the problem is that many datasets are censored so as to protect privacy. Additionally, the really interesting problems cannot be opened to the public because they involve trade secrets. I consider myself an inclusive guy &#8211; I do not like the concept of winners and losers. I was elated that Liu brought up this point: &#8220;what about the losers?&#8221; Is it bad publicity to &#8220;lose&#8221; several (or all) competitions? The answer to this question varies person-to-person. I honestly believe that the goal of these competitions is of the open-source nature (fun, share, learn, solve) and not so much to cure cancer. They are great for college students and people that are interested in data science but do not have access to great data. For the rest of us, learning on our own using interesting data is probably better.</p>
<p>Claudia Perlich (Media6Degrees) discussed her experience participating in data mining competitions. She has won several contests. She commented on the distinction between sterile/cleaned data and real data as competitions can include either type. The concept of Occam&#8217;s Razor applies to data mining competitions; Perlich won most of her competitions using a linear model, but by using more complex and creative features. Perlich emphasizes that complex features are better than complex models.</p>
<p>Considering the <a href="http://www.netflixprize.com/">Netflix Prize</a> has been one of the biggest data mining competitions, I was disappointed that they were not represented on the panel since there were several researchers from Netflix at the conference.</p>
<p><em>Rather than write a few sentences for each topic, I will just bullet the goals of the research discussed in the sessions. Descriptions with a star (*) denote my favorite papers and are cited later.</em></p>
<p><strong>Text Mining</strong></p>
<p>I attended two of the three text mining sessions. I must say that I am quite topic-modeled and LDAed out! <a href="http://en.wikipedia.org/wiki/Latent_Dirichlet_Allocation">Latent Dirichlet Allocation (LDA)</a> and several variations were part of every talk I heard. That was very exciting and reaffirms that I am in a hot field. Still, nobody has taken my dissertation topic yet (which I have remained quiet about).</p>
<ul>
<li>Using explicit user feedback to improve LDA and display topics appropriately by combining topic labels, topic n-grams and capitalization/entity detection.* This talk was presented by David Andrzejewski (@<a href="http://www.twitter.com/davidandrzej">davidandrzej</a>). I finally got to meet him and I discussed my dissertation topic with him. I am always entertained by the fact that we all look much different than our Twitter avatars portray. <img src='http://www.bytemining.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </li>
<li>Using external metadata and topics (LDA) to predict user ratings on items using localized factor models.</li>
<li>Using preferences and relative emphasis of each factor (i.e. how important to you is free wireless Internet in a hotel room?) to predict rating scores.*</li>
<li>Determining the network process that created a piece of text: who copied from whom?</li>
<li>Using a topic model (LDA) with other features such as part-of-speech tag (noun, verb etc.), <a href="http://wordnet.princeton.edu/">WordNet</a> features, sentiment/polarity etc.*</li>
<li>Modeling how topics and interests grow over time and understanding the correlations between terms over time.*</li>
</ul>
<p><strong>Social Network Analysis and Graph Analysis</strong></p>
<p>The Social Networks session conflicted with one of the Text Mining sessions, but since I knew there would be two more, I decided to attend this one instead. I also combined the two Graph Analysis sessions into this section since they are so related. The goals of the research presented in these talks were as follows:</p>
<ul>
<li>To label venue (Foursquare venues etc.) types (restaurant, bar, park etc.) based on several attributes of the user: user&#8217;s friends, user&#8217;s weekly and daily schedule using label propagation.</li>
<li>To determine the connections/edges in a social network that are the most critical for propagation of data (an idea, tweet, viral marketing etc.)*</li>
<li>To use tagging (items on Amazon can be tagged with keywords by users) and reviews to predict the success of a new item.</li>
<li>To find a better metric for ranking search engine results by starting with a relevant subgraph rather than a random surfer model. Also models attention span of user.*</li>
<li>Classification of nodes, labeling of nodes and node link prediction using one unified algorithm (C3).*</li>
<li>Ranking using large graphs using a priori information about good/bad nodes and edges.*</li>
<li>The importance of bias in sampling from networks.*</li>
</ul>
<p><strong>User Modeling</strong></p>
<p>This session I suspect was similar to the Web User Modeling session and focused on recommendation engines and rating prediction.</p>
<ul>
<li>Using endorsements (retweets, likes, etc.) to measure user bias and perform real-time sentiment analysis.</li>
<li>Estimating user reputation using thumbs-up vote rates on Yahoo News comments.</li>
<li>Selecting a set of reviews that encapsulates the most information about a product with the most diverse viewpoints.</li>
</ul>
<p><strong>Frequent Sets</strong></p>
<p>I did some work with <a href="http://en.wikipedia.org/wiki/Association_rule_learning">itemset mining</a> at my last job and I was not incredibly interested in the Online Data and Streams session at the time so I attended this talk.</p>
<ul>
<li>Using background knowledge about transactions to minimize redundancy.</li>
<li>Studying the effects of order on itemset mining.</li>
<li>Mining graphs as frequent itemsets from streams.</li>
</ul>
<p><strong>Classification</strong></p>
<p>I got stuck in this session because the session I really wanted to attend, &#8220;Web User Modeling&#8221;, was full and there was nowhere to sit or stand. This session was more technical and theoretical. The only talk that I really enjoyed was about <a href="http://dl.acm.org/ft_gateway.cfm?id=2020418&amp;ftid=1012890&amp;dwn=1&amp;CFID=37256003&amp;CFTOKEN=68211979">a classifier called CHIRP. I did not follow the details, but this is a paper that I am interested in reading</a>. The authors used a classifier based on Composite Hypercubes on Iterated Random Projections to classify spaces that have complex topology (think of classifying items that appear in a bullseye/dartboard pattern).*</p>
<p><strong>Unsupervised Learning</strong></p>
<p>This talk was similar to the classification talk but more practical in my opinion.</p>
<ul>
<li>Using decision trees for density estimation classifiers.</li>
<li>Clustering cell phone user behavior using &#8220;Earth Mover&#8221; distance.</li>
<li>Clustering of multidimensional data using mixture modeling with components of different distributions and copulas.*</li>
</ul>
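<p>For readers who have not run into the &#8220;Earth Mover&#8221; distance mentioned above, here is a minimal Python sketch of the one-dimensional case. It is only my illustration of the general idea and not the method from the paper. The function name <code>emd_1d</code> and the hourly-usage histograms are assumptions, and the shortcut used here only works for histograms over equally spaced bins with the same total mass.</p>

```python
import math
from itertools import accumulate


def emd_1d(p, q):
    """Earth Mover distance between two histograms on the same unit-spaced bins.

    In one dimension the cheapest way to turn p into q is to sweep from left
    to right and carry the running surplus to the next bin, so the total work
    is the sum of the absolute running differences.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    if not math.isclose(sum(p), sum(q)):
        raise ValueError("histograms must have the same total mass")
    carried = accumulate(a - b for a, b in zip(p, q))
    return sum(abs(c) for c in carried)


# Hypothetical share of calls made in four consecutive time slots.
morning_user = [0.5, 0.5, 0.0, 0.0]
evening_user = [0.0, 0.0, 0.5, 0.5]
midday_user = [0.0, 0.5, 0.5, 0.0]

d_far = emd_1d(morning_user, evening_user)
d_near = emd_1d(morning_user, midday_user)
print(d_far, d_near)
```

<p>Unlike a bin-by-bin comparison, this distance treats the midday user as closer to the morning user than the evening user is, because less &#8220;mass&#8221; has to be moved a shorter way.</p>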
<p><strong>Favorite Papers</strong></p>
<p>Below is a short bibliography of my favorite papers. There were also a few at the poster session (the first four) that I include here.</p>
<ul>
<li><a href="http://dl.acm.org/ft_gateway.cfm?id=2020603&amp;ftid=1013043&amp;dwn=1&amp;CFID=37256003&amp;CFTOKEN=68211979"><em>Ranking-Based Classification of Heterogeneous Information Networks</em></a>, Ming Ji, Jiawei Han, Marina Danilevsky.</li>
<li><em><a href="http://dl.acm.org/ft_gateway.cfm?id=2020561&amp;ftid=1013057&amp;dwn=1&amp;CFID=37256003&amp;CFTOKEN=68211979">Axiomatic Ranking of Network Role Similarity</a>, </em>Ruoming Jin, Victor E. Lee, Hui Hong.</li>
<li><a href="http://dl.acm.org/ft_gateway.cfm?id=2020558&amp;ftid=1013004&amp;dwn=1&amp;CFID=37256003&amp;CFTOKEN=68211979"><em>Approximate Kernel k-means: Solutions to Large Scale Kernel Clustering</em></a></li>
<li><em><a href="http://dl.acm.org/ft_gateway.cfm?id=2020614&amp;ftid=1013054&amp;dwn=1&amp;CFID=37256003&amp;CFTOKEN=68211979">User-Level Sentiment Analysis Incorporating Social Networks</a>, </em>Chenhao Tan, Lillian Lee, Jie Tang, Long Jiang, Ming Zhou, Ping Li.</li>
<li><a href="http://dl.acm.org/ft_gateway.cfm?id=2020503&amp;ftid=1012957&amp;dwn=1&amp;CFID=37256003&amp;CFTOKEN=68211979"><em>Latent Topic Feedback for Information Retrieval</em></a>, David Andrzejewski, Lawrence Livermore National Laboratory; David Buttler, Lawrence Livermore National Laboratory</li>
<li><a href="http://dl.acm.org/ft_gateway.cfm?id=2020505&amp;ftid=1012959&amp;dwn=1&amp;CFID=37256003&amp;CFTOKEN=68211979"><em>Latent Aspect Rating Analysis without Aspect Keyword Supervision</em></a>, Hongning Wang, UIUC; Yue Lu, University of Illinois; ChengXiang Zhai, UIUC</li>
<li><a href="http://dl.acm.org/ft_gateway.cfm?id=2020484&amp;ftid=1012943&amp;dwn=1&amp;CFID=37256003&amp;CFTOKEN=68211979"><em>Conditional Topical Coding: an Efficient Topic Model Conditioned on Rich Features</em></a>, Jun Zhu, Carnegie Mellon University; Ni Lao, Carnegie Mellon University; Ning Chen, Tsinghua University; Eric Xing, CMU</li>
<li><a href="http://dl.acm.org/ft_gateway.cfm?id=2020485&amp;ftid=1012944&amp;dwn=1&amp;CFID=37256003&amp;CFTOKEN=68211979"><em>Tracking Trends: Incorporating Term Volume into Temporal Topic Models</em></a>, Liangjie Hong, Lehigh University; Dawei Yin, Lehigh University; Jian Guo, University of Michigan; Brian Davison, Lehigh University</li>
<li><a href="http://dl.acm.org/ft_gateway.cfm?id=2020428&amp;ftid=1012898&amp;dwn=1&amp;CFID=37256003&amp;CFTOKEN=68211979"><em>Diversity in ranking via resistive graph centers</em></a>, Kumar Dubey, IBM Research; Soumen Chakrabarti, &#8220;Indian Institute of Technology, Bombay&#8221;; Chiru Bhattacharya, IISc</li>
<li><a href="http://dl.acm.org/ft_gateway.cfm?id=2020429&amp;ftid=1012899&amp;dwn=1&amp;CFID=37256003&amp;CFTOKEN=68211979"><em>Collective Graph Identification</em></a>, Galileo Namata, University of Maryland; Stanley Kok, University of Maryland; Lise Getoor, &#8220;University of Maryland, College Park&#8221;</li>
<li><a href="http://dl.acm.org/ft_gateway.cfm?id=2020430&amp;ftid=1012900&amp;dwn=1&amp;CFID=37256003&amp;CFTOKEN=68211979"><em>Semi-Supervised Ranking on Very Large Graph with Rich Metadata</em></a>, Bin Gao, Microsoft Research Asia; Tie-Yan Liu, Microsoft Research Asia; Wei Wei; Taifeng Wang, Microsoft Research; Hang Li, Microsoft</li>
<li><a href="http://dl.acm.org/ft_gateway.cfm?id=2020431&amp;ftid=1012901&amp;dwn=1&amp;CFID=37256003&amp;CFTOKEN=68211979"><em>Benefits of Bias: Towards Better Characterization of Network Sampling</em></a>, Arun Maiya, UIC; Tanya Berger-Wolf, University of Illinois at Chicago</li>
<li><a href="http://dl.acm.org/ft_gateway.cfm?id=2020418&amp;ftid=1012890&amp;dwn=1&amp;CFID=37256003&amp;CFTOKEN=68211979"><em>CHIRP: A new classifier based on Composite Hypercubes on Iterated Random Projections</em></a>, Leland Wilkinson, Systat; Anushka Anand, UIC; Tuan Dang, UIC</li>
<li><a href="http://dl.acm.org/ft_gateway.cfm?id=2020492&amp;ftid=1012949&amp;dwn=1&amp;CFID=37256003&amp;CFTOKEN=68211979"><em>Sparsification of Influence Networks</em></a>, Michael Mathioudakis, University of Toronto; Francesco Bonchi, Yahoo! Research; Carlos Castillo, Yahoo!; Aristides Gionis, Yahoo! Research Barcelona; Antti Ukkonen,</li>
<li><a href="http://dl.acm.org/ft_gateway.cfm?id=2020509&amp;ftid=1012962&amp;dwn=1&amp;CFID=37256003&amp;CFTOKEN=68211979"><em>Online heterogeneous mixture modeling with marginal and copula selection</em></a>, Ryohei Fujimaki, NEC Laboratories America; Yasuhiro Sogawa, ; Satosi Morinaga,</li>
</ul>
<p><strong>Wrapping Up</strong></p>
<p>I had an awesome time at KDD and wish I could go next year, but it will be held in Beijing. I got to meet a lot of different people in the field that have the same passion for data and that was really cool. I got to meet with recruiters from a few different companies and get some swag from Yahoo and Google.</p>
<p>It was awesome being around such greatness. I ran into Peter Norvig several times, ran into Judea Pearl in the restroom (I already know him), as well as <a href="http://www.cs.cmu.edu/~christos/">Christos Faloutsos</a> (I am a huge fan) and <a href="http://www.rulequest.com/Personal/">Ross Quinlan</a>. I stopped at the Springer booth and found a <a href="http://www.amazon.com/gp/product/1441965149">cool book about link prediction with Faloutsos as one of the authors</a>. I went to buy it, handed the lady my credit card, and learned that it was $206 (AFTER conference discount)! Interestingly&#8230; Amazon has the same book for $165. I will probably order it anyway.</p>
<p>Here&#8217;s hoping that KDD returns to California (or the US) real soon!
</p>
<p style="text-align: center;"><img src="http://www.bytemining.com/wp-content/uploads/2011/08/sd.jpg" alt="" /></p>
<p><a href="http://www.bytemining.com/2011/08/sigkdd-2011-conference-day-1-graph-mining-and-david-bleitopic-models/">&lt;&lt; My review of Day 1.</a></p>
<p><strong>Candid Shots</strong></p>
<table border="0">
<tr>
<td>
<img src="http://www.bytemining.com/wp-content/uploads/2011/08/quinlan.jpg" alt="" />
</td>
<td>
<img src="http://www.bytemining.com/wp-content/uploads/2011/08/faloutsos.jpg" alt="" />
</td>
</tr>
<tr>
<td>Ross Quinlan enjoying a beer during the poster session. What a cool guy!</td>
<td>Christos Faloutsos talking with a student during the poster session.</td>
</tr>
</table>
<div class="shr-publisher-917"></div><!-- Start Shareaholic LikeButtonSetBottom Automatic --><div style="clear: both; min-height: 1px; height: 3px; width: 100%;"></div><div class='shareaholic-like-buttonset' style='float:none;height:30px;'><a class='shareaholic-googleplusone' data-shr_size='standard' data-shr_count='true' data-shr_href='http%3A%2F%2Fwww.bytemining.com%2F2011%2F08%2Fsigkdd-2011-conference-days-234-summary-3%2F' data-shr_title='SIGKDD+2011+Conference+--+Days+2%2F3%2F4+Summary'></a></div><div style="clear: both; min-height: 1px; height: 3px; width: 100%;"></div><!-- End Shareaholic LikeButtonSetBottom Automatic -->]]></content:encoded>
			<wfw:commentRss>http://www.bytemining.com/2011/08/sigkdd-2011-conference-days-234-summary-3/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SIGKDD 2011 Conference &#8212; Day 1 (Graph Mining and David Blei/Topic Models)</title>
		<link>http://www.bytemining.com/2011/08/sigkdd-2011-conference-day-1-graph-mining-and-david-bleitopic-models/</link>
		<comments>http://www.bytemining.com/2011/08/sigkdd-2011-conference-day-1-graph-mining-and-david-bleitopic-models/#comments</comments>
		<pubDate>Mon, 22 Aug 2011 16:41:22 +0000</pubDate>
		<dc:creator>Ryan</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Databases and Datastores]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Natural Language Processing]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[Statistics and Statistical Computing]]></category>

		<guid isPermaLink="false">http://www.bytemining.com/?p=877</guid>
		<description><![CDATA[<p></p>
<p>I have been waiting for the KDD conference to come to California, and I was ecstatic to see it held in San Diego this year. AdMeld did an awesome job displaying KDD ads on the sites that I visit, sometimes multiple times per page. That&#8217;s good targeting!</p>
<p>Mining and Learning on Graphs Workshop 2011</p>
<p>I had originally planned to attend the 2-day workshop Mining and Learning with Graphs (MLG2011) but I forgot that it started on Saturday and I arrived on Sunday. I attended part of MLG2011 but it was difficult to pay attention considering it was my first time waking up at 7am in a long time. The first talk I arrived for was Networks Spill the Beans by Lada Adamic from the University of Michigan. Adamic&#8217;s presented work involved inferring properties of content (the &#8220;what&#8221;) using network structure alone (using only the &#8220;who&#8221;: who shares with whom). One example she presented involved questions and answers on a Java programming language forum. The research problem was to determine things such as who is most likely to answer a Java beginner&#8217;s question: a guru, or a slightly more experienced user? Another research question asked what dynamic interactions tell us about information flow. [...]]]></description>
			<content:encoded><![CDATA[<!-- Start Shareaholic LikeButtonSetTop Automatic --><!-- End Shareaholic LikeButtonSetTop Automatic --><p><img src="http://www.bytemining.com/wp-content/uploads/2011/08/KDD_Banner_10_Jan.jpg" alt="" width="750" height="85" /></p>
<p>I have been waiting for the KDD conference to come to California, and I was ecstatic to see it held in San Diego this year. <a href="http://www.admeld.com">AdMeld</a> did an awesome job displaying KDD ads on the sites that I visit, sometimes multiple times per page. That&#8217;s good targeting!</p>
<p><strong>Mining and Learning on Graphs Workshop 2011</strong></p>
<p>I had originally planned to attend the 2-day workshop <a href="http://www.cs.purdue.edu/mlg2011/">Mining and Learning with Graphs (MLG2011)</a> but I forgot that it started on Saturday and I arrived on Sunday. I attended part of MLG2011 but it was difficult to pay attention considering it was my first time waking up at 7am in a long time. The first talk I arrived for was <em>Networks Spill the Beans </em>by <a href="http://www.ladamic.com/">Lada Adamic</a> from the University of Michigan. Adamic&#8217;s presented work involved inferring properties of content (the &#8220;what&#8221;) using network structure alone (using only the &#8220;who&#8221;: who shares with whom). One example she presented involved questions and answers on a Java programming language forum. The research problem was to determine things such as who is most likely to answer a Java beginner&#8217;s question: a guru, or a slightly more experienced user? Another research question asked what dynamic interactions tell us about information flow. For this example, Adamic used data from the virtual world <a href="http://secondlife.com/">SecondLife</a>. Certain landmarks (such as a bench) can be bookmarked by users and certain gestures (like a kiss) can be studied. This made my ears perk up. SecondLife is a treasure trove of cool data. Is there a way to access it? It looks like there might be a way to access some of it including monetary valuation, market purchases, and several APIs for different aspects of SecondLife. I will have to look into that later though. Adamic concluded with a discussion of Twitter as a social network, but I was starting to fall asleep from my hectic and early morning. The gist of her talk, and many other talks in this field, was to combine semantic variables (NLP) with topological variables (SNA) to predict some other semantic variables. 
This talk was very digestible, and very interesting (despite my lack of sleep), but featured some of the worst visualizations I have ever seen (area plots representing correlations across multiple levels of an ordinal variable), but that was minor. Of course, <a href="http://www.flowingdata.com">Nathan</a> might disagree <img src='http://www.bytemining.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> .</p>
<p>Social network analysis, and network analysis in general, is a field that I really want to sink my teeth into. The difficulty I have is that the discussion of this field seems to involve so much vernacular that is specific to the field that everything seems so much more difficult than it really is.</p>
<p>At this point I took off to lunch. Just across from the <a href="http://manchestergrand.hyatt.com/hyatt/hotels/index.jsp?null">Hyatt</a> (a beautiful hotel by the way) is <a href="http://www.seaportvillage.com/">Seaport Village</a>, a beautiful waterfront park containing nice landscaping, shops, restaurants, all with the ocean in the background. There is no beach there &#8212; the village backs right up to the water. Across the bay is some type of military complex and <a href="http://www.coronado.ca.us/">Coronado Island</a>. I had a $7 hot dog, followed by a chocolate-covered strawberry and a peanut butter cup from the nearby candy store. It was such a nice day I walked around for a while, grabbed a strawberry shake and then headed back for the next session&#8230; the one I had been waiting for!</p>
<div align="center">
<table>
<tr>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/08/2011-08-21_12-52-40_687.jpg" width="288px"/></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/08/2011-08-21_12-38-45_574.jpg" width="288px"/></td>
</tr>
<tr>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/08/2011-08-21_10-33-32_593.jpg"/></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/08/2011-08-21_12-57-35_758.jpg" width="288px"/></td>
</tr>
</table>
</div>
<p><strong>Afternoon Tutorial: Probabilistic Topic Models </strong><br />
<em>David Blei, Princeton</em></p>
<p>My dissertation topic is related to <a href="http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation">Latent Dirichlet allocation</a> (well, topic modeling in general), so I was definitely interested to hear what the father of LDA had to say. Since this was a 3 hour tutorial, I was expecting that <a href="http://www.cs.princeton.com/~blei">Blei</a> would start with the unigram model, and then discuss <a href="http://en.wikipedia.org/wiki/Latent_semantic_analysis">Latent Semantic Analysis (LSA)</a> and <a href="http://en.wikipedia.org/wiki/PLSI">Probabilistic Latent Semantic Indexing (pLSI)</a> building up to LDA. Instead, Blei started with LDA and for good reason! In this post, I will not summarize the mechanics of Latent Dirichlet Allocation as that is another post entirely. For some introduction, see <a href="http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf">here</a>. LDA and its extensions can be used to model the evolution of topics over time, to model the connections among topics, and to predict links among objects in a network. Topic modeling is a case study in machine learning rather than a field in itself; topic modeling draws on several different concepts including Bayesian statistics, time series analysis, hierarchical models, Markov chain Monte Carlo (MCMC), Bayesian non-parametric statistics and sparsity. In LDA, a document is represented as a mixture of topics (some hypothetical quantity that captures content clustering), and a topic is a distribution over words in a vocabulary.</p>
<p>Again, this is a high-level description of what was discussed. A full mechanical analysis would require dozens of pages. LDA is just a probabilistic model. As such, there are established ways for estimating the parameters of the model as well as the topic assignments. Some of these include <a href="http://www.springerlink.com/index/N811M25287935571.pdf">mean field variational methods</a>, <a href="http://research.microsoft.com/en-us/um/people/minka/papers/ep/">expectation propagation</a>, <a href="http://en.wikipedia.org/wiki/Gibbs_sampling">Gibbs sampling</a>, collapsed Gibbs sampling, collapsed variational Bayes and online variational Bayes. Each of these estimation methods has its own advantages and disadvantages. Blei showed that LDA and pLSI have a lot in common. Unlike LDA, pLSI uses <a href="http://en.wikipedia.org/wiki/Maximum_likelihood">maximum likelihood estimation</a> (and the <a href="http://en.wikipedia.org/wiki/Expectation-maximization_algorithm">EM algorithm</a>) for parameter estimation; pLSI tends to overfit badly. The hyperparameter &alpha; adds regularization to the ϴ parameter in the LDA model. [Sorry to refer to these random parameters, but it is difficult to describe without them. See the links mentioned earlier for an overview of LDA.]</p>
<p><em>Preprocessing. </em>A lot of preprocessing must be performed before computing a topic model. First, we should <a href="http://en.wikipedia.org/wiki/Stop_words"><strong>remove stopwords</strong></a>, which are words that provide absolutely no clues to the content of the text. If we leave stopwords in the corpus when computing the model, we may end up with meaningless topics that are described with only stopwords, due to their high probability. Second, Blei mentioned that <a href="http://en.wikipedia.org/wiki/Stemming"><strong>stemming</strong></a> is a good idea, but modern stemming algorithms tend to be too aggressive. If resources allow, I think it would be useful to have humans manually strip words to their root words. <strong>Multiword phrases </strong>such as &#8220;black hole&#8221; are also an issue. With sufficient resources, one could ask human labelers to identify these phrases and recode them as a single word by replacing the space between words with an underscore. <a href="http://www.cs.umass.edu/~wallach/publications/wallach06topic.pdf">Hanna Wallach (U. Mass) has a paper</a> that describes how to identify multiword phrases by using <em>n</em>-grams. Blei has a similar paper that discusses an algorithm called <a href="http://arxiv.org/abs/0907.1013">TurboTopics</a>. He also mentioned that a standard statistical hypothesis test such as chi-squared, permutation tests, or a nested hypothesis test would also be sufficient, though inefficient. I have not thought through how this would work, however. Finally, <strong>remove rare words</strong> because they can lead to local optima in the likelihood surface, probably yielding inefficient computation.</p>
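<p>To make the preprocessing steps concrete, here is a minimal sketch in plain Python. The stopword list, the phrase list, the <code>min_count</code> threshold and the toy corpus are all my own assumptions for illustration, not anything Blei presented. Stemming is left out on purpose, and a real pipeline would use a much larger stopword list and a learned phrase list.</p>

```python
from collections import Counter

# Hypothetical inputs: a tiny stopword list and a hand-labeled phrase list.
STOPWORDS = {"the", "a", "of", "is", "in", "and"}
PHRASES = {("black", "hole")}

def preprocess(docs, min_count=2):
    """Tokenize, recode known phrases, then drop stopwords and rare words."""
    tokenized = []
    for doc in docs:
        words = doc.lower().split()
        merged, i = [], 0
        while i < len(words):
            pair = tuple(words[i:i + 2])
            if pair in PHRASES:
                # Recode the multiword phrase as one token: black_hole
                merged.append("_".join(pair))
                i += 2
            else:
                merged.append(words[i])
                i += 1
        tokenized.append([w for w in merged if w not in STOPWORDS])
    # Count each token over the whole corpus and drop the rare ones.
    counts = Counter(w for doc in tokenized for w in doc)
    return [[w for w in doc if counts[w] >= min_count] for doc in tokenized]

docs = [
    "the black hole is in the galaxy",
    "a galaxy and a black hole",
    "the quasar of the galaxy",
]
cleaned = preprocess(docs)
for doc in cleaned:
    print(doc)
```

<p>Phrases are recoded before stopwords are removed so that a phrase containing a stopword would still be matched; that ordering is a design choice of this sketch.</p>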
<p><em>Some hairy details. </em>One of the parameters that makes LDA useful is &alpha;. &alpha; is a hyperparameter in the LDA model that determines the sparsity of draws from the underlying <a href="http://en.wikipedia.org/wiki/Dirichlet_distribution">Dirichlet distribution</a>. &alpha; is typically a small number; Blei mentioned that 0.01 is a good a priori value for &alpha;. As &alpha; gets larger, the distribution of topics tends towards the uniform (each topic equally likely) distribution and as &alpha; approaches 0, we get sparser draws, meaning more peaked topic probabilities. Setting &alpha; to be ridiculously small (e.g. 0.001) may yield a single topic dominating the model. &alpha; can be chosen, or we can fit &alpha; to the data using cross-validation or some other method. He also discussed the parameter &eta;.</p>
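<p>The effect of &alpha; on sparsity is easy to see by simulation. The sketch below is my own illustration, not from the tutorial: it draws from a symmetric Dirichlet by normalizing independent Gamma draws (a standard construction) and reports the average share of the largest topic. The number of topics, the number of draws and the function names are arbitrary choices.</p>

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k topics."""
    while True:
        # Normalizing independent Gamma(alpha, 1) draws gives a Dirichlet draw.
        gammas = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
        total = sum(gammas)
        if total > 0.0:  # guard against every draw underflowing to zero
            return [g / total for g in gammas]

def mean_largest_share(alpha, k=10, draws=2000, seed=0):
    """Average probability of the most probable topic across many draws."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(draws)) / draws

for alpha in (0.01, 1.0, 100.0):
    print(alpha, round(mean_largest_share(alpha), 3))
```

<p>With a small &alpha; the largest share should sit close to 1 (one topic dominates each draw), and with a large &alpha; it should approach 1/k, the uniform case.</p>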
<p><em>Open source software. </em>We quickly (flash of an eye) went through a list of some open-source LDA implementations:</p>
<ul>
<li><a href="http://www.cs.princeton.edu/~blei/lda-c/">LDA-C</a> (variational EM), <em>Blei.</em></li>
<li><a href="http://www.cs.princeton.edu/~blei/topicmodeling.html">HDP</a> (hierarchical Dirichlet processes), <em>Blei</em></li>
<li><a href="http://cran.r-project.org/web/packages/lda/">LDA</a> (R package, collapsed Gibbs), <em>Jonathan Chang, </em>Data Scientist, Facebook.</li>
<li><a href="http://alias-i.com/lingpipe/">Lingpipe</a>, <em>alias-i</em></li>
<li><a href="http://mallet.cs.umass.edu/">Mallet</a> (collapsed Gibbs), <em>UMass</em></li>
<li><a href="http://nlp.fi.muni.cz/projekty/gensim/">Gensim</a> (online and batch LDA),  <em>Radim Řehůřek</em></li>
</ul>
<p>To my delight, Blei seemed to favor the R package (although Gensim is a nice Python implementation). The R package not only contains LDA, but several other models including RTMs, MMSB and sLDA which will be discussed later. It is supposedly fast as well. The output from the R package can be visualized using the Topic Model Visualizer by Allison Chaney.</p>
<p>The beauty of LDA is that it can be embedded in many more complicated models. Some applications of these extensions include word sense, graphs and hierarchies. Before delving into specifics, there are a couple of changes to the LDA model that motivate the next topics.</p>
<ol>
<li><span style="font-family: arial,helvetica,sans-serif;">The probability of observing word <em>w</em> given a set of topics &beta; and a set of topic labels <em>z</em> is given by <em>P(w|&beta;,z)</em>, which is <a href="http://en.wikipedia.org/wiki/Multinomial_distribution">multinomial</a>. The distribution of <em>P(w|&beta;,z)</em> can be changed depending on what we are modeling. For example, for count data, <em>P(w|&beta;,z)</em> can be <a href="http://en.wikipedia.org/wiki/Poisson_distribution">Poisson</a>. This drastically changes the model, however. In LDA, <em>P(w|&beta;,z)</em> is multinomial, which is convenient because the Dirichlet distribution is the <a href="http://en.wikipedia.org/wiki/Conjugate_prior">conjugate prior</a> of the multinomial.</span></li>
<li><span style="font-family: symbol;"><span style="font-family: arial,helvetica,sans-serif;">The characteristic LDA posterior distribution can be used in more creative ways&#8230;</span></span></li>
</ol>
<p><em><a href="http://projecteuclid.org/DPubS?service=UI&amp;version=1.0&amp;verb=Display&amp;handle=euclid.aoas/1183143727">Correlated Topic Model</a>. </em>In LDA, all topics are considered independent of each other, and this is usually unrealistic. CTM allows the topics to be correlated. For example, a paper classified as about calculus is more likely to also be classified as about physics, than it is to be classified as about sewing. Blei mentioned that CTM allows for better prediction, likely because it is more realistic. CTM is also more robust to overfitting. The main distinction from LDA is that <span style="font-family: symbol;"><span style="font-family: arial,helvetica,sans-serif;"><em>ϴ </em>follows the logistic normal distribution instead of the Dirichlet distribution.</span></span></p>
<p><span style="font-family: symbol;"><span style="font-family: arial,helvetica,sans-serif;"><em><a href="http://dl.acm.org/citation.cfm?id=1143859">Dynamic Topic Model</a>. </em>DTM models how each individual topic changes over time. One example Blei showed involved a topic that could be labeled &#8220;technology&#8221;. In the late 1700s, this topic contained the words &#8220;coal&#8221; and &#8220;steel&#8221; (I am making this up from memory&#8230;probably badly&#8230;bear with me) and in 2011 contained the words &#8220;silicon&#8221; and &#8220;solar&#8221;. The main distinction from LDA is two-fold. The topic at time <em>t </em>is assumed to be normally distributed, with the topic at time <em>t-1 </em>as the mean and some variance. That is,</span></span></p>
<img src='http://s.wordpress.com/latex.php?latex=%5Cbeta_%7Bt%2Ck%7D%20%5Cvert%20%5Cbeta_%7Bt-1%2Ck%7D%20%5Csim%20N%28%5Cbeta_%7Bt-1%2Ck%7D%2C%20I%5Csigma%5E2%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\beta_{t,k} \vert \beta_{t-1,k} \sim N(\beta_{t-1,k}, I\sigma^2)' title='\beta_{t,k} \vert \beta_{t-1,k} \sim N(\beta_{t-1,k}, I\sigma^2)' class='latex' />
<p><span style="font-family: symbol;"><span style="font-family: arial,helvetica,sans-serif;">and </span></span><span style="font-family: symbol;"><span style="font-family: arial,helvetica,sans-serif;"><em></em></span></span></p>
<img src='http://s.wordpress.com/latex.php?latex=P%28w%20%5Cvert%20%5Cbeta_%7Bt%2Ck%7D%29%20%5Cpropto%20%5Cexp%5Cbeta_%7Bt%2Ck%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='P(w \vert \beta_{t,k}) \propto \exp\beta_{t,k}' title='P(w \vert \beta_{t,k}) \propto \exp\beta_{t,k}' class='latex' />
<p><span style="font-family: symbol;"><span style="font-family: arial,helvetica,sans-serif;">instead of multinomial.</span></span></p>
<p><span style="font-family: symbol;"><span style="font-family: arial,helvetica,sans-serif;">A limitation of DTM is that it does not handle the death of a topic gracefully. </span></span></p>
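<p>The two DTM ingredients above (a Gaussian random walk on the topic parameters, and word probabilities proportional to their exponential) can be sketched in a few lines of Python. This is only a toy forward simulation under my own assumptions: the four-word vocabulary, the starting values and &sigma; are made up, and no inference is performed.</p>

```python
import math
import random

def softmax(values):
    """Map unconstrained real values to probabilities that sum to 1."""
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def evolve_topic(beta0, steps, sigma, seed=0):
    """Random-walk one topic's parameters; return word probabilities per step."""
    rng = random.Random(seed)
    beta = list(beta0)
    history = [softmax(beta)]
    for _ in range(steps):
        # beta_t | beta_{t-1} ~ N(beta_{t-1}, sigma^2 I)
        beta = [b + rng.gauss(0.0, sigma) for b in beta]
        # P(w | beta_t) is proportional to exp(beta_t)
        history.append(softmax(beta))
    return history

vocab = ["coal", "steel", "silicon", "solar"]  # toy vocabulary
history = evolve_topic([2.0, 2.0, 0.0, 0.0], steps=5, sigma=0.5)
for t, probs in enumerate(history):
    print(t, dict(zip(vocab, (round(p, 2) for p in probs))))
```

<p>Because the parameters drift a little at every step, the words that dominate the topic can change gradually over time, which is the behavior the &#8220;technology&#8221; example describes.</p>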
<p><span style="font-family: symbol;"><span style="font-family: arial,helvetica,sans-serif;"><a href="http://arxiv.org/abs/1003.0783"><em>Supervised LDA</em></a>. In sLDA, we associate each document with an external variable. For example, a document may be a Yelp review containing text. The external variable associated with the Yelp review may be the number of stars in the associated rating. sLDA uses the estimated topics as regressors to predict this external variable <em>Y</em>. Various types of regression can be performed, from standard linear regression to the <a href="http://en.wikipedia.org/wiki/Generalized_linear_model">generalized linear model (GLM)</a>. The Yelp example would likely use an <a href="http://en.wikipedia.org/wiki/Ordered_logit">ordered logit</a> model for <em>Y</em>.<br />
</span></span></p>
<p><span style="font-family: symbol;"><span style="font-family: arial,helvetica,sans-serif;"><em><a href="https://www.cs.princeton.edu/~blei/papers/ChangBlei2009.pdf">Relational Topic Models</a>. </em>RTM applies sLDA to every pair of documents in a corpus and attempts to use content to predict connectedness in a graph. For example, given the content on my Facebook profile, one could use sLDA to predict what kind of reaction I would have to an ad (i.e. click or no click) and this could be used for targeted ad serving, or any other type of recommendation engine. Think <a href="http://en.wikipedia.org/wiki/Collaborative_filtering">collaborative filtering</a>! RTM is also good for certain types of data that have spatial/geographic dependencies. </span></span></p>
<p><span style="font-family: symbol;"><span style="font-family: arial,helvetica,sans-serif;"><em><a href="http://www.cs.umass.edu/~wallach/workshops/nips2010css/papers/gerrish.pdf">Ideal Point Topic Models</a> </em>were barely touched upon since we were running short on time (although we voted to extend the session by 30 mins and Blei happily obliged). They seem particularly useful in political science for predicting roll call votes.</span></span></p>
<p><span style="font-family: symbol;"><span style="font-family: arial,helvetica,sans-serif;"><em>Bayesian Non-Parametric Models</em> are a hot topic but are too complicated to describe here. In LDA, the number of topics is determined a priori and remains fixed throughout the model. In real life, topics can be &#8220;born&#8221; and can &#8220;die&#8221; off and we may not know a priori how many topics to use. One can model the latter situation as a <a href="http://en.wikipedia.org/wiki/Chinese_restaurant_process"><em>Chinese Restaurant Process</em></a> where each table is associated with a topic. Furthermore, a <em>Chinese Restaurant Franchise </em>can be used for modeling hierarchies (hLDA). In CRF, there is a corpus level restaurant where each table is a parameter and a topic (called plates). Then, each document has its own Chinese restaurant where each table is associated with a customer in the corpus level Chinese restaurant. <a href="http://www.amazon.com/Nonparametrics-Cambridge-Statistical-Probabilistic-Mathematics/dp/0521513464/ref=sr_1_2?ie=UTF8&amp;qid=1314031811&amp;sr=8-2">Blei recommended a book by Hjort</a>.</span></span></p>
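<p>For intuition about how the number of topics can grow with the data, here is a small simulation of the basic Chinese Restaurant Process, with each table standing in for a topic. This is my own sketch, not Blei&#8217;s code: the function name and the concentration value are assumptions, and it covers only the single-restaurant process, not the franchise.</p>

```python
import random

def chinese_restaurant_process(n_customers, concentration, seed=0):
    """Seat customers one at a time; return each customer's table and the table sizes."""
    rng = random.Random(seed)
    table_sizes = []
    seating = []
    for n in range(n_customers):
        # With n customers already seated, an existing table k is chosen with
        # probability size_k / (n + concentration) and a new table is opened
        # with probability concentration / (n + concentration).
        threshold = rng.uniform(0.0, n + concentration)
        cumulative = 0.0
        choice = len(table_sizes)  # default: open a new table
        for k, size in enumerate(table_sizes):
            cumulative += size
            if threshold < cumulative:
                choice = k
                break
        if choice == len(table_sizes):
            table_sizes.append(0)
        table_sizes[choice] += 1
        seating.append(choice)
    return seating, table_sizes

seating, table_sizes = chinese_restaurant_process(100, concentration=1.0)
print(len(table_sizes), "tables with sizes", table_sizes)
```

<p>Popular tables attract more customers (rich get richer), yet there is always some chance of opening a new table, so the number of topics is not fixed in advance.</p>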
<p><span style="font-family: symbol;"><span style="font-family: arial,helvetica,sans-serif;"><strong>Algorithms. </strong>The last few minutes were dedicated to discussing inference algorithms for LDA, particularly Gibbs sampling and variational Bayes. Gibbs sampling is very simple to implement, though Blei stated that it does not work for DTM or CTM because the assumptions of conjugacy (multinomial/Dirichlet) are violated. Variational Bayes is more difficult to implement, but handles non-conjugacy in CTM and DTM much better.</span></span></p>
<p><span style="font-family: symbol;"><span style="font-family: arial,helvetica,sans-serif;"><strong>Plenary Sessions</strong></span></span></p>
<p><span style="font-family: symbol;"><span style="font-family: arial,helvetica,sans-serif;">The plenary sessions consisted of several thank-yous and awards. The committee provided some humor which gave some humility to the long process of writing and submitting papers. They went over paper acceptance statistics and read some of the funnier comments that reviewers gave, one of which was something like &#8220;It is clear that the author did not read this paper before submitting it.&#8221; I don&#8217;t know how many times I have said that in various situations. The committee handed out awards for best paper and best dissertation. This year&#8217;s <a href="http://www.kdd.org/kdd2011/kddcup.shtml">KDD Cup competition was a contest similar to the Netflix challenge</a>, but involved music recommendation. The winner was the <a href="http://www.ntu.edu.tw/engv4/">National Taiwan University</a>, for the fourth year in a row I am told. The innovation award went to a researcher dear to my heart, <a href="http://en.wikipedia.org/wiki/Ross_Quinlan">Ross Quinlan</a>, who developed the <a href="http://www.rulequest.com/Personal/">C4.5 decision tree modeling software</a>.<strong><br />
</strong></span></span></p>
<p><em>For more information about topic modeling software, see David Blei&#8217;s website at <a href="http://www.cs.princeton.edu/~blei">http://www.cs.princeton.edu/~blei</a> which contains code for most if not all of these topic models. For the notes from the tutorial, see <a href="http://www.cs.princeton.edu/~blei/kdd-tutorial.pdf">http://www.cs.princeton.edu/~blei/kdd-tutorial.pdf</a>.</em></p>
<div class="shr-publisher-877"></div><!-- Start Shareaholic LikeButtonSetBottom Automatic --><div style="clear: both; min-height: 1px; height: 3px; width: 100%;"></div><div class='shareaholic-like-buttonset' style='float:none;height:30px;'><a class='shareaholic-googleplusone' data-shr_size='standard' data-shr_count='true' data-shr_href='http%3A%2F%2Fwww.bytemining.com%2F2011%2F08%2Fsigkdd-2011-conference-day-1-graph-mining-and-david-bleitopic-models%2F' data-shr_title='SIGKDD+2011+Conference+--+Day+1+%28Graph+Mining+and+David+Blei%2FTopic+Models%29'></a></div><div style="clear: both; min-height: 1px; height: 3px; width: 100%;"></div><!-- End Shareaholic LikeButtonSetBottom Automatic -->]]></content:encoded>
			<wfw:commentRss>http://www.bytemining.com/2011/08/sigkdd-2011-conference-day-1-graph-mining-and-david-bleitopic-models/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Hadoop Fatigue &#8212; Alternatives to Hadoop</title>
		<link>http://www.bytemining.com/2011/08/hadoop-fatigue-alternatives-to-hadoop/</link>
		<comments>http://www.bytemining.com/2011/08/hadoop-fatigue-alternatives-to-hadoop/#comments</comments>
		<pubDate>Tue, 16 Aug 2011 17:30:00 +0000</pubDate>
		<dc:creator>Ryan</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Amazon EC2]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Databases and Datastores]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[MapReduce]]></category>

		<guid isPermaLink="false">http://www.bytemining.com/?p=794</guid>
		<description><![CDATA[<p>It&#8217;s been a while since I have posted&#8230; in the midst of trying to plow through this dissertation while working on papers for submission to some conferences.</p>
<p></p>
<p>Hadoop has become the de facto standard in the research and industry uses of small and large-scale MapReduce. Since its inception, an entire ecosystem has been built around it including conferences (Hadoop World, Hadoop Summit), books, training, and commercial distributions (Cloudera, Hortonworks, MapR) with support. Several projects that integrate with Hadoop have been released from the Apache incubator and are designed for certain use cases:</p>

Pig, developed at Yahoo, is a high-level scripting language for working with big data and Hive is a SQL-like query language for big data in a warehouse configuration.
HBase, originally developed at Powerset, is a column-oriented database often used as a datastore on which MapReduce jobs can be executed.
ZooKeeper and Chukwa 
Mahout is a library for scalable machine learning, part of which can use Hadoop.
Cascading (Chris Wensel), Oozie (Yahoo) and Azkaban (LinkedIn) provide MapReduce job workflows and scheduling.

<p>Hadoop is meant to be modeled after Google MapReduce. To store and process huge amounts of data, we typically need several machines in some cluster configuration. A distributed filesystem (HDFS for Hadoop) uses space across [...]]]></description>
			<content:encoded><![CDATA[<!-- Start Shareaholic LikeButtonSetTop Automatic --><!-- End Shareaholic LikeButtonSetTop Automatic --><p><em>It&#8217;s been a while since I have posted&#8230; in the midst of trying to plow through this dissertation while working on papers for submission to some conferences.</em></p>
<p><em></em><img src="http://www.bytemining.com/wp-content/uploads/2011/06/hadoop.png" alt="" width="335" height="93" /></p>
<p>Hadoop has become the de facto standard in the research and industry uses of small and large-scale MapReduce. Since its inception, an entire ecosystem has been built around it including conferences (<a href="http://www.hadoopworld.com/">Hadoop World</a>, <a href="http://developer.yahoo.com/events/hadoopsummit2011/">Hadoop Summit</a>), books, training, and commercial distributions (<a href="http://www.cloudera.com">Cloudera</a>, <a href="http://www.hortonworks.com">Hortonworks</a>, <a href="http://www.mapr.com">MapR</a>) with support. Several projects that integrate with Hadoop have been released from the <a href="http://incubator.apache.org/">Apache incubator</a> and are designed for certain use cases:</p>
<ul>
<li><a href="http://pig.apache.org/">Pig</a>, developed at Yahoo, is a high-level scripting language for working with big data and <a href="http://hive.apache.org/">Hive</a> is a SQL-like query language for big data in a warehouse configuration.</li>
<li><a href="http://hbase.apache.org/">HBase</a>, originally developed at Powerset, is a column-oriented database often used as a datastore on which MapReduce jobs can be executed.</li>
<li>ZooKeeper (distributed coordination) and Chukwa (log collection and monitoring)</li>
<li><a href="http://mahout.apache.org/">Mahout</a> is a library for scalable machine learning, part of which can use Hadoop.</li>
<li><a href="http://www.cascading.org/">Cascading</a> (Chris Wensel), <a href="http://yahoo.github.com/oozie/">Oozie</a> (Yahoo) and <a href="http://sna-projects.com/azkaban/">Azkaban</a> (LinkedIn) provide MapReduce job workflows and scheduling.</li>
</ul>
<p>Hadoop is modeled after <a href="http://labs.google.com/papers/mapreduce.html">Google MapReduce</a>. To store and process huge amounts of data, we typically need several machines in some cluster configuration. A distributed filesystem (<a href="http://hadoop.apache.org/hdfs/">HDFS</a> for Hadoop) uses space across a cluster to store data so that it appears to be in a contiguous volume and provides redundancy to prevent data loss. The distributed filesystem also allows data collectors to dump data into HDFS so that it is already primed for use with MapReduce. A Data Scientist or Software Engineer then writes a Hadoop MapReduce job.</p>
<p><em>As a review</em>, the Hadoop job consists of two main steps, a map step and a reduce step. There may optionally be other steps before the map phase or between the map and reduce phases. The map step reads in a bunch of data, does something to it, and emits a series of key-value pairs. One can think of the map phase as a partitioner. In text mining, the map phase is where most parsing and cleaning is performed. The output of the mappers is sorted and then fed into a series of reducers. The reduce step takes the key-value pairs and computes some aggregate (reduced) set of data such as a sum, average, etc. The <a href="http://wiki.apache.org/hadoop/WordCount">trivial word count exercise</a> starts with a map phase where text is parsed and a key-value pair is emitted: a word, followed by the number &#8220;1&#8221; indicating that the key-value pair represents 1 instance of the word. The user might also emit something to coerce Hadoop into passing data into different reducers. The words and 1s are sorted and passed to the reducers. The reducers take like key-value pairs and compute the number of times the word appears in the original input.</p>
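<p>To make the word count flow concrete, here is a minimal in-memory sketch in Python. It is not Hadoop code: the function names (<code>map_phase</code>, <code>shuffle</code>, <code>reduce_phase</code>, <code>word_count</code>) are my own for illustration, and a plain sort stands in for the distributed sort that Hadoop performs between the two phases.</p>

```python
from collections import defaultdict


def map_phase(line):
    # Emit one (word, 1) pair per token, as in the word count exercise.
    for word in line.lower().split():
        yield (word, 1)


def shuffle(pairs):
    # Stand-in for the sort between map and reduce:
    # group the values by key and return the groups in key order.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(key, values):
    # Aggregate all of the 1s emitted for a single word.
    return (key, sum(values))


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return [reduce_phase(key, values) for key, values in shuffle(pairs)]
```

<p>Calling <code>word_count</code> on a list of strings returns a list of (word, count) tuples in alphabetical order. Everything runs in one process here, so it only illustrates the data flow, not the distribution.</p>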
<p>After working extensively with (Vanilla) Hadoop professionally for the past 6 months, and at home for research, I have found several nagging issues with Hadoop that have convinced me to look elsewhere for everyday use and certain applications. For these applications, the thought of writing a Hadoop job makes me take a deep breath. <strong>Before I continue, I will say that I still love Hadoop and the community.</strong></p>
<ul>
<li>Writing Hadoop jobs in Java is very time consuming because <em>everything</em> must be a class, and many times these classes extend several other classes or implement multiple interfaces; the Java API is very bloated. Adding a simple counter to a Hadoop job becomes a chore of its own.</li>
<li>Documentation for the bloated Java API is sufficient, but not the most helpful.</li>
<li>HDFS is complicated and has plenty of issues of its own. I recently heard a story about data loss in HDFS just because the IP address block used by the cluster changed.</li>
<li>Debugging a failure is a nightmare; is it the code itself? Is it a configuration parameter? Is it the cluster or one/several machines on the cluster? Is it the filesystem or disk itself? Who knows?!</li>
<li>Logging is verbose to the point that finding errors is like finding a needle in a haystack. That is, if you are even lucky to have an error recorded! I&#8217;ve had plenty of instances where jobs fail and there is absolutely nothing in the stdout or stderr logs.</li>
<li>Large clusters require a dedicated team to keep them running properly, but that is not surprising.</li>
<li>Writing a Hadoop job becomes a software engineering task rather than a data analysis task.</li>
</ul>
<p>Hadoop will be around for a long time, and for good reason. MapReduce cannot solve every problem (fact), and Hadoop can solve even fewer problems (opinion?). After dealing with some of the innards of Hadoop, I&#8217;ve often said to myself &#8220;there must be a better way.&#8221; For large corporations that routinely crunch large amounts of data using MapReduce, Hadoop is still a great choice. For research, experimentation, and everyday data munging, one of these other frameworks may be better if the advantages of HDFS are not necessarily imperative:</p>
<p><strong>BashReduce</strong></p>
<p>Unlike Hadoop, <a href="https://github.com/erikfrey/bashreduce">BashReduce</a> is just a script! BashReduce implements MapReduce for standard Unix commands such as sort, awk, grep, join etc. It supports mapping/partitioning, reducing, and merging. The developers note that BashReduce &#8220;sort of&#8221; handles task coordination and a distributed file system. In my opinion, these are strengths rather than weaknesses. There is actually no task coordination as a master process simply fires off jobs and data. There is also no distributed file system at all, but BashReduce will distribute files to worker machines. Of course, without a distributed file system there is a lack of fault-tolerance among other things.</p>
<p>Intermachine communication is facilitated with simple passwordless SSH, but there is a large cost associated with transferring files from a master machine to its workers whereas with Hadoop, data is stored centrally in HDFS. Additionally, partition/merge in the standard unix tools is not optimized for this use case, thus the developer had to use a few additional C programs to speed up the process.</p>
<p>Compared to Hadoop, there is less complexity and faster development. The trade-off is a lack of fault tolerance and a lack of flexibility, as BashReduce only works with certain Unix commands. Unlike Hadoop, BashReduce is more of a tool than a full system for MapReduce. BashReduce was developed by <a href="http://fawx.com/">Erik Frey</a> et al. of <a href="http://www.last.fm/">last.fm</a>.</p>
<p><strong>Disco Project</strong></p>
<p><a href="http://discoproject.org/">Disco</a> was initially developed by Nokia Research and has been around silently for a few years. Developers write MapReduce jobs in simple, beautiful Python. Disco&#8217;s backend is written in <a href="http://www.erlang.org">Erlang</a>, a scalable functional language with built-in support for concurrency, fault tolerance and distribution &#8212; perfect for a MapReduce system! Similar to Hadoop, Disco distributes and replicates data, but it does not use its own file system. Disco also has efficient job scheduling features.</p>
<p>It seems that Disco is a pretty standard and powerful MapReduce implementation that removes some of the painful aspects of Hadoop. It likely also gives up persistent fault tolerance, since it relies on a standard filesystem rather than one like HDFS, although Erlang may provide a &#8220;good enough&#8221; level of fault tolerance for data.</p>
<p><strong>Spark</strong></p>
<p><a href="http://www.spark-project.org/">Spark</a> is one of the newest players in the MapReduce field. Its purpose is to make data analytics fast to write, and fast to run. Unlike many MapReduce systems, Spark allows <em>in-memory</em> querying of data (even distributed across machines) rather than using disk I/O. It is of no surprise then that Spark out-performs Hadoop on many iterative algorithms. Spark is implemented in <a href="http://www.scala-lang.org/">Scala</a>, a functional object-oriented language that sits on top of the JVM. Similar to other languages like Python, Ruby, and Clojure, Scala has an interactive prompt and users can use Spark to query big data straight from the Scala interpreter.</p>
<p>One wrinkle is that Spark requires installing a cluster manager called <a href="http://www.mesosproject.org/">Mesos</a>. I had some difficulty installing it on Ubuntu, but the development team was an amazing help, and made a few changes to the source and now it runs well. On the downside, Mesos adds a layer of complexity that we are trying to avoid. On the upside, Mesos allows Spark to co-exist with Hadoop and it can read any data source that Hadoop supports, and it &#8220;feels&#8221; light, similar to Disco&#8217;s server UI.</p>
<p>Spark was developed by the <a href="http://amplab.cs.berkeley.edu/">UC Berkeley AMP Lab</a>. Currently, its main users are UC Berkeley researchers and <a href="http://www.conviva.com/">Conviva</a>. <a href="http://www.bytemining.com/2011/06/my-review-of-hadoop-summit-2011-hadoopsummit/">Hadoop Summit 2011 featured a talk on Spark by one of the developers, which I wrote about earlier this summer</a>.</p>
<p><strong>GraphLab</strong></p>
<p><a href="http://graphlab.org/">GraphLab</a> was developed at <a href="http://www.cmu.edu">Carnegie Mellon</a> and is designed for use in machine learning. GraphLab&#8217;s goal is to make the design and implementation of efficient and correct parallel machine learning algorithms easier. Their website states that paradigms like MapReduce lack expressiveness while lower level tools such as MPI present overhead by requiring the researcher to write code that beats a dead horse.</p>
<p>GraphLab has its own version of the map stage, called the <em>update</em> phase. Unlike MapReduce, the update phase can both read <em>and</em> modify <em>overlapping</em> sets of data. Recall that MapReduce requires data to be <em>partitioned</em>. GraphLab accomplishes this by allowing the user to specify data as a graph where each vertex and edge in the graph is associated with memory. The update phases can be chained in such a way that one update function can recursively trigger other update functions that operate on vertices in the graph. This graph-based approach would not only make machine learning on graphs more tractable, but it also improves dynamic iterative algorithms.</p>
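<p>The idea of an update function that reads an overlapping neighborhood and triggers further updates can be sketched in a few lines of Python. This is a toy of my own, not the GraphLab API (which is C++): the names <code>min_label_update</code> and <code>run_updates</code> are hypothetical, the scheduler is sequential, and the example task (spreading the smallest label through each connected component) was picked only because it is short.</p>

```python
from collections import deque


def min_label_update(vertex, graph, labels):
    # Read the overlapping neighborhood (this vertex plus its neighbors)
    # and write the smallest label seen back to this vertex.
    best = min([labels[vertex]] + [labels[n] for n in graph[vertex]])
    if best == labels[vertex]:
        return []                  # nothing changed, nothing to reschedule
    labels[vertex] = best
    return list(graph[vertex])     # changed: trigger updates on the neighbors


def run_updates(graph, labels, update):
    # Minimal sequential scheduler: keep running update functions
    # until no vertex asks for more work.
    pending = deque(graph)
    while pending:
        vertex = pending.popleft()
        pending.extend(update(vertex, graph, labels))
    return labels
```

<p>Here <code>graph</code> is a dict mapping each vertex to a list of its neighbors and <code>labels</code> is a dict of per-vertex data. Labels only ever decrease, so the loop terminates. A real system would run many updates in parallel and worry about consistency between overlapping neighborhoods, which this sketch ignores entirely.</p>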
<p>GraphLab also has its own version of the reduce stage, called the <em>sync operation. </em>The results of the sync operation are <em>global </em>and can be used by all vertices in the graph. In MapReduce, output from the reducers is local (until committed) and there is a strict data barrier among reducers. The sync operations are performed at time intervals, and there is not as strong of a tie between the update and sync phases. What I mean is that the sync intervals are not necessarily dependent on some prior update completing.</p>
<p>GraphLab&#8217;s website also contains the original <a href="http://www.select.cs.cmu.edu/publications/scripts/papers.cgi?Low+al:uai10graphlab">UAI paper</a> and <a href="http://graphlab.org/uai2010_graphlab.pptx">presentation</a>, a <a href="http://graphlab.org/abstractiononly.pdf">document better explaining the abstraction</a>, and there is even a <a href="http://groups.google.com/group/graphlabapi">Google Group for the GraphLab API</a>. To me, GraphLab seems like a very powerful generalization, and re-specification, of MapReduce.</p>
<p><strong>Storm</strong></p>
<p>Recently, Nathan Marz of BackType made waves in the Twitter big data community with a blog post titled <a href="http://tech.backtype.com/preview-of-storm-the-hadoop-of-realtime-proce"><em>Preview of Storm: The Hadoop of Realtime Processing</em></a>. Within a day, Storm became known as &#8220;Real-time Hadoop&#8221; to the chagrin of some developers from Apache. Hadoop is a batch-processing system &#8212; that is, give it a lot of fixed data and it does something with it. Storm is real-time &#8212; it processes data in parallel as it streams.</p>
<p>Marz writes that with their previous system, much time was spent worrying about graphs of queues and workers: where to send and receive messages, deploying workers and queues, and a lack of fault tolerance. Storm abstracts all of these complications away. Storm is written in Clojure, but any programming language can be used to write programs on top of Storm. Storm is fault-tolerant, horizontally scalable, and reliable. Storm is also very fast, with ZeroMQ used as the underlying message passing system.</p>
<p>Nathan Marz is a software developer at <a href="http://www.backtype.com">BackType</a>, and made waves in 2010 with <a href="https://github.com/nathanmarz/cascalog">Cascalog</a>. Cascalog really took off after his presentation at the 2010 Hadoop Summit, and I am delighted I got to see him present it. Storm will be open-sourced soon and I hope to write more about it later.</p>
<p>I included Storm in this post based on its colloquial name &#8220;Real-time Hadoop&#8221; &#8212; it is not clear to me whether or not Storm even uses MapReduce though.</p>
<p><strong>HPCC Systems (from LexisNexis)</strong></p>
<p>Perhaps the project with the least flattering name comes from <a href="http://www.lexisnexis.com/">LexisNexis</a>, which has developed its own framework for massive data analytics. <a href="http://hpccsystems.com/">HPCC</a> attempts to make writing parallel-processing workflows easier by using Enterprise Control Language (ECL), a declarative, data-centric language. I should note that SQL, Datalog and Pig are also said to be declarative, data-centric languages. As a matter of fact, the development team has a converter for translating Pig jobs to ECL. HPCC is written in C++. Some have commented that this will make in-memory querying much faster because it avoids the bloated object sizes that come with the JVM. I also prefer C++ simply because it feels closer to human thought: we think in terms of objects (object-oriented) at times, and a series of steps (procedural) at other times, and use both thought processes together.</p>
<p>HPCC already has its own jungle of technologies like Hadoop. HPCC has two &#8220;systems&#8221; for processing and serving data: the Thor Data Refinery Cluster, and the Roxie Rapid Data Delivery Cluster. Thor is a data processor, like Hadoop. Roxie is similar to a data warehouse (like HBase) and supports transactions. HPCC uses a distributed file system.</p>
<p>Although the details, like the system itself, are still preliminary, HPCC certainly has the &#8220;feel&#8221; of a potentially solid alternative to Hadoop, but only time will tell.</p>
<p><strong>With all these alternatives, why use Hadoop?</strong></p>
<p>One word: HDFS. For a moment, assume you could bring all of your files and data with you everywhere you go. No matter what system, or type of system, you login to, your data is intact waiting for you. Suppose you find a cool picture on the Internet. You save it directly to your file store and it goes everywhere you go. HDFS gives users the ability to dump very large datasets (usually log files) to this distributed filesystem and easily access it with tools, namely Hadoop. Not only does HDFS store a large amount of data, it is fault tolerant. Losing a disk, or a machine, typically does not spell disaster for your data. HDFS has become a reliable way to store data and share it with other open-source data analysis tools. <strong>Spark can read data from HDFS</strong>, but if you would rather stick with Hadoop, you can try to spice it up:</p>
<p><strong>Hadoop Streaming </strong>is an easy way to avoid the monolith of Vanilla Hadoop without leaving HDFS, and allows the user to write map and reduce functions in any language that supports writing to stdout, and reading from stdin. Choosing a simple language such as Python for Streaming allows the user to focus more on writing code that processes data rather than software engineering. Once code is written, it is easy to test from the command line:</p>
<pre>cat a_bunch_of_files | ./mapper.py | sort | ./reducer.py</pre>
<p>And, running and monitoring the job is similar to Vanilla Hadoop. Hadoop Streaming was my first introduction to Hadoop and it was quite pleasant.</p>
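<p>As a rough sketch of what <code>mapper.py</code> and <code>reducer.py</code> in the pipeline above might contain for word count, here are both roles in one hypothetical Python file. The tab-separated record format and the <code>map</code>/<code>reduce</code> command-line argument are my own choices for the example; the only thing Streaming needs is that each stage reads lines from stdin and writes lines to stdout, and that the reducer sees its input sorted by key.</p>

```python
import sys
from itertools import groupby


def mapper(lines):
    # Emit one tab-separated "word, 1" record per token.
    for line in lines:
        for word in line.split():
            yield word + "\t1"


def reducer(sorted_records):
    # Input must already be sorted by key, which `sort` does on the
    # command line and Hadoop does between the map and reduce phases.
    parsed = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(parsed, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield word + "\t" + str(total)


# Hypothetical command-line wiring: run the file with "map" or "reduce".
if __name__ == "__main__" and sys.argv[1:] in (["map"], ["reduce"]):
    stage = mapper if sys.argv[1] == "map" else reducer
    for record in stage(sys.stdin):
        print(record)
```

<p>Saved under a name of your choosing, the same file can be dropped into the test pipeline twice, once with <code>map</code> and once with <code>reduce</code>, with <code>sort</code> in between.</p>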
<p><strong>Or, you could use a Hadoopified project that better solves the problem. </strong>Vanilla Hadoop can do some sophisticated stuff, but it suffers the problems I mentioned at the beginning of the post. Developers have created software that works on HDFS, but is geared toward different audiences. A Data Scientist may prefer Pig or Hive for data analysis whereas a Systems and Software Engineer may prefer a workflow solution (Oozie, Cascading etc.) and a (modern) DBA may want to use HBase. Each of these achieves different goals, but still relies on HDFS.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bytemining.com/2011/08/hadoop-fatigue-alternatives-to-hadoop/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>My Review of Hadoop Summit 2011 #hadoopsummit</title>
		<link>http://www.bytemining.com/2011/06/my-review-of-hadoop-summit-2011-hadoopsummit/</link>
		<comments>http://www.bytemining.com/2011/06/my-review-of-hadoop-summit-2011-hadoopsummit/#comments</comments>
		<pubDate>Thu, 30 Jun 2011 07:00:45 +0000</pubDate>
		<dc:creator>Ryan</dc:creator>
				<category><![CDATA[Amazon EC2]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Databases and Datastores]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[MongoDB]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Web Mining]]></category>

		<guid isPermaLink="false">http://www.bytemining.com/?p=856</guid>
		<description><![CDATA[





<p>I woke up early and cheery Wednesday morning to attend the 2011 Hadoop Summit in Santa Clara, after a long drive from Los Angeles and the Big Data Camp that lasted until 10pm the night before. Having been to Hadoop Summit 2010, I was interested to see how much of the content in the conference had changed.</p>
<p>This year, there were approximately 1,600 participants and the summit was moved a few feet away to the Convention Center rather than the Hyatt. Still, space and seating were pretty cramped. That just goes to show how much the Hadoop field has grown in just one year.</p>
<p>Keynotes</p>
<p>We first heard a series of keynote speeches which I will summarize. The first keynote was from Jay Rossiter, SVP of the Cloud Platform Group at Yahoo. He introduced how Hadoop is used at Yahoo, which is fitting since they organized the event. The content of his presentation was very similar to last year&#8217;s. One interesting application of Hadoop at Yahoo was for &#8220;retiling&#8221; the map of the United States. I imagine this refers to the change in aerial imagery over time. When performed by hand, retiling took 6 weeks; with Hadoop, it took 5 days. Yahoo also [...]]]></description>
			<content:encoded><![CDATA[<table border=0>
<tr>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/06/hadoop1-e1309419213413.jpg"/></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/06/hadoop2-e1309419340704.jpg"/></td>
</tr>
</table>
<p>I woke up early and cheery Wednesday morning to attend the <a href="http://developer.yahoo.com/events/hadoopsummit2011/">2011 Hadoop Summit</a> in Santa Clara, after a long drive from Los Angeles and the <a href="http://www.bytemining.com/2011/06/big-data-camp-2011-bigdatacamp/">Big Data Camp</a> that lasted until 10pm the night before. Having been to <a href="http://www.bytemining.com/2010/06/my-experience-at-hadoop-summit-2010-hadoopsummit/">Hadoop Summit 2010</a>, I was interested to see how much of the content in the conference had changed.</p>
<p>This year, there were approximately 1,600 participants and the summit was moved a few feet away to the Convention Center rather than the Hyatt. Still, space and seating were pretty cramped. That just goes to show how much the Hadoop field has grown in just one year.</p>
<p><strong>Keynotes</strong></p>
<p>We first heard a series of keynote speeches which I will summarize. The first keynote was from Jay Rossiter, SVP of the Cloud Platform Group at Yahoo. He introduced how Hadoop is used at Yahoo, which is fitting since they organized the event. The content of his presentation was very similar to last year&#8217;s. One interesting application of Hadoop at Yahoo was for &#8220;retiling&#8221; the map of the United States. I imagine this refers to the change in aerial imagery over time. When performed by hand, retiling took 6 weeks; with Hadoop, it took 5 days. Yahoo also uses Hadoop for fraud detection, spam detection, search assist, geotagging data/local indexing, ad targeting, predicting supply and demand and the aggregation and categorization of news stories. Jay also mentioned that Dapper runs models on data with Hadoop for ad personalization. <strong>Jay also mentioned that Big Data conferences all over the country are selling out.</strong></p>
<p>Eric Baldeschwieler, the CEO of <a href="http://www.hortonworks.com/">Hortonworks</a> was next. Hortonworks seems to be a new company that spun off from Yahoo. Their goal is to provide commercial support and a full Apache Hadoop platform for users. Yes, they are very similar to <a href="http://www.cloudera.com">Cloudera</a>, and yes, they are competition. (Hortonworks and MapR both did a good job of not stepping on everyone&#8217;s toes in terms of how they presented themselves.) Cloudera provides its own distribution of Hadoop, which is of course similar to the Apache version. Hortonworks&#8217; goal is to provide similar services, but with more transparency by using the Apache Hadoop distribution rather than wrapping its own. Paraphrasing Eric, Hortonworks is open-source from the ground up. A bit later, Sanjay Radia also of Hortonworks discussed Hadoop for the enterprise. Hortonworks has contributed, or is working on security (preventing users from deleting others&#8217; data), <a href="http://en.wikipedia.org/wiki/Service_level_agreement">service level agreements (SLAs)</a>, predictability and a <a href="http://hadoop.apache.org/common/docs/r0.20.2/fair_scheduler.html">Fair-Share scheduler</a>.</p>
<p>Anant Jhingran, CTO of <a href="http://www.ibm.com">IBM</a> discussed how Hadoop was used in <a href="http://www-03.ibm.com/innovation/us/watson/index.html">IBM Watson</a>. It seemed pretty obvious that Hadoop or some form of map-reduce was used in the system, but it did not seem to be highly publicized. Watson learned from 200 million pages of data, about 2-5TB and required between 3000 and 4000 Watts. Anant went quickly through a cool user interface representing a Jeopardy board and stated that the user interface to an artificial intelligence application is just as important as the application itself. He also prefers the term <a href="http://www.wisegeek.com/what-is-intelligence-augmentation-ia.htm">IA (intelligence augmentation)</a> over AI, and apparently this is a common distinction. I interpret AI vs. IA to be artificial intelligence vs. knowledge discovery (data mining).</p>
<p>Karthik Ranganathan from <a href="http://www.facebook.com">Facebook</a> discussed Facebook&#8217;s messaging system which was built on <a href="http://hbase.apache.org/">HBase</a>, <a href="http://hadoop.apache.org/hdfs/">HDFS</a> and MapReduce. Facebook sees 15 billion messages per month, excluding SMS and email, approximately 14TB of data! There are also 120 billion chat messages (25TB), for a grand total of almost 300TB per month. (I may have missed something as these numbers do not add up.) Facebook uses HBase for the bodies of small messages, metadata, and for the search index. Facebook uses HBase because of its high write throughput and easy horizontal scalability. Facebook uses another system called <a href="http://www.facebook.com/note.php?note_id=76191543919">Haystack</a> for photos, bodies of large messages and attachments. Of course, HDFS is used for fault tolerance, scalability, checksums for data integrity and its MapReduce abilities. Profiles and services are partitioned by user. Each machine has 16 cores, with 12 1TB hard disks, and 48GB RAM (24GB used for HBase). Some things that Facebook would like to contribute and improve: <a href="http://wiki.apache.org/hadoop/NameNode">NameNode</a> high availability and a second NameNode, better performance overall, and using flash memory to improve performance. Facebook often adds several columns to a table so that DevOps does not need to take the server offline to add new columns.</p>
<blockquote><p>Big Data conferences all over the country are selling out.</p></blockquote>
<p><strong>Breakout Sessions</strong></p>
<p><em>There were so many great sessions and I can only summarize the ones I attended. Check out the <a href="http://developer.yahoo.com/events/hadoopsummit2011/agenda.html">event agenda</a> for abstracts on all sessions.</em></p>
<p>First I attended <em>Web Crawl Cache &#8211; Using HBase to Manage a Copy of the Web. </em>In this talk, we learned about Yahoo&#8217;s Web Crawl Cache (WCC) that collects and organizes data from Microsoft as a result of a search deal. These snapshots of the web are not only useful for search, but also for drilling into other avenues such as local assets, influence and language corpora. WCC uses HBase for several reasons: bulk loading, efficient MapReduce jobs, random access reads, a usable consistency model, and the ease of dynamically adding columns (this seems to contradict Karthik&#8217;s claim).</p>
<p>It was very difficult to pick a session for the 1:45 to 2:15 time slot. Options included Next Generation Hadoop, Scaling out Realtime Data (Facebook) and Building Kafka (LinkedIn). I admire the work and clout that <a href="http://www.linkedin.com">LinkedIn</a> has built over the past year or two, so I attended Jay Kreps&#8217;s session. LinkedIn&#8217;s data pipeline includes a lot of tracking, logging, metrics, messages and queuing. LinkedIn attempted to use messaging systems such as <a href="http://en.wikipedia.org/wiki/Java_Message_Service">JMS</a> and <a href="http://www.rabbitmq.com/">RabbitMQ</a>. Streaming data, such as search trends, click trends and invitation social networks, is prevalent at LinkedIn. <a href="http://sna-projects.com/kafka/">Kafka</a> is LinkedIn&#8217;s solution for a distributed message queue; rather than polling for data, users subscribe to a data stream and data sources publish data to it. Kafka is 7000 lines of Scala, a functional and object-oriented language on top of the Java Virtual Machine (JVM). Kafka can produce about 250,000 messages per second (50 MB) and consume 550,000 messages per second (110 MB).</p>
<p>Next I attended another talk by Hortonworks, this time on <a href="http://incubator.apache.org/hcatalog/">HCatalog</a>. HCatalog changes the way we think about data in HDFS. No longer do we need to worry about files and directories. Instead, HCatalog seems to add a layer of abstraction on top of HDFS that treats data as a set of tables. Tools such as Pig and Hive use this layer of abstraction, and currently Hive is tightly integrated with HCatalog. Hortonworks intends to add support for HBase and Streaming later this year.</p>
<p>I waited all day to see <a href="http://www.cs.berkeley.edu/~matei/">Matei Zaharia</a>&#8217;s talk on <a href="http://www.spark-project.org">Spark</a>. Zaharia is a graduate student at UC Berkeley and it was a nice change of pace to see a student present some work. Spark is a data processing platform that sits on top of the <a href="http://www.mesosproject.org/">Mesos</a> cluster management project (also produced by Berkeley). Mesos can handle 10,000s of nodes and 100s of concurrent jobs, and jobs can be isolated in Linux containers (e.g. <a href="http://www.openvz.org">OpenVZ</a>). <em>Spark aims to extend MapReduce for iterative algorithms, and interactive low latency data mining.</em> One major difference between MapReduce and Spark is that MapReduce is acyclic. That is, data flows in from a stable source, is processed, and flows out to a stable filesystem. Spark allows iterative computation on the same data, which would form a cycle if jobs were visualized. The Resilient Distributed Dataset (RDD) serves as an abstraction over raw data, and some data is kept in memory and cached for later use. This last point is very important; Spark allows data to be kept in RAM for an approximate 20x speedup over disk-based MapReduce. RDDs are immutable and created through parallel transformations such as map, filter, groupBy and reduce. RDD immutability is similar to immutable types in functional programming languages. It does not mean that the dataset cannot change. Instead, it means that a new copy of the dataset is created, with the change included. The user can also perform <em>actions</em> on RDDs such as count, collect, etc. Some applications using Spark are traffic prediction (Berkeley), spam classification (Twitter), k-means, alternating least squares matrix factorization, and network simulation.</p>
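<p>The distinction between transformations and actions on an immutable dataset is easier to see in code. The class below is a toy Python stand-in of my own (the name <code>ToyDataset</code> is hypothetical); it is not Spark&#8217;s API and it leaves out everything that makes RDDs interesting at scale, such as partitioning across machines, caching policy and recovery from failures.</p>

```python
class ToyDataset:
    """Immutable, in-memory stand-in for the RDD idea (not Spark's API)."""

    def __init__(self, items):
        self._items = tuple(items)   # a tuple cannot be modified in place

    # Transformations return a new dataset and leave the original untouched.
    def map(self, fn):
        return ToyDataset(fn(x) for x in self._items)

    def filter(self, predicate):
        return ToyDataset(x for x in self._items if predicate(x))

    # Actions return plain values to the caller.
    def count(self):
        return len(self._items)

    def collect(self):
        return list(self._items)
```

<p>Chaining <code>filter</code> and <code>map</code> on a <code>ToyDataset</code> produces new datasets each time, while the one you started from still holds its original contents; <code>count</code> and <code>collect</code> are the points where results come back as ordinary Python values.</p>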
<blockquote><p>The main takeaway from Hadoop Summit 2010 was Cascalog. I predict the main takeaway from Hadoop Summit 2011 is Spark.</p></blockquote>
<p>One time at work I had a bizarre issue with corrupted data in HDFS. After that, I began blaming everything on HDFS. The next session <em>Data Integrity and Availability of HDFS </em>was enlightening. HDFS takes good care of Yahoo&#8217;s data. We can trust Yahoo because if HDFS breaks, Yahoo begins losing money so they know what they are talking about! Yahoo&#8217;s goal is to have 60 PB online all the time. The key to HDFS reliability is <a href="http://en.wikipedia.org/wiki/Replication_%28computer_science%29">replication</a>. A replication factor of 3 (3 copies of every file? block?) is appropriate. A replication factor of 2 is also quite robust, but should only be used when there is a backup of the data because the probability of data loss is much higher. Yahoo has had issues with losing blocks (blocks are pieces of data, so lost blocks = data loss). There are a variety of reasons and most of them had nothing to do with HDFS. One cause of lost blocks is a bug in a Hadoop component like Pig, particularly a new version. In one incident, a new version of Pig opened a lot of files without closing them, and created a lot of abandoned files. In the speaker&#8217;s anecdotal case study, none of the incidents of data loss were caused by HDFS proper. Other causes of data loss encountered were exhausting disk space, users hammering HDFS, and &#8220;other.&#8221; The speaker noted that NameNode high availability (a hot topic) would have only helped in 8 of the 36 incidents studied. Some ways of preventing data loss include resource allocation, selecting good tenants of a cluster, and fixing hardware errors quickly.</p>
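<p>A back-of-the-envelope calculation shows why dropping from 3 replicas to 2 matters so much. The sketch below is my own simplification, not something from the talk: it assumes each replica is lost independently with the same probability, which real clusters violate (machines share racks, power and software bugs), and the probability used in the usage note is a made-up number.</p>

```python
def block_loss_probability(replica_loss_probability, replication_factor):
    # A block is lost only when every one of its replicas is lost.
    # Assuming independent failures, that probability is p ** r.
    return replica_loss_probability ** replication_factor
```

<p>With a hypothetical per-replica loss probability of 0.01, the result is 0.01 squared for a replication factor of 2 and 0.01 cubed for a factor of 3, so under these assumptions the third replica makes losing a block 100 times less likely.</p>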
<blockquote><p>If your job isn&#8217;t running, it&#8217;s not likely caused by HDFS.
</p></blockquote>
<p>Bill Graham of <a href="http://www.cbsinteractive.com/">CBS Interactive</a> gave an interesting talk about using Hadoop to build a graph of users and content. Surprisingly, CBSi has quite a large arsenal of MapReduce enabled technologies: <a href="http://incubator.apache.org/chukwa/">Chukwa</a>, Pig, Hive, HBase, <a href="http://www.cascading.org">Cascading</a>, <a href="http://www.cloudera.com/blog/2009/06/introducing-sqoop/">Sqoop</a> and <a href="http://yahoo.github.com/oozie/">Oozie</a>. CBSi uses only 100 nodes with 500 TB of disk space for processing data associated with 235 million uniques (individuals, roughly). Mapping users to content should be easy, right? Well, some users have multiple identities, including anonymous identities. The goal is to create a holistic graph that &#8220;matches&#8221; all of the identities efficiently for uses such as ad targeting. CBSi&#8217;s needs in a Hadoop platform: rapid experimentation and data mining, and to power new site features and ad optimization. The main vehicle for representing data is a Pig RDF that allows for a kind of graph based join so to speak. CBSi hopes to add Oozie, <a href="http://sna-projects.com/azkaban/">Azkaban</a>, HCatalog and <a href="http://blogs.apache.org/hama/">Hama</a> (graph processing) to its arsenal.</p>
<p><a href="http://www.mapr.com">MapR</a> was a very prominent sponsor of Hadoop Summit. M. C. Srivas presented a technical discussion of MapR&#8217;s capabilities and how it differs from Apache Hadoop. MapR is a full distribution of Hadoop and is 100% compatible with the Apache distribution and projects such as Pig and Oozie. MapR is fast and boasts high availability by rethinking the NameNode. The NameNode is a bottleneck since 60% of file operations are metadata. The NameNode and its limitations limit the size of a cluster. To resolve some problems with the NameNode, MapR turns every server into a metadata server. Since metadata is seldom retrieved, it is paged to disk so more RAM can be used for MapReduce proper. MapR distributes NameNode functionality and provides full random read and write semantics as well as export to NFS. With the distributed NameNode, runaway tasks no longer take down the NameNode. MapR has some lofty performance goals. While HDFS can handle 10-50PB, MapR can handle 1010 EB (exabytes). While HDFS can handle 2000 nodes in a cluster, MapR can handle 10,000 or more. It was mentioned at BigDataCamp that MapR does not rely on HDFS at all.</p>
<p>The final session I attended was Avery Ching&#8217;s talk on <em>Giraph: Large-scale Graph Processing on Hadoop</em>. Unfortunately, Avery jumped right into the technical details of <a href="https://github.com/aching/Giraph">Giraph</a> without giving a high level overview of the problem Giraph solves. Also, his slides were in 10 point font and I could not read them. Combine this with the fact that my brain was exhausted, and I wanted to head to the bar. Vanilla Hadoop incurs too much overhead for graph data processing. Yahoo used <a href="http://www.lam-mpi.org/">MPI</a> in the past for graph data but it had no fault tolerance and was too generic. Giraph is a library for iterative graph processing. Giraph is fault-tolerant and dynamic. Giraph takes a vertex-centric approach to graph data. I found this interesting because most of my work is edge-centric. Overall, Giraph is similar in goal to Pregel, but available to non-Googlers and has no single point of failure (except those incurred by Hadoop).</p>
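<p>Since the talk skipped the high level picture, here is a toy sketch of what &#8220;vertex-centric&#8221; means, written in plain Python. This is not Giraph&#8217;s API (Giraph is a Java library), and the function name is my own invention. Each vertex only sees the messages sent to it, updates its own value, and sends messages along its out-edges. The loop repeats in rounds (&#8220;supersteps&#8221;) until no messages are left. The example spreads the largest value through the graph.</p>

```python
def max_value_supersteps(graph, values):
    """Toy vertex-centric loop: each vertex adopts the largest value that reaches it.

    graph:  dict mapping vertex -> list of out-neighbors (every neighbor
            must also appear as a key)
    values: dict mapping vertex -> initial number
    Returns the final values and the number of supersteps that ran.
    """
    values = dict(values)
    # Superstep 0: every vertex sends its value along its out-edges.
    inbox = {v: [] for v in graph}
    for v, neighbors in graph.items():
        for n in neighbors:
            inbox[n].append(values[v])
    supersteps = 1
    # Later supersteps: a vertex only sends again if a message improved its value.
    while any(inbox.values()):
        outbox = {v: [] for v in graph}
        for v, messages in inbox.items():
            if messages and max(messages) > values[v]:
                values[v] = max(messages)
                for n in graph[v]:
                    outbox[n].append(values[v])
        inbox = outbox
        supersteps += 1
    return values, supersteps


if __name__ == "__main__":
    ring = {"a": ["b"], "b": ["c"], "c": ["a"]}
    print(max_value_supersteps(ring, {"a": 3, "b": 6, "c": 2}))
```

<p>Notice that no code ever looks at the graph as a whole; all of the logic lives in what one vertex does with its inbox. An edge-centric formulation would instead iterate over edges.</p>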
<p>Now I have to catch my breath with some wine, beer and cheese at the nice happy hour reception afterwards. It was a long day, and a great day at Hadoop Summit 2011 and I will of course be back next year. I have no clue what is in store for me next year. Will the NameNode be removed as the single point of failure? Will other open-source software start integrating Hadoop? We shall see&#8230;</p>
<p>And now it is time to head back to Los Angeles.</p>
<p>Have a happy and safe Fourth of July!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bytemining.com/2011/06/my-review-of-hadoop-summit-2011-hadoopsummit/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Big Data Camp 2011 #BigDataCamp</title>
		<link>http://www.bytemining.com/2011/06/big-data-camp-2011-bigdatacamp/</link>
		<comments>http://www.bytemining.com/2011/06/big-data-camp-2011-bigdatacamp/#comments</comments>
		<pubDate>Wed, 29 Jun 2011 06:35:45 +0000</pubDate>
		<dc:creator>Ryan</dc:creator>
				<category><![CDATA[Amazon EC2]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Databases and Datastores]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[MongoDB]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Web Mining]]></category>

		<guid isPermaLink="false">http://www.bytemining.com/?p=833</guid>
		<description><![CDATA[<p>It has been a while since I have been to Silicon Valley, but Hadoop Summit gave me the opportunity to go. To make the most of the long trip, I also decided to check out BigDataCamp held the night before from 5:30 to 10pm. Although the weather was as predicted, I was not prepared for the deluge of pouring rain at the end of June. The weather is one of the things that is preventing me from moving up to Silicon Valley.</p>
<p>The food/drinks/networking event must have been amazing because it was very difficult to get everyone to come to the main room to start the event! We started with a series of lightning talks from some familiar names and some unfamiliar ones.</p>
<p>Chris Wensel, the developer of Cascading, is also the founder of Concurrent, Inc. Cascading is an alternate API for Map-Reduce written in Java. With Cascading, developers can chain multiple map-reduce jobs to form an ad hoc workflow. Cascading adds a built-in planner to manage jobs. Cascading usually implies Hadoop, but Cascading can run on other platforms including EMC Greenplum and the new MapR project. RazorFish and BestBuy use Cascading for behavioral targeting. Flightcaster uses a domain specific language (DSL) [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://www.bytemining.com/wp-content/uploads/2011/06/hadoop.png" alt="" class="lfloatbox" />It has been a while since I have been to Silicon Valley, but <a href="http://developer.yahoo.com/events/hadoopsummit2011/">Hadoop Summit</a> gave me the opportunity to go. To make the most of the long trip, I also decided to check out <a href="http://www.bigdatacamp.org">BigDataCamp</a> held the night before from 5:30 to 10pm. Although the weather was as predicted, I was not prepared for the deluge of pouring rain at the end of June. The weather is one of the things that is preventing me from moving up to Silicon Valley.</p>
<p>The food/drinks/networking event must have been amazing because it was very difficult to get everyone to come to the main room to start the event! We started with a series of lightning talks from some familiar names and some unfamiliar ones.</p>
<p><a href="http://chris.wensel.net/">Chris Wensel</a>, the developer of <a href="http://www.cascading.org">Cascading</a>, is also the founder of <a href="http://www.concurrentinc.com">Concurrent, Inc.</a> Cascading is an alternate API for Map-Reduce written in Java. With Cascading, developers can chain multiple map-reduce jobs to form an ad hoc workflow. Cascading adds a built-in planner to manage jobs. Cascading usually implies Hadoop, but Cascading can run on other platforms including EMC Greenplum and the new <a href="http://www.mapr.com">MapR</a> project. <a href="http://www.razorfish.com">RazorFish</a> and <a href="http://www.bestbuy.com">BestBuy</a> use Cascading for behavioral targeting. <a href="http://www.flightcaster.com">Flightcaster</a> uses a domain specific language (DSL) written in <a href="http://clojure.org">Clojure</a> on top of Cascading for large data processing jobs. <a href="http://www.etsy.com">Etsy</a> uses a DSL written in JRuby as a layer on top of Cascading. Of course, the big player is <a href="http://www.backtype.com">BackType</a>. <a href="http://nathanmarz.com/blog/introducing-cascalog-a-clojure-based-query-language-for-hado.html">Cascalog</a> combines Cascading with the Datalog language to provide a declarative language for working with data and map-reduce. Wensel noted that one disadvantage of Pig and Hive that Cascading addresses is that Pig and Hive lack a physical planner. Workflow managers such as <a href="http://yahoo.github.com/oozie">Oozie</a> and <a href="http://sna-projects.com/azkaban/">Azkaban</a> can run Cascading jobs as part of a workflow. Version 2.0 of Cascading removes Hadoop as a dependency and will allow users to run Cascading jobs on data that is in RAM rather than on disk.</p>
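<p>To make &#8220;chaining map-reduce jobs&#8221; concrete, here is a small in-memory sketch in Python. It has nothing to do with Cascading&#8217;s actual Java API; <code>run_job</code> and the other names are hypothetical helpers I made up for illustration. The first job counts words, and the second job takes the first job&#8217;s output as its input and groups the words by how often they occur.</p>

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Run one in-memory map-reduce pass over an iterable of records."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # the "shuffle": group values by key
    return [reducer(key, values) for key, values in sorted(groups.items())]


# Job 1: count words.
def split_words(line):
    return [(word, 1) for word in line.lower().split()]


def sum_counts(word, ones):
    return (word, sum(ones))


# Job 2: group words by how often they occur.
def key_by_count(pair):
    word, count = pair
    return [(count, word)]


def collect_words(count, words):
    return (count, sorted(words))


def word_frequency_flow(lines):
    """Chain the two jobs: the output of job 1 is the input of job 2."""
    counts = run_job(lines, split_words, sum_counts)
    return run_job(counts, key_by_count, collect_words)


if __name__ == "__main__":
    print(word_frequency_flow(["a b a", "b c"]))
```

<p>On a real cluster, each <code>run_job</code> call would be a separate Hadoop job with its intermediate output written to disk, and wiring those jobs together by hand is the tedious part that a tool with a planner takes off your plate.</p>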
<p>James Falgout from <a href="http://www.pervasivedatarush.com/">Pervasive DataRush</a> presented the second lightning talk. Pervasive&#8217;s products seem to use this &#8220;dataflow&#8221; paradigm that attempts to fill in features that are missing in map-reduce. The basic description compared dataflow to the Unix shell pipeline with message passing. James showed an example dataflow that a user could configure visually. Pervasive is working on integrating dataflow with Hive.</p>
<p><a href="http://www.quest.com/newsroom/Guy-Harrison.aspx">Guy Harrison</a> from <a href="http://www.quest.com">Quest Software</a> introduced their system Toad for Cloud Databases. Toad attempts to merge data from several different data sources for analysis such as Hive, <a href="http://www.mongodb.com">MongoDB</a>, and <a href="http://cassandra.apache.org/">Cassandra</a>. Unfortunately, Guy&#8217;s thick Australian accent made his humorous talk unintelligible to me (hearing loss?).</p>
<p>Steve Wooledge from <a href="http://www.asterdata.com/">AsterData</a> (now part of <a href="http://www.teradata.com">Teradata</a>) discussed the company&#8217;s product goal of taking a standard relational database system and integrating map-reduce on top of it. Such a system is flexible and allows both SQL-like access as well as programmatic access to data. This hybrid row-oriented and column-oriented datastore can be used for path and pattern matching, text processing and graph traversal among the usual tasks. nPath is a product that enhances a system with transactional data analytics (click analysis, sessionization).</p>
<p>Andrew Yu from <a href="http://www.emc.com/">EMC</a> presented some of EMC&#8217;s data analytics products. I wrote about EMC in an earlier blog post so I will spare the details. EMC offers a data warehouse product as well as a hybrid, pre-configured system containing its Greenplum warehouse and map-reduce built-in.</p>
<p>Ben Lee from <a href="http://www.foursquare.com">Foursquare</a> discussed how big data is used at Foursquare and gave some statistics about its service. This was by far the most interesting talk to me. Foursquare offers realtime suggestions of places to visit based on the user&#8217;s history, and the user&#8217;s friends&#8217; histories based on day of week and time of day. Foursquare has 10 million users, 50 million venues, and 750 million check-ins. There are over 3 million check-ins per day. 10,000 developers use Foursquare&#8217;s API. MongoDB is the main datastore and <a href="http://www.scala-lang.org">Scala</a> is used for the front end. Back end data processing uses Hadoop (both vanilla, and Streaming) as well as <a href="http://archive.cloudera.com/cdh/3/flume-0.9.1+1/UserGuide.html">Flume</a>, Elastic MapReduce, and S3. Ben displayed an awesome visualization of check-in data; researchers took check-ins from New York City and performed sentiment analysis on the text attached to the check-in. The visualization suggested that people were the &#8220;happiest&#8221; in Manhattan.</p>
<p>Paul Baclace introduced some software called Phatvis that allows developers to visualize map-reduce jobs. It is his hope that the visualization can be used to fine tune Hadoop parameters based on evidence from prior jobs. The source can be found <a href="http://www.assembla.com/spaces/phatvis">here</a>.</p>
<p>Of course, the fun in every &#8220;unconference&#8221; is the circus known as scheduling the sessions. Some of the proposed sessions:</p>
<ul>
<li>Big Data 101 / Intro to Hadoop</li>
<li>Extract MapReduce Data into Relational Database High Performance Database</li>
<li>&#8220;ETL was Yesterday&#8221; What&#8217;s next?</li>
<li>Operations of Hadoop Clusters</li>
<li>SQL / NoSQL Why not Both? (Aster)</li>
<li>Geodata</li>
<li>Big Data Retention / Compression</li>
<li>Business Intelligence and Hadoop</li>
<li>Data Management Lifecycle</li>
<li>Distributions of Hadoop</li>
<li>Hadoop for Bioinformatics and Healthcare</li>
</ul>
<p>The topics did not seem exciting this time, and seemed to have a lot of overlap with presentations at Hadoop Summit, but I found two (we could only attend two) that stood out.</p>
<p><strong>Session 1: Operating a Hadoop Cluster</strong></p>
<p>Thank goodness managing a Hadoop cluster is not in my job description (only small clusters I use for research). Charles Wimmer, the lead of the Operations track for Hadoop Summit, led this discussion, and much of it dovetailed off of incidents that occurred at Yahoo. A popular topic of discussion was backup. We agreed that there is no such thing as &#8220;backing up&#8221; a Hadoop cluster. Any data that is important should be replicated, preferably 3 times, or transmitted in parallel over a pipe to multiple data centers. One strict limitation of replication is that if some new release of Hadoop, or some new Hadoop distribution, contains a bug that corrupts the data, all replicas may also be corrupted.</p>
<p>Discussion then turned to hardware. Yahoo uses high-density storage nodes with 6 drives each containing 2-3TB of space. Charles mentioned that a common problem with Hadoop is that it is difficult to keep the CPUs busy, especially in a server with 8 Nehalem processors (8 CPUs or 8 cores?). The major reason for this is that the main bottleneck in map-reduce jobs is the network I/O required in the shuffle phase as data comes out of the mappers. The map phase is the most CPU-bound phase. Wimmer, and several others, made one thing clear: <strong>use SATA, not SAS. </strong>Apparently SATA and SAS drives have similar read performance (I believe I misheard that) for practical purposes. The original Google map-reduce was based on commodity hardware, and quantity is more important than quality (within reason). For this reason, SATA provides a lot more space for your data. The same amount of space is an order of magnitude more expensive for SAS drives.</p>
<p>The next topic of discussion was the NameNode as the single point of failure. Apparently the <a href="http://www.mapr.com">MapR</a> system does not use the HDFS, and recovering from a lost NameNode is not as severe as it is for Hadoop. Hadoop 0.20.2 also supposedly introduces sharding, called NameNode federation, where the namespace is divided over several NameNodes.</p>
<p>Hadoop has some issues with certain types of scalability, particularly with the JobTracker. When a large job with a large number of mappers and reducers finishes quickly, the TaskTrackers send an influx of messages to the JobTracker and it gets overwhelmed. To prevent users from thrashing a cluster, use a capacity scheduler to put hard caps on queues. There was also some high level discussion of QoS-like functionality among users and sophisticated monitoring of jobs. Map-Reduce NextGen improves scalability by allocating a JobTracker to each individual job whose purpose is solely to monitor resource allocation. The biggest feature Charles would like to see is high availability NameNodes.</p>
<p>Yahoo boasts an impressive 22 clusters each containing between 400 and 4200 nodes. A fellow from AOL indicated that AOL has a cluster of size close to 1000. Is AOL coming back from the dead?</p>
<p>Thank goodness managing a Hadoop cluster is not in my job description&#8230;</p>
<p><strong>Session 2: Geodata</strong></p>
<p>I do not get the opportunity to work with geographical data often, so I was curious to see what these folks had to say. The discussion was led by a fellow named Brian from <a href="http://www.osgeo.org">OSGeo</a>. The largest point that I took away from this talk was that not an incredible amount of thought has been dedicated to Big Geodata, particularly how to store and process it. <a href="http://www.postgresql.com">PostgreSQL</a> and <a href="http://postgis.refractions.net/">PostGIS</a> are a few ways to store and analyze manageable amounts of data, but not large data. MongoDB is one solution but has its issues. A fellow from Foursquare mentioned that MongoDB cannot shard across geographic data, but I could not hear precisely what he said to that effect. The biggest challenge seems to be a lack of an indexer capable of indexing a large amount of geospatial data aside from the standard RTree implementation. I believe that geodata as well as streaming data and multimedia are some of the biggest unsolved problems in Big Data.</p>
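<p>For readers who have not worked with spatial indexes either, here is a minimal sketch of the simplest alternative to an R-tree that I know of: a fixed grid, written in Python. Nobody at the session proposed this, and it is not how PostGIS or MongoDB index geodata; the <code>GridIndex</code> class and its parameters are my own illustration. Points are bucketed by cell, and a &#8220;what is near me&#8221; query only looks at the query&#8217;s cell and its 8 neighbors.</p>

```python
import math
from collections import defaultdict


class GridIndex:
    """Bucket points into fixed-size latitude/longitude cells."""

    def __init__(self, cell_degrees=0.01):
        self.cell = cell_degrees
        self.cells = defaultdict(list)

    def _key(self, lat, lon):
        # Integer (row, column) of the cell that contains the point.
        return (math.floor(lat / self.cell), math.floor(lon / self.cell))

    def add(self, name, lat, lon):
        self.cells[self._key(lat, lon)].append((name, lat, lon))

    def nearby(self, lat, lon):
        """Return names stored in the query's cell and its 8 neighbors."""
        row, col = self._key(lat, lon)
        found = []
        for d_row in (-1, 0, 1):
            for d_col in (-1, 0, 1):
                bucket = self.cells.get((row + d_row, col + d_col), [])
                found.extend(name for name, _, _ in bucket)
        return sorted(found)


if __name__ == "__main__":
    index = GridIndex(cell_degrees=1.0)
    index.add("cafe", 10.5, 20.5)
    index.add("park", 11.2, 20.9)
    index.add("airport", 30.0, 40.0)
    print(index.nearby(10.4, 20.4))
```

<p>This toy ignores wraparound at the antimeridian and the fact that cells shrink toward the poles, and a fixed cell size copes badly with dense cities next to empty ocean. Those weaknesses are a good hint at why indexing geodata at scale is still an open problem.</p>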
<p>Anyways, on to Hadoop Summit!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bytemining.com/2011/06/big-data-camp-2011-bigdatacamp/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Google &#8212; Is Search-by-Multimedia on the Way?</title>
		<link>http://www.bytemining.com/2011/06/google-is-search-by-multimedia-on-the-way/</link>
		<comments>http://www.bytemining.com/2011/06/google-is-search-by-multimedia-on-the-way/#comments</comments>
		<pubDate>Tue, 21 Jun 2011 17:00:00 +0000</pubDate>
		<dc:creator>Ryan</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Databases and Datastores]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Natural Language Processing]]></category>
		<category><![CDATA[Web Mining]]></category>

		<guid isPermaLink="false">http://www.bytemining.com/?p=819</guid>
		<description><![CDATA[<p>Recently, I have been thinking about alternate ways of specifying search queries other than with text. A couple of weeks ago I came across a piece of music that I could not identify. I thought it would be a huge win for a search engine to allow me to upload this piece, and it would present me with matches, or near matches to other pieces that sound similar, or have similar characteristics. Some services already exist. Shazam allows a user to place a microphone near playing music and it will identify the artist and song. Some uses of search-by-sound:</p>

Music identification (&#8220;solved&#8221; &#8211; Shazam)
Music personalization and recommendation (&#8220;solved&#8221; &#8211; Pandora)
Identification of the source of a sound (e.g. a species of bird, a musical instrument, an inanimate object)
MP3 and media file search
Finding material that violates copyright

<p>As our motivating example, consider we find some really cool graphic on the web and we want to know where it likely originated (i.e. art, a meme). In such a search engine, we could upload the graphic and get results containing the exact image, or images that are very similar such as variations of the image (crop, resize, borders, different effects), modifications of the image (consider Obama-izing [...]]]></description>
			<content:encoded><![CDATA[<p>Recently, I have been thinking about alternate ways of specifying search queries other than with text. A couple of weeks ago I came across a piece of music that I could not identify. I thought it would be a huge win for a search engine to allow me to upload this piece and have it present me with matches, or near matches to other pieces that sound similar, or have similar characteristics. Some services already exist. <a href="http://www.shazam.com">Shazam</a> allows a user to place a microphone near playing music and it will identify the artist and song. Some uses of search-by-sound:</p>
<ul>
<li>Music identification (&#8220;solved&#8221; &#8211; Shazam)</li>
<li>Music personalization and recommendation (&#8220;solved&#8221; &#8211; <a href="http://www.pandora.com">Pandora</a>)</li>
<li>Identification of the source of a sound (e.g. a species of bird, a musical instrument, an inanimate object)</li>
<li>MP3 and media file search</li>
<li>Finding material that violates copyright</li>
</ul>
<p>As our motivating example, suppose we find some really cool graphic on the web and we want to know where it likely originated (e.g. art, a meme). In such a search engine, we could upload the graphic and get results containing the exact image, or images that are very similar such as variations of the image (crop, resize, borders, different effects), modifications of the image (consider Obama-izing [the campaign logo] someone&#8217;s picture), and semantically similar images (different photos of the same object or person). Wouldn&#8217;t this be cool? A billion-dollar idea, right?</p>
<p>Well, Google apparently beat me (and millions of others I&#8217;m sure) to it with its search-by-image feature on Google Images. I uploaded a photo of myself to see what I would get. We see my school website (where the image originated), as well as several other sites that use my <a href="http://www.gravatar.com">Gravatar</a>. Not too bad.</p>
<p><img src="http://www.bytemining.com/wp-content/uploads/2011/06/visual_search_res.png" alt="" /></p>
<p>On the results page, users can also provide some type of labeled data to Google. I am not exactly sure what it is used for yet, but note the text in the search bar: &#8220;Describe this image.&#8221; Upon entering my name, Google found another photo that looks almost identical to the first one &#8212; a <em>variation</em>.</p>
<p><img src="http://www.bytemining.com/wp-content/uploads/2011/06/visual_search_describe.png" alt="" /></p>
<p><img src="http://www.bytemining.com/wp-content/uploads/2011/06/visual_search_sim1.png" alt="" /></p>
<p>Below are the &#8220;visually related&#8221; images that were presented to me (before I labeled my photo in the search bar):</p>
<p><img src="http://www.bytemining.com/wp-content/uploads/2011/06/visual_search_sim2.png" alt="" /></p>
<p>I see Steve Jobs (I am honored), but 7 out of 16 images are women, and of the men, we look nothing alike. I know, I know, &#8220;visually related&#8221; refers to similarity in pixels between images, but I expected more. In these images, we see a lot of red and blue hues.</p>
<p>Let&#8217;s try something that will generate many more hits: a popular meme&#8230;</p>
<p><img src="http://www.bytemining.com/wp-content/uploads/2011/06/visual_search_res2.png" alt="" /></p>
<p>The image I uploaded was originally posted on Amazon S3, and is linked to by the above two web pages. <strong>Google does a much better job when using a URL rather than uploading an image for obvious reasons.</strong> More interestingly, the &#8220;visually similar&#8221; images show variations and modifications of the same image, based on pixel similarity.</p>
<p><img src="http://www.bytemining.com/wp-content/uploads/2011/06/visual_search_sim31.png" alt="" /></p>
<p>And we also see web pages containing a copy of the image (not linked to the original S3 file):</p>
<p><img src="http://www.bytemining.com/wp-content/uploads/2011/06/visual_search_sim4.png" alt="" /></p>
<p><strong>But This Isn&#8217;t Good Enough Yet!</strong></p>
<p>Google &#8220;Search-by-Image&#8221; is an awesome first step, and I look forward to seeing more as it is undoubtedly coming. For search-by-image (or search-by-multimedia) to be useful, it must also take &#8220;semantic&#8221; or conceptual knowledge into account, just like with text search. That is, if I upload a photo of myself, I should get back other photos of myself from various (hopefully authorized) sources. Or, if we upload a photo of the Eiffel Tower, we should get back integrated search results containing other images of the Eiffel Tower as well as text results with information about the Eiffel Tower, and perhaps a tourist&#8217;s video or documentary.</p>
<p>One may at first believe that the <a href="http://knowyourmeme.com/memes/o-rly">O RLY</a> search used some semantic knowledge; however, all of the images share a large number of pixels and these images are likely just &#8220;visually similar&#8221; as stated. Using semantic knowledge, one may see results of other famous owls used in memes in addition to the variations and modifications of the O RLY owl.</p>
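<p>I have no idea how Google actually computes &#8220;visually similar,&#8221; but a classic trick for finding variations of the same image is a perceptual hash. Below is a minimal Python sketch of the &#8220;average hash&#8221; idea, which is purely my own illustration and not Google&#8217;s method. It assumes the image has already been shrunk to a tiny grayscale grid (rows of 0-255 integers). Each pixel becomes one bit depending on whether it is brighter than the image&#8217;s mean, and two images are compared by counting the bits that differ.</p>

```python
def average_hash(pixels):
    """Turn a small grayscale image (rows of 0-255 ints) into a tuple of bits."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    # One bit per pixel: 1 if brighter than the image's own mean.
    return tuple(1 if value > mean else 0 for value in flat)


def hamming_distance(hash_a, hash_b):
    """Count positions where two equal-length hashes differ."""
    return sum(a != b for a, b in zip(hash_a, hash_b))


if __name__ == "__main__":
    original = [[10, 200], [220, 30]]
    brighter = [[30, 220], [240, 50]]  # same picture, brightness shifted
    inverted = [[245, 55], [35, 225]]  # light and dark swapped
    base = average_hash(original)
    print(hamming_distance(base, average_hash(brighter)))
    print(hamming_distance(base, average_hash(inverted)))
```

<p>A small distance means &#8220;probably a variation of the same image,&#8221; which fits the O RLY results. It also shows the limitation: a different photo of a different owl would hash to something unrelated, because nothing here knows what an owl is.</p>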
<p>All of the data collected by such a system would also provide a hell of a corpus for image and multimedia classification. Researchers could construct classifiers for detecting spammy multimedia, knockoff multimedia (second, third generation grain in images, waveform distortion in audio), pornographic content, as well as augmenting labeled and unlabeled multimedia with metadata. For example, suppose I take a picture of what I think is a rhododendron (inside joke for readers). With such a large corpus, I can upload the photo and have Google (or some other AwesomeSearch) retag the image as that of a hydrangea instead.</p>
<p>Uses of search-by-multimedia with semantic knowledge:</p>
<ul>
<li>Cross referencing objects or people on various different sites.</li>
<li>Product search when textual information (or QR code) is not known</li>
<li>Catching criminals</li>
<li>Cataloging media</li>
<li>Methods for multimedia spam detection</li>
<li>Geolocation without use of GPS or WiFi, and location search</li>
<li>Augmentation of metadata and tagging of objects, people, etc.</li>
<li>Detecting adult, inappropriate or illegal content.</li>
<li>Identification of actions from images, video or audio and retrieval of related information</li>
</ul>
<p>Of course, search-by-multimedia poses the same challenges that we face in big data today:</p>
<ul>
<li>choosing and boosting the proper features</li>
<li>collecting a significant and correctly labeled corpus</li>
<li>fast processing of large datasets with new and existing machine learning algorithms</li>
<li>efficient indexing and retrieval algorithms to match queries with probable results</li>
<li>these things are easier said than done, but a lot of fun.</li>
</ul>
<p>Search-by-multimedia is a very interesting concept and is exciting to think about. In this age of big data and technology, anything is possible. I look forward to the day when anything on the Internet can be found, no matter its content or medium.</p>
<p><em>To check out Google&#8217;s search-by-image, <a href="http://images.google.com">click here</a> and then click on the camera icon.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.bytemining.com/2011/06/google-is-search-by-multimedia-on-the-way/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Want to Build a Research Server?</title>
		<link>http://www.bytemining.com/2011/05/want-to-build-a-research-server-6/</link>
		<comments>http://www.bytemining.com/2011/05/want-to-build-a-research-server-6/#comments</comments>
		<pubDate>Tue, 31 May 2011 17:00:00 +0000</pubDate>
		<dc:creator>Ryan</dc:creator>
				<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Statistics and Statistical Computing]]></category>

		<guid isPermaLink="false">http://www.bytemining.com/?p=763</guid>
		<description><![CDATA[<p>I am usually pretty reserved with cash, but after working full-time for six months, I finally decided to spend some of my money on building a new research development server. This process was long overdue and the reason it took me so long to commit to this project was all of the new technology developed since building my last server. This &#8220;new technology&#8221; can be pretty confusing unless one specializes in computer architecture. I want to share what I have learned throughout this process, while giving some background. These are only my opinions, and I may be wrong on some things as I am not a hardware expert. I encourage you to read and learn more on your own.</p>
<p>The CPU/Processor</p>
<p>If you are reading this article, I probably do not need to explain what the CPU/processor does. For high performance computing, you will want to get a CPU that is very &#8220;fast&#8221; and also has multiple cores. The definition of the word &#8220;fast&#8221; is in the eyes of the beholder and typically refers to more than just clock speed (GHz). In the constant war between AMD and Intel, I stick with Intel. AMD processors are powerful, but they seem to have [...]]]></description>
			<content:encoded><![CDATA[<p>I am usually pretty reserved with cash, but after working full-time for six months, I finally decided to spend some of my money on building a new research development server. This process was long overdue and the reason it took me so long to commit to this project was all of the new technology developed since building my last server. This &#8220;new technology&#8221; can be pretty confusing unless one specializes in computer architecture. I want to share what I have learned throughout this process, while giving some background. These are only my opinions, and I may be wrong on some things as I am not a hardware expert. I encourage you to read and learn more on your own.</p>
<p><strong>The CPU/Processor</strong></p>
<p>If you are reading this article, I probably do not need to explain what the CPU/processor does. For high performance computing, you will want to get a CPU that is very &#8220;fast&#8221; and also has multiple cores. The definition of the word &#8220;fast&#8221; is in the eye of the beholder and typically refers to more than just clock speed (GHz). In the constant war between AMD and Intel, I stick with Intel. AMD processors are powerful, but they seem to have more of a market with gamers. Intel is my preference, and I have not yet run into anyone who feels strongly in favor of AMD for high-performance computing (HPC). There are two main processor lines under Intel: standard, and <a href="http://en.wikipedia.org/wiki/Intel_Xeon">Xeon</a>. Standard processors are your run-of-the-mill CPUs that are found in consumer desktop machines. Xeon processors are designed for non-consumer server, workstation and embedded systems use. I do not consider researchers &#8220;consumers&#8221;; we are producers, so the Xeon family is better suited to our needs. On the other hand, you may find that a standard CPU will fit your needs for your particular research or use case. <a href="http://en.wikipedia.org/wiki/Xeon">Xeon processors typically have more cache and more multiprocessing capabilities</a>&#8230;and they are a lot more expensive. <em>For high-performance computing, I strongly suggest Intel Xeon</em>.</p>
<p>After months of research, I have concluded that multiple Intel Xeon processors are better than one <a href="http://www.intel.com/products/processor/corei7/index.htm">Intel Core i7</a>. As of this writing, it seems that i7 processors cannot be doubled (or tripled etc.) up on one motherboard like Xeons can. Like AMD processors, the i7 seems to be favored by gamers and those needing a richer multimedia experience.</p>
<p>In 2011, most CPUs in new systems have multiple <a href="http://en.wikipedia.org/wiki/Multi-core_processor"><em>cores</em></a>. Each core can essentially run one process at a time, so a system with <em>n </em>cores can run <em>n </em>processes simultaneously. Many CPUs are <a href="http://en.wikipedia.org/wiki/Hyperthreading">hyperthreading</a> enabled, meaning that each core can actually run 2 threads simultaneously, bringing the total number of threads to 2<em>n</em>. But can&#8217;t a single-core system already run multiple processes concurrently? We can run Firefox, TweetDeck, Thunderbird etc. concurrently, right? In practice, it <em>seems </em>that the CPU is processing multiple threads simultaneously. If we could slow down time to the microsecond level, we would see that a single core works on one process at a time, then does a <em><a href="http://en.wikipedia.org/wiki/Context_switch">context switch</a> </em>to another process. This rapid switching gives the illusion that the CPU is running multiple processes simultaneously.</p>
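<p>To make the <em>n</em> versus 2<em>n</em> arithmetic concrete, here is a minimal Python sketch. The function name <code>hardware_threads</code> and the dual-socket, six-cores-per-CPU figures are my own assumptions for illustration, not the specification of any particular build.</p>

```python
import os

def hardware_threads(sockets, cores_per_socket, hyperthreading=True):
    """Threads that can run truly simultaneously: n cores, or 2n with hyperthreading."""
    cores = sockets * cores_per_socket
    return cores * 2 if hyperthreading else cores

# Hypothetical dual-socket build with six cores per CPU (assumed figures).
print(hardware_threads(2, 6))         # 24
print(hardware_threads(2, 6, False))  # 12

# What the operating system reports for the current machine (may be None).
print(os.cpu_count())
```

<p>Anything beyond that count still runs, but only by taking turns through context switches.</p>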
<p>While Intel makes great products, its inventory is a nightmare to navigate. There are several things that you must know to ballpark a particular CPU model.</p>
<ul>
<li>the <em>model number</em> (the most reliable!)</li>
<li>the <em>brand name</em> specifies a group of CPU models satisfying similar use cases (Core [i3/i5/i7/i9], Core 2 Duo, Core 2 Quad, Pentium, Xeon).</li>
<li>the <em>architecture/subarchitecture</em> &#8212; specifies a <em>type </em>of processor within a brand, each containing many series (Nehalem, Westmere, Sandy Bridge are common ones these days)</li>
<li>the <em>chipset</em> (not commonly referred to, examples: Tylersburg, Cougar Point, Panther Point)</li>
<li>the <em>platform</em> which refers to a set of models (e.g. Harpertown, Jasper Forest, Gainestown, Prescott, Gulftown). Models within a series are typically only differentiated by clock speed (GHz).</li>
<li>the <em>socket type</em> specifies the shape and size of the CPU. The CPU and the motherboard must have the same socket type (e.g. LGA1366, Socket 775)</li>
</ul>
<p>As if this were not confusing enough, each Intel Xeon model number is prefixed with a letter for different use cases. The letter distinguishes CPUs with differing <a href="http://en.wikipedia.org/wiki/CPU_power_dissipation">thermal design power (TDP)</a>. (<a href="http://www.tomshardware.com/forum/281376-28-what-difference-xeon-series-processors">source</a>)</p>
<ul>
<li><strong>W </strong>stands for &#8220;Workstation&#8221; and is meant to be installed in pairs. This designation does not seem very common anymore. These CPUs typically run the fastest (clock speed) and the hottest, and they require significant cooling.</li>
<li><strong>E </strong>is &#8220;mainstream (rack mount)&#8221; and the standard model of CPU. Although it is &#8220;standard,&#8221; there is nothing wrong with it performance-wise, but it will run hot even when idle.</li>
<li><strong>X </strong>stands for &#8220;performance.&#8221; These CPUs are similar to E but provide extra overclocking capabilities and have lower idle power draw.</li>
<li><strong>L </strong>stands for &#8220;power optimized.&#8221; These are low voltage CPUs (60W or less) that are typically only used in data centers or rack servers. They typically do not come in the higher clock speeds.</li>
</ul>
<p>For the Intel Xeon, model numbers indicate what configuration it is compatible with on the motherboard (<a href="http://techtips.salon.com/features-intel-xeon-processor-26.html">source</a>):</p>
<ul>
<li>3xxx Xeons are designed to be used by themselves, as the only CPU on the motherboard.</li>
<li>5xxx Xeons are designed to be used in pairs; two CPUs on the motherboard.</li>
<li>7xxx Xeons are designed to be used in pairs, or in larger groups.</li>
</ul>
<p>The 2 CPUs that I purchased are model <a href="http://ark.intel.com/Product.aspx?id=48768">Intel Xeon E5645</a>. The Intel Xeon E5645 is part of the Gulftown platform of the Xeon family. It uses the Westmere subarchitecture, which is the 32 nm shrink of the Nehalem architecture, and connects to the system bus using socket LGA1366. (To make things more confusing, this is the same architecture used for the i7-9xx series.) The E means that it is a &#8220;mainstream&#8221; CPU. Since it is a 5000-series model, it is installed with another identical CPU on the same board.</p>
<p>The <strong>number of cores is important</strong>. Most chips in current desktops contain 2 or 4 cores. Higher end systems and servers may have 6, 8 or 10 cores per chip. Xeons with 8 and 10 cores per unit debuted in Q2 of 2011 and are very expensive (about $2000 for 8 cores). They also require a brand new socket type (LGA1567), which means a new, expensive motherboard. A CPU with more cores allows an application to perform <em>several units of work at the same time</em>; these processors allow higher overall throughput.</p>
<p>The <strong>clock speed (GHz)</strong> used to be the deciding factor for most people, until clock speeds stopped climbing even though <a href="http://en.wikipedia.org/wiki/Moore%27s_law">Moore&#8217;s Law</a> kept adding transistors. A higher clock speed may allow a single process to complete <em>faster</em>. Since games typically use a limited number of threads and require quick performance, a single i7 is a good choice for them. The i7 has multiple cores, and also has a very high clock speed.</p>
<p>The <strong>cache size and speed </strong>are also important. The cache allows very high speed access to frequently used memory locations by copying their data from RAM into the <a href="http://en.wikipedia.org/wiki/Cpu_cache">CPU cache</a>. Modern systems typically have three levels of cache: L1, L2 and L3. L1 cache is said to be the &#8220;closest&#8221; to the CPU, meaning the CPU queries the L1 cache first when performing a memory access. The L1 cache is the smallest and fastest. The L2 and L3 caches are accessed next in order, and L3 cache is larger than L2 cache. <em>Very simply put, CPUs with larger caches (especially L1) are better.</em></p>
<p>Newer processors report the throughput of the CPU&#8217;s link to the rest of the system in <a href="http://en.wikipedia.org/wiki/GT/s">gigatransfers per second (GT/s)</a> which, like GHz, quantifies some measure of &#8220;speed.&#8221; Using GT/s, one can compute the number of bits that can be transferred per second as</p>
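<p>The letter prefixes and series numbers listed above can be turned into a small lookup. This is only a sketch of the naming scheme as I have described it; the function <code>describe_xeon</code> and its tables are hypothetical helpers, and Intel&#8217;s real catalog has exceptions that this does not cover.</p>

```python
import re

# Letter prefixes and series digits as described in the lists above.
PREFIXES = {"W": "workstation", "E": "mainstream", "X": "performance", "L": "power optimized"}
SOCKETS = {"3": "single CPU", "5": "pairs", "7": "pairs or larger groups"}

def describe_xeon(model):
    """Split a model such as 'E5645' into its prefix class and socket configuration."""
    match = re.fullmatch(r"([WEXL])([357])\d{3}", model.strip().upper())
    if match is None:
        raise ValueError("unrecognized model number: " + model)
    letter, series = match.groups()
    return PREFIXES[letter], SOCKETS[series]

print(describe_xeon("E5645"))  # ('mainstream', 'pairs')
```

<p>Any model number that does not fit the letter-plus-four-digits pattern raises an error rather than guessing.</p>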
<img src='http://s.wordpress.com/latex.php?latex=%5Cmbox%7BData%20Transfer%20Rate%7D%20%3D%20%5Cmbox%7BChannel%20Width%2C%20bits%2Ftransfer%7D%20%5Ctimes%20%5Cmbox%7Btransfers%2Fsec%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mbox{Data Transfer Rate} = \mbox{Channel Width, bits/transfer} \times \mbox{transfers/sec}' title='\mbox{Data Transfer Rate} = \mbox{Channel Width, bits/transfer} \times \mbox{transfers/sec}' class='latex' />
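<p>As a worked example of the formula, here is a short Python sketch. The 16-bit channel width and the 5.86 GT/s rate are assumed figures chosen for illustration; check the spec sheet of your own CPU before relying on them.</p>

```python
def transfer_rate_bits(channel_width_bits, gigatransfers_per_sec):
    """Data transfer rate in bits/sec = channel width (bits/transfer) x transfers/sec."""
    return channel_width_bits * gigatransfers_per_sec * 1e9

# Assumed figures for illustration: a 16-bit channel at 5.86 GT/s.
bits = transfer_rate_bits(16, 5.86)
print(round(bits / 8 / 1e9, 2), "GB/s")  # 11.72 GB/s
```

<p>Dividing by 8 converts bits to bytes, which is the unit most spec sheets quote.</p>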
<p>Think of the cores vs. clock speed decision as a highway. Suppose the clock speed indicates the maximum speed limit on a single lane highway. A faster CPU corresponds to a single lane highway with a high speed limit. You will get to your destination faster. On the other hand, consider a one-lane vs. a two-lane highway, both with identical speed limits. If one lane is too busy for you, take the other lane. An increase in the number of cores increases the number of lanes you can move into. On the single-lane highway, you would need to slow down and wait for the cars in front of you to move forward. By switching lanes, you may get to your destination faster, or you may not, but more driving is completed overall.</p>
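<p>The highway analogy can be put into numbers with a deliberately crude model: every task needs a fixed number of clock cycles, and each core works on one task at a time. The function <code>batch_time</code> and the workload figures below are assumptions of mine; the model ignores cache, memory and scheduling effects.</p>

```python
import math

def batch_time(tasks, cycles_per_task, cores, clock_ghz):
    """Seconds to finish `tasks` independent jobs when each core runs one job at a time."""
    rounds = math.ceil(tasks / cores)  # waves of jobs across all lanes
    seconds_per_task = cycles_per_task / (clock_ghz * 1e9)
    return rounds * seconds_per_task

# Hypothetical workload: 3e9 cycles per task.
print(batch_time(1, 3e9, cores=4, clock_ghz=3.0))    # 1.0  (one car, higher speed limit)
print(batch_time(1, 3e9, cores=12, clock_ghz=2.4))   # 1.25 (extra lanes do not help one car)
print(batch_time(12, 3e9, cores=4, clock_ghz=3.0))   # 3.0
print(batch_time(12, 3e9, cores=12, clock_ghz=2.4))  # 1.25 (more driving completed overall)
```

<p>A single task finishes sooner on the faster clock, while a batch of twelve finishes sooner on the machine with more cores.</p>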
]]></content:encoded>
			<wfw:commentRss>http://www.bytemining.com/2011/05/want-to-build-a-research-server-6/feed/</wfw:commentRss>
		<slash:comments>13</slash:comments>
		</item>
		<item>
		<title>Review of 2011 Data Scientist Summit</title>
		<link>http://www.bytemining.com/2011/05/review-of-2011-data-scientist-summit/</link>
		<comments>http://www.bytemining.com/2011/05/review-of-2011-data-scientist-summit/#comments</comments>
		<pubDate>Fri, 13 May 2011 23:34:53 +0000</pubDate>
		<dc:creator>Ryan</dc:creator>
				<category><![CDATA[Amazon EC2]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Databases and Datastores]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[Statistics and Statistical Computing]]></category>
		<category><![CDATA[Web Mining]]></category>

		<guid isPermaLink="false">http://www.bytemining.com/?p=661</guid>
		<description><![CDATA[<p></p>
<p>Some time over the past 6 weeks I randomly saw a tweet announcing the &#8220;Data Scientist Summit&#8221; and shortly below it I saw that it would be held in Las Vegas at the Venetian. Being a Data Scientist myself is reason enough to not pass up this opportunity, but Vegas definitely sweetens the deal! On Wednesday I woke up at 6am to embark on the 5.5 hour voyage to Las Vegas.</p>








<p>The Pre-Party</p>
<p>The Venetian and all close hotels were booked, so I ended up at the Aria;  a new experience. The hotel is beautiful and very ritzy. I had heard  that the rooms were very technologically advanced but I wasn&#8217;t prepared  for the recorded welcome message, music and automatic shades opening  upon entry to the room. The Aria is a geek&#8217;s paradise. Everything is  computerized. Key cards are &#8220;waved&#8221; rather than swiped, lights are  turned on/off and dimmed by use case (&#8220;sleep&#8221;, &#8220;read&#8221; etc.), rather than  manually. There are no paper &#8220;Do Not Disturb&#8221; signs; rather, a switch  on the wall (or via TV) toggles an indicator light outside the door. And  the best part&#8230; Internet is FREE!</p>








The rhododendrons hydrangeas are real!
Work [...]]]></description>
			<content:encoded><![CDATA[<p><img class="lfloatbox" src="http://www.bytemining.com/wp-content/uploads/2011/05/emc_logo-e1305344955560.jpg" alt="" /></p>
<p>Some time over the past 6 weeks I randomly saw a tweet announcing the &#8220;<a href="http://www.datascientistsummit.com">Data Scientist Summit</a>&#8221; and shortly below it I saw that it would be held in Las Vegas at the <a href="http://www.venetian.com">Venetian</a>. Being a Data Scientist myself is reason enough to not pass up this opportunity, but Vegas definitely sweetens the deal! On Wednesday I woke up at 6am to embark on the <a href="http://maps.google.com/maps?f=d&amp;source=s_d&amp;saddr=Thousand+oaks&amp;daddr=34.29293,-118.85138+to:34.81727,-118.17031+to:las+vegas&amp;hl=en&amp;geocode=FcFmCQIdpq7q-CmRiChwViXogDEmLsxHAXoujQ%3BFcJECwIdzHjq-CnfPIC-8C3ogDEO4O_jNQ4Qww%3BFfZEEwIdOt30-ClvctUz-0bCgDHSpZcovRQoFw%3BFdYQJwIdMJoi-SnRffWkgre-gDGjebPV5tXMOg&amp;mra=dpe&amp;mrsp=1&amp;sz=9&amp;via=1,2&amp;sll=34.18227,-118.484802&amp;sspn=0.790695,2.90863&amp;ie=UTF8&amp;ll=34.445424,-118.210144&amp;spn=0.788221,2.90863&amp;t=h&amp;z=9">5.5 hour voyage to Las Vegas</a>.</p>
<table border="0">
<tbody>
<tr>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/05/sign.jpg" alt="" width="250px" height="307px" /></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/05/logo.jpg" alt="" width="544px" height="307px" /></td>
</tr>
</tbody>
</table>
<p><strong>The Pre-Party</strong></p>
<p>The Venetian and all close hotels were booked, so I ended up at the <a href="http://www.arialasvegas.com/">Aria</a>;  a new experience. The hotel is beautiful and very ritzy. I had heard  that the rooms were very technologically advanced but I wasn&#8217;t prepared  for the recorded welcome message, music and automatic shades opening  upon entry to the room. The Aria is a geek&#8217;s paradise. <em>Everything </em>is  computerized. Key cards are &#8220;waved&#8221; rather than swiped, lights are  turned on/off and dimmed by use case (&#8220;sleep&#8221;, &#8220;read&#8221; etc.), rather than  manually. There are no paper &#8220;Do Not Disturb&#8221; signs; rather, a switch  on the wall (or via TV) toggles an indicator light outside the door. And  the best part&#8230; <em>Internet is FREE!</em></p>
<table border="0">
<tbody>
<tr>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/05/rhodo-e1305328956151.jpg" alt="" /></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/05/geek1-e1305329382179.jpg" alt="" /></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/05/geek2-e1305329434813.jpg" alt="" /></td>
</tr>
<tr>
<td><span style="font-size: x-small;">The <span style="text-decoration: line-through;">rhododendrons</span> hydrangeas are real!</span></td>
<td><span style="font-size: x-small;">Work desk panel contains Ethernet, power, USB, VGA, audio.</span></td>
<td><span style="font-size: x-small;">Cables, provided you want to pay the minibar charge.</span></td>
</tr>
</tbody>
</table>
<p><strong>Data Scientist Summit, Day 1</strong></p>
<p>I arrived at the conference room and quickly took my seat. Seated in the close vicinity were several familiar faces. I also finally got a chance to meet <a href="http://www.drewconway.com/zia/">Drew Conway</a> (<a href="http://twitter.com/drewconway">@drewconway</a>) and David Smith (<a href="http://twitter.com/revodavid">@revodavid</a>), both of whom happened to sit in the row in front of me. The keynote by <a href="http://www.itleadershipacademy.com/Thornton_May.html">Thornton May</a> provided a lot of humor that kicked off a very energetic event. In the second session, we heard from data scientists and teams from <a href="http://bloom.io">Bloom Studios</a>, <a href="http://www.23andme.com">23andMe</a>, <a href="http://www.kaggle.com/">Kaggle</a> and <a href="http://www.google.com">Google</a>. I was happy to see somebody from Google present, as they never seem to attend these types of events (neither does Facebook).</p>
<table border="0">
<tbody>
<tr>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/05/Screen-shot-2011-05-13-at-4.39.59-PM-e1305330697869.png" alt="" /></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/05/23andme-e1305330883395.jpg" alt="" width="79" height="54" /></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/05/Kaggle_logo-e1305330739136.png" alt="" /></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/05/ps_logo2-e1305330788421.png" alt="" /></td>
</tr>
</tbody>
</table>
<p>There has been a lot of buzz about 23andMe and Kaggle in the past few months. It is hard to keep up with all of the buzz, so it was great to hear from the companies themselves. <a href="http://www.23andme.com">23andMe</a> provides users with a kit containing a test tube into which the user spits. The kit is then sent back to 23andMe&#8217;s labs, which analyze something like 500,000 to a million different markers (I am not a biologist) and can report which markers are present, such as those indicating a predisposition to diabetes or cancer. In 2011, it costs about $5,000 to do this analysis whereas 10 to 20 years ago the figure was in the millions. 23andMe goes a step further. They understand that genetics have a strong association with particular conditions, but that they are not necessarily causal. For example, someone with a predisposition to diabetes will not necessarily develop the disease. 23andMe wants to integrate other data into their models to help <em>predict </em>how likely a patient is to develop a certain condition, given their genetics.</p>
<p><a href="http://www.kaggle.com">Kaggle</a> is a community-based platform for individuals and organizations to submit datasets and open them up to the Data Science community for analysis&#8230;as a competition. I love the geekiness of this endeavor, and it continues where the <a href="http://www.netflixprize.com/">Netflix Prize</a> left off. Kaggle has some awesome prizes for winning the competition such as $3M for the <a href="http://www.heritagehealthprize.com/c/hhp">Heritage Health Prize</a>. There are other freebies as well, such as <a href="http://info.revolutionanalytics.com/Kaggle.html">Revolution R Enterprise free for competitors</a>.</p>
<p><img class="lfloatbox" src="http://www.bytemining.com/wp-content/uploads/2011/05/ucsbwave-e1305331068331.jpg" alt="" /> As a disclaimer, I am not a huge visualization guy. I see its importance and usefulness in educating end-users about statistical results, and there are quite a few infographics that are exciting to me. However, there are many times when a boring ol&#8217; boxplot works better than a <a href="http://www.processing.org">Processing</a> applet. So, it takes quite a bit to get me excited about cutting-edge graphics. The <em>Immersive Data Visualization </em>session by <a href="http://www.create.ucsb.edu/~musjkm/">Dr. JoAnn Kuchera-Morin</a> from <a href="http://www.ucsb.edu/">UC Santa Barbara</a> did exactly that. They have created a large metal sphere, called <a href="http://www.allosphere.ucsb.edu/">AlloSphere</a>, containing a bridge in the center where researchers/analysts stand. Their data is projected, for the eye, throughout the ball in a 3D, or 3D-like world. Of course, data can be represented several ways to the eye: color, size, shape, texture, etc. AlloSphere also represents data using the other senses, particularly sound. In her presentation, JoAnn took us on a 3D tour of her colleague&#8217;s brain (<a href="http://en.wikipedia.org/wiki/Fmri">fMRI</a>). Of course, we could &#8220;see&#8221; the inside of the brain, but we could also <em>hear </em>the blood pressure change in different parts of the brain, indicating differing activities. There were some other demonstrations of studies from physics, but I cannot comment on those because I lost interest (physics has always been my worst subject). I attended UC Santa Barbara for one year after high school, so I am particularly proud of what they have done.<br />
<table border="0">
<tbody>
<tr>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/05/cal-e1305331523970.jpg" alt="" width="66" height="52" /></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/05/deloitte_logo-e1305331761441.jpg" alt="" width="129" height="24" /></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/05/oreilly-e1305331685411.jpg" alt="" /></td>
</tr>
</tbody>
</table>
<p>Of all the presentations on the first day, <em>Data Scientist DNA </em>was my favorite. In this panel, <a href="http://www.kaggle.com/pages/team">Anthony Goldbloom</a> of Kaggle, Joe Hellerstein from <a href="http://www.berkeley.edu">UC Berkeley</a>, David Steier from <a href="http://www.deloitte.com">Deloitte</a> and <a href="http://www.oreillynet.com/pub/au/2717">Roger Magoulas</a> from <a href="http://www.oreillynet.com">O&#8217;Reilly Media</a> discussed what makes a good Data Scientist or &#8220;data ninja&#8221; as stated in the program. All were in agreement that candidates should have an understanding of Probability and Statistics, although someone on the panel suggested that a &#8220;basic&#8221; background was all that was needed; I disagree with that. A Data Scientist should also be a proficient programmer in some language, either compiled or interpreted, and understand at least one statistical package. More importantly, the panel stressed that above and beyond knowledge, it is imperative that a Data Scientist be willing to learn new tools, technologies and languages on the job. Dr. Hellerstein suggested some general guidelines on classes students should take: Statistics (I argue for a full year of upper division statistics, and graduate study), Operating Systems, Database Systems and Distributed Computing. My favorite quote from the panel came from David Steier, &#8220;you don&#8217;t just hire a Data Scientist by themselves, you hire them onto a team.&#8221; I could not agree more. Finally, the moderator of the panel suggested that Roger Magoulas may have been the one to coin the term &#8220;big data&#8221; in 2005, but a Twitter follower found evidence that <a href="http://t.co/EqXnTRC">the term has been used since as early as 2000</a> (Thanks Amund! <a href="http://twitter.com/atveit">@atveit</a>).</p>
<p><img class="lfloatbox" src="http://www.bytemining.com/wp-content/uploads/2011/05/Code-for-America-e1305331916178.jpg" alt="" /> The last session of the day was given by <a href="http://codeforamerica.org/author/jen/">Jennifer Pahlka</a> from <a href="http://codeforamerica.org">Code for America</a> titled <em>Imagining and Enabling a Better World. </em>Pahlka started her talk by stating that the millennial generation is the most &#8220;pro-government&#8221; generation of the modern day. Regardless of politics, millennials see potential in the government and believe that it can be used for good. Jennifer compared <em>Code for America </em>to <em>Teach for America </em>for Data Scientists. The goal of Code for America is to put together very bright minds to tackle local, state and federal government issues using data. Pahlka brilliantly stated, &#8220;we don&#8217;t need guns, we need geeks. We are trying to create a geek army.&#8221;</p>
<p>During the end of day cocktail reception, I scored two posters of data visualizations: &#8220;super powers&#8221; and &#8220;game controllers over the years.&#8221; The other two posters offered were &#8220;beers&#8221; and &#8220;rappers.&#8221; I also had a chance to quickly meet <a href="http://oreilly.com/oreilly/tim_bio.html">Tim O&#8217;Reilly</a>, Founder and CEO of O&#8217;Reilly Media, whose books are my favorite for learning programming languages and technologies (the animal books).</p>
<p><strong>Data Scientist Summit, Day 2</strong></p>
<p>Personally, I enjoyed the second day more than the first day but that may have been due to the fact that I got sleep the night before. </p>
<p><img class="lfloatbox" src="http://www.bytemining.com/wp-content/uploads/2011/05/wefeelfine-e1305332076350.gif" alt="" /> It seemed that the highlight of the morning was the talk by <a href="http://www.number27.org/">Jonathan Harris</a> titled <em>The Art and Science of Storytelling</em>. He introduced his project &#8220;<a href="http://www.wefeelfine.org/">We Feel Fine</a>&#8221; which is a conglomeration of emotions. His project aims to capture the status of the human condition. This was more of the touchy-feely kind of presentation which is different from most of the Data Science talks. He showed beautiful user interfaces and great examples of fluid user experience. Some statistics that caught my eye concerned human emotion over time. It seemed that people experienced loneliness earlier in the week than later in the week. Joy and sadness were approximately inversely related throughout the week and hours of the day, but I cannot remember the direction of the trends. The most interesting graphics involved the difference between &#8220;feeling fat&#8221; and &#8220;being fat.&#8221; States like California and New York were hot spots for &#8220;feeling fat&#8221;, but they are actually some of the skinniest states. Instead, the region between the Gulf of Mexico and the Great Lakes was actually the fattest, but did not feel that way. A graphic for &#8220;I feel sick&#8221; showed a hotspot in Nevada which I thought was very interesting (nuclear fallout? alcohol poisoning in Vegas?). The interesting part of this discussion was that it showed the vast geography of the field called Data Science. Some Data Scientists are more of the visualization and human connection variety, and others (where I consider myself) are more of the classic geeks that like to write code and dig into the data to get a noteworthy result. Well, I guess there isn&#8217;t much difference between both camps after all. As Jonathan would probably say, Data Science is about storytelling.</p>
<table border="0">
<tbody>
<tr>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/05/linkedin-e1305333038887.png" alt="" /></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/05/mechanicalturk-e1305333062307.gif" alt="" /></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/05/factual-e1305330766477.png" alt="" width="54" height="56" /></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/05/booz_allen_hamilton-e1305332849283.jpg" alt="" width="138" height="16" /></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/05/karmasphere.jpg" alt="" /></td>
</tr>
</tbody>
</table>
<p>The next few sessions got a bit blurry (as is Data Science); they talked about various interconnected topics. <a href="http://www.datawrangling.com/about/">Pete Skomoroch</a> from <a href="http://www.linkedin.com">LinkedIn</a>, Sharon Franks Chiarella from <a href="https://www.mturk.com/mturk/welcome">Amazon Mechanical Turk</a>, <a href="http://www.crunchbase.com/person/gil-elbaz">Gil Elbaz</a> from <a href="http://www.factual.com">Factual</a> and <a href="http://www.oreillynet.com/pub/au/2972">Toby Segaran</a> from Google discussed the fact that you can&#8217;t turn data into a story without joining the data with, well, other data. Another major topic discussed was how to get labeled data, and this is where Mechanical Turk stands out as a data resource. The next talk was humorously titled <em>Hadoop &#8211; The Data Scientist&#8217;s Dream. </em>I know some people that would gouge their eyes out when seeing that title. Really, Map-Reduce is the Data Scientist&#8217;s dream, but yeah, yeah, I know, Hadoop is the first widely accepted implementation. Paul Brown from <a href="http://www.boozallen.com/">Booz Allen Hamilton</a> and Martin Hall from <a href="http://www.karmasphere.com">Karmasphere</a> discussed how Hadoop is typically being used in production and then briefly mentioned how Hadoop&#8217;s cousins make the Hadoop ecosystem more powerful.</p>
<table border="0">
<tbody>
<tr>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/05/SAS_logo-e1305333095334.jpg" alt="" /></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/05/informatica_logo-e1305333117665.jpg" alt="" /></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/05/cloudscale.png" alt="" /></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/05/revolution-e1305333148659.png" alt="" /></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/05/zementis-e1305333177380.gif" alt="" /></td>
</tr>
</tbody>
</table>
<p>The last session in this trifecta was titled <em>The Data Scientist&#8217;s Toolset &#8211; The Recipes that Win</em>. Representatives from various companies were panelists: <a href="http://www.sas.com">SAS</a>, <a href="http://www.informatica.com/Pages/index.aspx">Informatica</a>, <a href="http://www.cloudscale.com/">Cloudscale</a>, <a href="http://www.revolutionanalytics.com/">Revolution Analytics</a>, and <a href="http://www.zementis.com">Zementis</a>. I felt that this discussion was lacking. I believe the strength of the Data Science community stems from open-source technology, and except for Revolution Analytics, none of the companies have a strong reputation in the open-source community yet. Discussion seemed to focus too much on enterprise analytics (SQL, SAS, <a href="http://www.greenplum.com/">Greenplum</a> etc.) and Hadoop, and not enough on analysis and visualization. All in all, this panel was a bit too &#8220;enterprisey&#8221; for me. Some Twitterers felt that they were pushing their products too much. This was surprising because I felt the exact opposite, unless they were picking up on the &#8220;enterprisey&#8221; vibe. The panelists were asked what one tool for data science they would choose if they were on a desert island. The panelists responded with the following tools, &#8220;Perl, C++, Java, R [sic, thanks David], SQL and Python.&#8221; I was disappointed that SQL was mentioned without a countermention for NoSQL because not all data fits in a nice rectangle called a table. By itself, SQL is very limited. Python and R I definitely agree with. Perl is dated, but still has a use in the Data Scientist&#8217;s toolbox if the user is not familiar with Python, and doesn&#8217;t want to be. I was baffled by the C++ response and the lack of overlap in the other responses. But these are my opinions only.</p>
<p>The Summit Spotlight, <em>Secrets of Attribution &#8211; The Stories Beyond the Last-Click </em>discussed how researchers are trying to use data to &#8220;give credit&#8221; to not only the site that referred the user to a resource via a click, but all of the sites in the path that led to that click, the so-called &#8220;conversion path&#8221; in SEO land. The final session, <em>Building Data Science Firepower &#8211; Taking the Leap</em> was very similar to the <em>Data Scientist DNA </em>talk but added in some food for thought. There are two philosophies for hiring and working with Data Scientists. The first is to hire a strong data science team, and the second is to enhance each team with Data Scientists.</p>
<p><strong>2011 EMC Data Hero Awards</strong></p>
<p>At the end of the summit, the <a href="http://www.greenplum.com/media-center/big-data-use-cases/data-hero-awards">recipients of the EMC Data Hero Awards were announced</a>. I missed some of the honorable mentions, but here goes:</p>
<ul>
<li><em>Consumer Services, </em> LinkedIn.</li>
<li><em>Energy, </em>Silver Spring Networks.</li>
<li><em>Health Care, </em>Jeffrey Brenner, The Camden Coalition.</li>
<li><em>Life Sciences, </em>The Broad Institute of MIT and Harvard.</li>
<li><em>Media</em>, CMU Create Lab.</li>
<li><em>Public Services, </em>Global Virus Forecasting Initiative.</li>
<li><em>Technology Application, </em>IBM Watson Computing System.</li>
<li><em>Technology IT Infrastructure, </em>Apache Foundation, Hadoop.</li>
<li><em>Visionary, </em>Vivek Kundra, CIO, Data.gov.</li>
</ul>
<p>Vivek Kundra was not present at the summit, but recorded a message to the attendees, which was really cool. He stated that in 2009, there were only 47 government datasets publicly available; in 2011, there are close to 400,000 datasets available to the public.</p>
<table border="0">
<tbody>
<tr>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/05/logo_zynga-e1305335191505.png" alt="" /></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/05/tableauSoftware01-e1305335214926.jpg" alt="" /></td>
</tr>
</tbody>
</table>
<p>There were some interesting honorable mentions. <a href="http://www.zynga.com">Zynga</a> received an honorable mention for Consumer Services. As a player of <a href="http://www.farmville.com">Farmville</a> and <a href="http://www.cityville.com">Cityville</a>, I can see the plethora of data that Zynga must work with. Additionally, Zynga has some very creative ways of advertising brands such as McDonald&#8217;s and Tostitos (with Farmville items for both companies), 7-Eleven&#8217;s new Slurpee (<a href="http://blog.games.com/2011/01/07/play-cityville-and-zynga-will-unlock-goji-berries-in-farmville/">seeds for the Goji Berry</a>), and <a href="http://www.zynga.com/ladygaga/">GagaVille</a>. Zynga also participates in community service: &#8220;<a href="http://www.zynga.com/about/article.php?a=20100512">Sweet Seeds for Haiti</a>&#8221; (pay to plant special seeds, with proceeds to Haiti) just to name one.</p>
<p><a href="http://www.tableausoftware.com/">Tableau Software</a> also received an honorable mention for the Media category. Tableau develops data visualization software, and is picking up huge steam in the data viz community.</p>
<p>The conference ended with an awesome video created by EMC called &#8220;I Am a Data Scientist&#8221; featuring several <a href="http://www.emc.com">EMC</a> Data Scientists, most of whom I happened to have lunch with!</p>
<p><strong>Overall Impression</strong></p>
<p>All in all, the Data Scientist Summit was an eye-opening and empowering event, and it was planned in only six weeks. There was a great sense of community and collaboration among those in attendance. I work as a Data Scientist professionally because I love it. The one fact that I tend to overlook is that Data Scientists are in high demand and short supply. I was reminded of how important our work as Data Scientists is.</p>
<p>This was the <em>first annual </em>Data Scientist Summit, and I will no doubt be back. With that said, the technical discussions had a bit of an introductory flavor, which made the technology seem dated. For example, &#8220;Vanilla&#8221; Hadoop was introduced as a tool for processing vast amounts of data. I would expect that most Data Scientists have worked with Hadoop, or at least know what it is. Hadoop is somewhat old news in terms of &#8220;cutting-edge technology.&#8221; Tools like <a href="http://pig.apache.org/">Pig</a>, <a href="http://nathanmarz.com/blog/introducing-cascalog-a-clojure-based-query-language-for-hado.html">Cascalog</a>, <a href="http://hbase.apache.org/">HBase</a>, <a href="http://hive.apache.org/">Hive</a>, <a href="http://www.cascading.org/">Cascading</a>, etc. would have been better discussion topics. I was also disappointed with how little coverage of data mining tools there was (except for <a href="http://hadoop.apache.org/">Hadoop</a>, <a href="http://en.wikipedia.org/wiki/NoSQL">NoSQL</a>, and enterprise databases). It seemed as if <a href="http://www.r-project.org">R</a> had gone M.I.A. and I was surprised that there was so little discussion of visualization tools like Tableau, Processing, <a href="http://gephi.org/">Gephi</a>, <a href="http://mbostock.github.com/d3/">D3</a>, <a href="http://polymaps.org/">Polymaps</a>, etc.</p>
<p>The Data Scientist Summit set a very solid foundation for the future. I felt like the modus operandi was &#8220;here is why Data Science is cool&#8221; and &#8220;here is why others should be interested.&#8221; Although this is not a groundbreaking discussion, it sets the stage for future conferences and solidification of the community. The people who probably got the most value out of the <em>technical</em> discussions were people looking to switch careers, or enter Data Science.</p>
<p>Without a doubt I will be at next year&#8217;s Data Scientist Summit!</p>
<p><em>My thoughts and opinion on this blog do not reflect those of my employer, the Rubicon Project.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.bytemining.com/2011/05/review-of-2011-data-scientist-summit/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>EC2 Trials and Tribulations, Part 1 (Web Crawling)</title>
		<link>http://www.bytemining.com/2011/05/ec2-trials-and-tribulations-part-1-web-crawling/</link>
		<comments>http://www.bytemining.com/2011/05/ec2-trials-and-tribulations-part-1-web-crawling/#comments</comments>
		<pubDate>Wed, 11 May 2011 17:00:00 +0000</pubDate>
		<dc:creator>Ryan</dc:creator>
				<category><![CDATA[Amazon EC2]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Databases and Datastores]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Web Mining]]></category>

		<guid isPermaLink="false">http://www.bytemining.com/?p=632</guid>
		<description><![CDATA[<p></p>
<p>Elastic Compute Cloud (EC2) is a service provided by Amazon Web Services that allows users to leverage computing power without the need to build and maintain servers, or spend money on special hardware. The idea is simple: the user &#8220;boots&#8221; up one or more machines and then accesses those machines as if they were logged into any other machine remotely. I used EC2 and Elastic MapReduce extensively for my M.S. thesis last spring, but mainly used its large memory capabilities rather than its potential for explicit parallelism.</p>
<p>Recently, I ran a crawling job on EC2 using a parallel crawler I wrote in Python with twill. Using EC2 poses its own challenges. Using parallel code poses more challenges. Combining these two facts with the fact that crawling is I/O bound can create some more interesting challenges. If you have taken a course in operating systems, you have heard this stuff over and over again. So have I, but I am stubborn. I tend to learn lessons from experience, and this was no exception. Through this series of posts, I want to point out difficulties and &#8220;gotchas&#8221; that are important to keep in mind when using EC2, and in this post, with [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://www.bytemining.com/wp-content/uploads/2011/05/logo_aws.gif" alt="" /></p>
<p><a href="http://aws.amazon.com/ec2/">Elastic Compute Cloud (EC2)</a> is a service provided by <a href="http://aws.amazon.com/">Amazon Web Services</a> that allows users to leverage computing power without the need to build and maintain servers, or spend money on special hardware. The idea is simple: the user &#8220;boots&#8221; up one or more machines and then accesses those machines as if they were logged into any other machine remotely. I used EC2 and <a href="http://aws.amazon.com/elasticmapreduce/">Elastic MapReduce</a> extensively for my M.S. thesis last spring, but mainly used its large memory capabilities rather than its potential for explicit parallelism.</p>
<p>Recently, I ran a crawling job on EC2 using a parallel crawler I wrote in Python with <a href="http://twill.idyll.org/">twill</a>. Using EC2 poses its own challenges. Using parallel code poses more challenges. Combining these two facts with the fact that crawling is I/O bound can create some more interesting challenges. If you have taken a course in operating systems, you have heard this stuff over and over again. So have I, but I am stubborn. I tend to learn lessons from experience, and this was no exception. Through this series of posts, I want to point out difficulties and &#8220;gotchas&#8221; that are important to keep in mind when using EC2, and, in this post, when using parallelism in your code to accomplish large tasks.</p>
<p><strong>Monitor your Instances</strong></p>
<p>Monitoring your instances has two important benefits. First, it lets you make sure that you are not maxing out resources on the machine. Second, because EC2 is &#8220;elastic,&#8221; with some clever programming you can boot up more machines if you notice resources becoming scarce on your current machines, and then decommission them later when they are not needed. I did not do this at first, and I ran into several issues.</p>
<p><em>Disk Space. </em>The concept of a &#8220;disk&#8221; is very confusing in EC2. The <a href="http://en.wikipedia.org/wiki/Amazon_Machine_Image">AMI</a> forms a disk, sort of. Above and beyond the OS and any other software and packages you may install as part of the AMI, you can use whatever free space is remaining to store output files. The total disk space used by the AMI seems to be configured at the moment the AMI is constructed. <strong>Thus, it is not a good idea to store files in the instance. </strong>I did this. Fortunately, I found out before it was too late that my &#8220;disk&#8221; was filling up. I wrote a <a href="http://en.wikipedia.org/wiki/Cron">cron job</a> to copy all of my output files to <span style="font-family: courier new,courier;">/mnt</span> every five minutes. <strong>Use <span style="font-family: courier new,courier;">/mnt</span> to store your files as it has lots and lots of space; HOWEVER if you terminate your instance, the files are gone. This is still true if you use space within the instance.</strong> Once your job completes, upload your files to <a href="http://aws.amazon.com/s3/">S3</a>. <span style="font-family: courier new,courier;"><a href="http://s3tools.org/s3cmd">s3cmd</a></span> allows access to S3 from the command line, and <a href="https://github.com/pcorliss/s3cmd-modification">with the modification here</a>, you can upload and download files in parallel (a life saver for big batches). Another option is to create an <a href="http://aws.amazon.com/ebs/">EBS volume</a>, mount it, and write files directly to the EBS volume. EBS space is much more expensive than S3 space.</p>
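<p>A check like the one below could run alongside the cron job to warn before the &#8220;disk&#8221; fills up. It is only a minimal sketch, assuming Python 3 (for <span style="font-family: courier new,courier;">shutil.disk_usage</span>); the 80% threshold and the function names are made up for illustration and are not what the crawler actually used.</p>
<pre class="brush: python; title: ; notranslate">
import shutil

def disk_used_fraction(path='/'):
    """Return the fraction (0.0 to 1.0) of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return usage.used / usage.total

def should_offload(path='/', threshold=0.8):
    """True once usage reaches the (assumed) threshold and files should be moved away."""
    return disk_used_fraction(path) >= threshold

if __name__ == '__main__':
    print('root filesystem is %.0f%% full' % (100 * disk_used_fraction('/')))
</pre>
<p>Pointing the same check at <span style="font-family: courier new,courier;">/mnt</span> tells you when it is time to push a batch to S3.</p>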
<p><em>Memory. </em>On my first attempt, I maxed out memory to the point that the OS killed 6 of my 8 processes. This dealt a huge blow to the performance of my crawler and wasted the extra money I spent on an extra large instance. Monitor your job&#8217;s memory using <span style="font-family: courier new,courier;">top</span>. If memory usage seems to grow too fast for your liking, consider using a <a href="http://stackoverflow.com/questions/110259/python-memory-profiler">memory profiler</a> to make sure that there are no memory leaks in your code. I have found that long-running Python processes eat up a lot of RAM, even if no data structures are explicitly growing.</p>
<p>Additionally, maxing out RAM means that the OS will begin to swap to disk. This is devastating to performance because the extra grinding of the disk decreases the total I/O throughput your job can handle. This matters for crawlers, as files need to be written to disk quickly.</p>
<p>If after profiling you find that your job is still using too much RAM, consider caching, or using a high memory EC2 instance.</p>
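<p>As one illustration of what profiling can look like, here is a small sketch using <span style="font-family: courier new,courier;">tracemalloc</span> from the Python 3 standard library. That module did not exist for the Python 2 used in this crawl, so treat it as a stand-in for whichever profiler you choose; the &#8220;leaky cache&#8221; is a contrived example rather than code from the crawler.</p>
<pre class="brush: python; title: ; notranslate">
import tracemalloc

def top_growth(before, after, limit=3):
    """Return the `limit` source lines whose allocations changed the most."""
    return after.compare_to(before, 'lineno')[:limit]

tracemalloc.start()
before = tracemalloc.take_snapshot()

# Stand-in for a crawl loop that accidentally keeps every page in memory.
leaky_cache = []
for page_number in range(1000):
    leaky_cache.append(bytearray(1000))  # hypothetical page body

after = tracemalloc.take_snapshot()
tracemalloc.stop()

for stat in top_growth(before, after):
    print(stat)  # file, line number, and how much its allocations grew
</pre>
<p>Comparing two snapshots taken an hour apart in a long-running process is usually enough to show which line keeps accumulating memory.</p>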
<p><em>I/O Throughput. </em>How fast your job consumes and produces data is a good way to determine if something is going amiss in your job, or with the other resources you are using. When I started my crawling job, I was crawling <em>n</em> pages per hour, but after twelve hours, the throughput decreased exponentially until it got so slow that I had to add more instances. <strong>One way to monitor throughput is to save the results of <span style="font-family: courier new,courier;">ls -latr --full-time</span> to disk and extract the date/time of each file. Using a tool like R, you can quickly plot your I/O throughput over time using <span style="font-family: courier new,courier;">aggregate()</span>. </strong>A decrease in I/O throughput can be the result of many things: 1) swapping from exhausting RAM, 2) low disk space, 3) network congestion within AWS, 4) poor resource performance (if crawling, the resource would be the website being crawled), 5) hammering an external resource and/or HTTP throttling, 6) congestion on the Internet. <strong>For crawling, you may want to consider using several smaller instances rather than fewer larger instances. This way, you will be accessing the resource from many IPs and the effect of being throttled should be lessened. Additionally, use instances that have &#8220;High&#8221; I/O performance; some are rated &#8220;Moderate&#8221; or &#8220;Low.&#8221;</strong></p>
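<p>The same files-per-hour count can also be computed directly from file modification times, without going through a directory listing and R. The following is a rough Python 3 sketch of that idea; the function names are invented here, and it assumes one output file per crawled page.</p>
<pre class="brush: python; title: ; notranslate">
import os
from collections import Counter
from datetime import datetime

def files_per_hour(mtimes):
    """Count files per hour bucket, given an iterable of POSIX timestamps."""
    counts = Counter()
    for ts in mtimes:
        hour = datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:00')
        counts[hour] += 1
    return dict(sorted(counts.items()))

def crawl_throughput(output_dir):
    """Files written per hour in output_dir, based on modification times."""
    with os.scandir(output_dir) as entries:
        mtimes = [e.stat().st_mtime for e in entries if e.is_file()]
    return files_per_hour(mtimes)
</pre>
<p>A steadily shrinking count from one hour to the next is the signal to start checking the six causes listed above.</p>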
<p><em>CPU. </em>A general rule of thumb is that you can run <em>n </em>processes in parallel, for <em>n </em>cores. Additionally, if each core supports <a href="http://en.wikipedia.org/wiki/Hyper-threading">hyperthreading</a>, then the number of processes you can run is approximately <em>2n</em>. If you run more than the suggested number, the price of context switching can slow down your performance. If you find the need to routinely exceed this guideline, use an instance with more cores.</p>
<p>When running parallel code, routinely do a <span style="font-family: courier new,courier;">ps aux | grep processName</span> to make sure the correct number of processes is running. If any were killed, this will be noted, with a reason, in the system log (<span style="font-family: courier new,courier;">/var/log/syslog</span> or <span style="font-family: courier new,courier;">/var/log/messages</span>, depending on the distribution).</p>
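<p>If the workers are started with Python&#8217;s <span style="font-family: courier new,courier;">multiprocessing</span> module, the parent process can perform a similar check itself. This is a hedged sketch for Python 3 on Linux, not the crawler&#8217;s actual code: it relies on the documented behavior that <span style="font-family: courier new,courier;">Process.exitcode</span> is negative when a child was terminated by a signal, and the helper names are hypothetical.</p>
<pre class="brush: python; title: ; notranslate">
import multiprocessing
import signal

def default_worker_count():
    """Logical CPU count, which already includes hyperthreads where present."""
    return multiprocessing.cpu_count()

def describe_exit(exitcode):
    """Translate a multiprocessing exit code into a short status string."""
    if exitcode is None:
        return 'running'
    if exitcode == 0:
        return 'finished'
    if exitcode > 0:
        return 'exited with status %d' % exitcode
    # A negative exit code -N means the process was terminated by signal N.
    return 'killed by %s' % signal.Signals(-exitcode).name

def report(workers):
    """Map each worker name to its status; `workers` are multiprocessing.Process objects."""
    return {w.name: describe_exit(w.exitcode) for w in workers}
</pre>
<p>A worker reported as killed by SIGKILL, with no other explanation, is a strong hint that the OS reclaimed its memory.</p>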
<p><em>Financial metrics. </em>Are you getting your money&#8217;s worth? Are you really using all of the cores you are paying for? Are you really using all of the memory you are paying for? This is up to you and your budget to dictate. But do not get carried away and assume that you must stay with the same instance size. Most AMIs can run on different instance types (except that 64-bit AMIs are restricted to m1.large and bigger).</p>
<p><strong>Quarantine Essential Services</strong></p>
<p>My crawler used <a href="http://redis.io/">Redis</a> as a work queue. Each process could easily write new thread IDs and page numbers to the queue as they are discovered, and read thread IDs and page numbers from the queue as each process is ready to crawl a page. One problem that I faced was that I coupled the crawling operation and queue management in the same script, and ran the Redis server on an instance where a crawler was also running. This coupling posed two challenges. First, it can harbor nasty bugs. Whenever a process was created on the master Redis node, my code would wipe the Redis queue clean to prepare it for crawling (bug!). <strong>Flushing the queue and the initial population of the queue should have taken place in two separate scripts. </strong>Due to my major bug, I wiped the entire queue clean in the middle of the crawl. Fortunately, I followed the advice in the next section.</p>
<p>Second, I had to be careful that my processes did not exceed RAM limitations. Because Redis is mainly an in-memory key-value store, it itself can hog up most of the RAM in the instance. <strong>For this reason, it is best to quarantine essential services such as queues to their own instances.</strong></p>
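<p>One way to picture the separation is three small functions, each meant to live in its own script. This is a sketch under assumptions, not the original crawler: the key name and the in-memory <span style="font-family: courier new,courier;">FakeQueueClient</span> are invented so the example runs without a Redis server. A real <span style="font-family: courier new,courier;">redis-py</span> client exposes the same <span style="font-family: courier new,courier;">delete</span>, <span style="font-family: courier new,courier;">rpush</span> and <span style="font-family: courier new,courier;">lpop</span> commands, though by default it returns bytes rather than strings.</p>
<pre class="brush: python; title: ; notranslate">
QUEUE_KEY = 'crawl:queue'  # hypothetical key name

def flush_queue(client):
    """Script 1: run by hand, once, before a fresh crawl."""
    client.delete(QUEUE_KEY)

def populate_queue(client, work_items):
    """Script 2: run once after flushing (or when rebuilding from logs)."""
    for item in work_items:
        client.rpush(QUEUE_KEY, item)

def next_item(client):
    """Crawler processes only ever pop work; they never clear the queue."""
    return client.lpop(QUEUE_KEY)

class FakeQueueClient:
    """In-memory stand-in exposing the three Redis list commands used above."""
    def __init__(self):
        self.lists = {}
    def delete(self, key):
        self.lists.pop(key, None)
    def rpush(self, key, value):
        self.lists.setdefault(key, []).append(value)
    def lpop(self, key):
        items = self.lists.get(key)
        return items.pop(0) if items else None

client = FakeQueueClient()  # with redis-py this would be a redis.Redis(...) connection
flush_queue(client)
populate_queue(client, ['1001:1', '1001:2'])  # made-up 'thread ID:page' entries
print(next_item(client))
</pre>
<p>Because starting a crawler never touches <span style="font-family: courier new,courier;">flush_queue</span>, restarting a worker in the middle of a crawl cannot wipe the queue.</p>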
<p><strong>Document Everything</strong></p>
<p>Log <em>everything</em>. Log every resource you are going to use (URLs for a crawl) and log everything that was done and any problems that arise. Using the directory structure (ls) as well as a log of what work was already performed, I was able to reconstruct and repopulate the work queue and essentially start where I left off. For my crawling operation, I wrote the following events to logs, each with a timestamp.</p>
<ul>
<li>Starting the crawl.</li>
<li>Logging in to the site being crawled.</li>
<li>Clearing and populating the queue.</li>
<li>Visiting a thread&#8217;s first page.</li>
<li>Discovering the number of pages of posts in the thread/inserting to the queue.</li>
<li>HTTP redirects, when a thread has been moved.</li>
<li>Visiting a thread ID that does not exist.</li>
<li>Inadvertent logouts, marking work to be redone.</li>
<li>Queuing inconsistencies.</li>
</ul>
<p>An <em>activity log </em>verbosely documented everything that occurred without logging actual data. An <em>inventory manifest </em>indicated which URLs/forum posts had valid content and how many pages of content were associated with them. A standard <em>directory listing </em>indicated what work had been done. By cross referencing the manifest and the directory listing, it is easy to see which posts had not yet been processed. A <em>system log </em>prepared by the operating system also documents critical failures for you, such as lack of disk space or processes being killed.</p>
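<p>The cross-referencing step amounts to a set difference. Here is a minimal Python sketch of it, assuming the manifest has been parsed into a mapping from thread ID to page count and the directory listing into (thread ID, page) pairs; both shapes, and the sample numbers, are assumptions made for the example.</p>
<pre class="brush: python; title: ; notranslate">
def remaining_work(manifest, completed):
    """Return the (thread_id, page) pairs in the manifest that are not yet on disk."""
    expected = {(thread_id, page)
                for thread_id, page_count in manifest.items()
                for page in range(1, page_count + 1)}
    return sorted(expected - set(completed))

# Made-up data: thread 101 has 3 pages, thread 102 has 1 page.
manifest = {101: 3, 102: 1}
completed = [(101, 1), (101, 3)]
print(remaining_work(manifest, completed))
</pre>
<p>The pairs it returns are exactly what needs to be pushed back onto the work queue to resume the crawl.</p>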
<p>When writing your logs, use the advice in the next section!</p>
<p><strong>Take Care Not to Clobber Files and Objects</strong></p>
<p>It&#8217;s been said over and over again. Each process should hold as much of its own real estate as possible. When two or more processes write to the same object, corruption can occur unless there is a locking mechanism in place. If two processes write to the same file at the same time, you will notice garbled entries in your logs. This did not affect my crawled data because each file was written by a single process. The same can be true for reading data as well. When spawning multiple processes, I shared the same Redis connection with all of the processes. If two processes read from the queue at the same time, one process would get the correct data (a thread ID and page number) and the other would get &#8220;OK&#8221;, which was the result of the first process&#8217;s fetch operation. This is mostly my fault, but partially <a href="https://github.com/andymccurdy/redis-py"><span style="font-family: courier new,courier;">redis-py</span></a>&#8217;s fault for filling some buffer between Python and Redis with meaningless information (&#8220;OK&#8221;). The safer approach is for each process to open its own connection.</p>
<p>Each process should write its own log files. When opening a file, you can use the following:</p>
<pre class="brush: python; title: ; notranslate">
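import os

# One log file per process, keyed by PID, so that writers never collide.
log_path = &quot;mylogfile-%d.log&quot; % os.getpid()
with open(log_path, &quot;w&quot;) as out:
    out.write(&quot;Starting the crawl.\n&quot;)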
</pre>
<p><strong>Crawler Specific: Set an Upper Bound</strong></p>
<p>Crawling is fun, but you must practice moderation or it is easy to attempt to boil the ocean. When I first started, I would run a crawl, have it crash, and then deem the data out of date and start over from the beginning and crawl until there was nothing left to crawl. It is good to set an upper bound: &#8220;I will crawl 10 days&#8217; worth of data&#8221;, or &#8220;I will only use threads created prior to May 1, 2011.&#8221;</p>
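<p>In code, such a bound can be a single predicate applied before anything is added to the work queue. This is a tiny Python sketch; the cutoff is the May 1, 2011 example from above, and the function name is made up.</p>
<pre class="brush: python; title: ; notranslate">
from datetime import date

CUTOFF = date(2011, 5, 1)  # only crawl threads created before this date

def within_bound(created, cutoff=CUTOFF):
    """True when a thread was created before the cutoff and should be crawled."""
    return cutoff > created

print(within_bound(date(2011, 4, 30)), within_bound(date(2011, 5, 1)))
</pre>
<p>With the bound fixed up front, a crashed crawl can resume against the same target instead of chasing new content.</p>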
<p>One of the keys to success with EC2 is to get over the penny pinching. If you have a project, just take the plunge and do it on EC2 (if required). The amount you spend on the first few projects will save you more on future projects.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bytemining.com/2011/05/ec2-trials-and-tribulations-part-1-web-crawling/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Location Tracking on Android, too!</title>
		<link>http://www.bytemining.com/2011/04/location-tracking-on-android-too/</link>
		<comments>http://www.bytemining.com/2011/04/location-tracking-on-android-too/#comments</comments>
		<pubDate>Sat, 23 Apr 2011 19:37:26 +0000</pubDate>
		<dc:creator>Ryan</dc:creator>
				<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Databases and Datastores]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.bytemining.com/?p=622</guid>
		<description><![CDATA[<p>This week it was revealed that the iPhone stores users&#8217; locations, and this immediately caused a huge firestorm of commentary by tech geeks, panic among privacy advocates, and delight to data geeks like myself. Even better/worse, it seems that the iPhone caches location traces long-term, possibly back to the date the phone was activated.</p>
<p>I ditched my iPhone this past December (good riddance) in favor of the Droid X (Android). I figured, on such an open source OS, Google must be doing the same thing. After surfing through Hacker News, it turns out I was right.</p>
<p>Compared to the iPhone though, getting the data on an Android phone is not simple.</p>

The data is stored in two files, cache.cell and cache.wifi in the directory /data/data/com.google.android.location/files.
First, the user cannot browse this directory by attaching it to a computer. I installed an SSH daemon QuickSSHD to allow remote access into my phone.&#160;
Second, it is not possible to access this directory without getting a Permission denied error, even if logged in as &#8220;root&#8221; as Google has not made this directory readable.
Finally, for those (myself) that are still determined to crack this nut, you will need to root your phone. This makes the &#8220;root&#8221; user a real [...]]]></description>
			<content:encoded><![CDATA[<p>This week <a href="http://radar.oreilly.com/2011/04/apple-location-tracking.html">it was revealed that the iPhone stores users&#8217; locations</a>, and this immediately caused a huge firestorm of commentary by tech geeks, panic among privacy advocates, and delight to data geeks like myself. Even better/worse, it seems that the iPhone caches location traces long-term, possibly back to the date the phone was activated.</p>
<p>I ditched my iPhone this past December (good riddance) in favor of the Droid X (Android). I figured, on such an open source OS, Google must be doing the same thing. After surfing through Hacker News, it turns out I was right.</p>
<p>Compared to the iPhone though, getting the data on an Android phone is not simple.</p>
<ul>
<li>The data is stored in two files, <tt>cache.cell</tt> and <tt>cache.wifi</tt>, in the directory <tt>/data/data/com.google.android.location/files</tt>.</li>
<li>First, the user cannot browse this directory by attaching it to a computer. I installed an SSH daemon, <a href="http://teslacoilsw.com/quicksshd">QuickSSHD</a>, to allow remote access into my phone.</li>
<li>Second, it is not possible to access this directory without getting a <tt>Permission denied</tt> error, even if logged in as &#8220;root&#8221; as Google has not made this directory readable.</li>
<li>Finally, for those (myself) that are still determined to crack this nut, you will need to root your phone. This makes the &#8220;root&#8221; user a real superuser that has near complete control over the phone.</li>
</ul>
<p>Once I downloaded the files to my Mac (via <tt>scp</tt>), I downloaded this handy-dandy parser from <a href="http://twitter.com/#!/packetlss"><tt>packetlss</tt></a> called <a href="https://github.com/packetlss/android-locdump"><tt>android-locdump</tt></a> and converted the <tt>cache.cell</tt> and <tt>cache.wifi</tt> files into GPX files by passing the <tt>--gpx</tt> flag. You can also leave off the <tt>--gpx</tt> flag and parse the output yourself.</p>
<p>Then I used <a href="http://www.gpsbabel.org/">GPSBabel</a> to convert the GPX files to CSV files and loaded them into R. While this was great for a static view, the lack of interactive zooming made working with this type of data more difficult. I then used some code from the <a href="http://cran.r-project.org/web/packages/RgoogleMaps/index.html"><tt>RgoogleMaps</tt></a> <a href="http://cran.r-project.org/web/packages/RgoogleMaps/vignettes/RgoogleMaps-intro.pdf">package vignette</a>, as adapted by <a href="http://malecki.blogspot.com/2011/04/quick-iphone-location-data.html">Michael Malecki</a>. [Drew Conway has developed <a href="https://github.com/drewconway/stalkR">stalkR</a> for analyzing iPhone and iPad location data in R.]</p>
<pre class="brush: r; title: ; notranslate">
library(RgoogleMaps)
Df &lt;- read.csv(&quot;CSV file&quot;, header=FALSE)
names(Df) &lt;- c(&quot;Latitude&quot;, &quot;Longitude&quot;, &quot;Key&quot;)
bb &lt;- qbbox(lat=range(Df$Latitude), lon=range(Df$Longitude))
m &lt;- c(mean(Df$Latitude), mean(Df$Longitude))
zoom &lt;- min(MaxZoom(latrange=bb$latR,lonrange=bb$lonR))
Map &lt;- GetMap.bbox(bb$lonR, bb$latR, zoom=zoom, maptype=&quot;mobile&quot;,
NEWMAP=TRUE, destfile=&quot;tempmap.jpg&quot;, RETURNIMAGE=TRUE, GRAYSCALE=TRUE)
tmp &lt;- PlotOnStaticMap(lat=Df$Latitude, lon=Df$Longitude,
cex=.7,pch=20,col=&quot;red&quot;, MyMap=Map, NEWMAP=FALSE)
</pre>
<p><img style="display: block; margin-left: auto; margin-right: auto;" src="http://www.bytemining.com/wp-content/uploads/2011/04/wla.png" alt="" width="600" height="450" /></p>
<p>The map clusters my activity into a few familiar categories: work, school (Math Sciences Building actually), home, and my parents&#8217;. Android also picked up a dinner outing in Santa Monica, and a trip to the Shopzilla office for the <a href="http://www.meetup.com/LA-HUG/">Los Angeles Hadoop User Group</a> meetup, but little else.</p>
<p><strong>What I Found</strong></p>
<p>The <tt>cache.cell</tt> file records locations obtained by cell tower triangulation. In addition to the imprecision of this measure, Android&#8217;s location tracker has several limitations:</p>
<ol>
<li>It seems that location is recorded infrequently. I had expected to see trails of activity corresponding to walking or driving. All of my activity is clustered in areas where I am most likely stopped (on campus, at work, at home, in Santa Monica, and at the intersection of Gayley and Wilshire which has an excruciatingly painful wait). <strong>The iPhone location history seems to be much more complete/useful.</strong></li>
<li>According to the old Android source, only the last 50 cell locations, and last 200 WiFi locations are recorded (boring). My phone seemed to record more than 50 cell locations (approximately 200), but this is small.</li>
<li>I couldn&#8217;t even convert the <tt>cache.wifi</tt> file because it was apparently empty. This file is apparently cleared when WiFi is disabled.</li>
</ol>
<p>I also found that I need to get out more.</p>
<p><strong>Why Would Apple do Such a Thing?</strong></p>
<p>Earlier iPhone models (up to 2010 apparently) used <a href="http://www.skyhookwireless.com/">Skyhook</a> for their geo-location database. Skyhook employees basically drive cars wired with WiFi sensors and GPS and do what is called &#8220;<a href="http://en.wikipedia.org/wiki/Wardriving">wardriving</a>.&#8221; They drive around cities recording information about the access points they encounter and where they encounter them. When a user logs onto the web via one of those access points, Skyhook customer sites can cross-reference the access point with its physical location. As of August 2010, Apple dropped Skyhook. Why?</p>
<p>I suspect Apple is using this data to build its own geo-location database, yet there is no evidence that the files on the iPhone are actually being transmitted to Apple. If it is true that the location database is actually transmitted to the user&#8217;s computer, it&#8217;s possible that Apple uses this data from Safari to enable geo-location features in it.</p>
<p>The investigative side of me says that this could be useful in a missing persons case if the phone is dropped.</p>
<p><strong>Android or iPhone?</strong></p>
<p>Apple and Google pursued different approaches in caching users&#8217; locations. Apple used a standard database file stored on the phone. Although this file is hidden in the phone, it seems to be transmitted to the user&#8217;s computer. The user can then open the file and see what Apple is storing about them. Heck, they could even modify it to privatize it. The iPhone updates this information very frequently, and keeps it around for a very long time. The file is there, the user knows it is there, and the user can see what is in the file. Unfortunately, this also means that people will overreact.</p>
<p>Google, on the other hand, hid the file deep in the filesystem such that a terminal connection is necessary to reach it, and &#8220;rooting&#8221; the phone is necessary to see its content. The user has no idea that this file exists, and cannot see what Google is storing about them. This is a bit shady. On the other hand, the information that Google is collecting is very minimal and has questionable use. Data is not updated often, and is not held on disk for very long. It is also possible to clear at least the WiFi location cache file by turning WiFi off and on.</p>
<p>So, what do you think about all of this?</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bytemining.com/2011/04/location-tracking-on-android-too/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Instructions for Installing 64bit SciPy, Python 2.7.1 on MacOS X 10.6</title>
		<link>http://www.bytemining.com/2011/03/instructions-for-installing-64bit-scipy-python-2-7-1-on-macos-x-10-6/</link>
		<comments>http://www.bytemining.com/2011/03/instructions-for-installing-64bit-scipy-python-2-7-1-on-macos-x-10-6/#comments</comments>
		<pubDate>Mon, 28 Mar 2011 19:00:00 +0000</pubDate>
		<dc:creator>Ryan</dc:creator>
				<category><![CDATA[Programming Languages]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Statistics and Statistical Computing]]></category>

		<guid isPermaLink="false">http://www.bytemining.com/?p=586</guid>
		<description><![CDATA[<p>NumPy and SciPy are packages for numerical computation and scientific computing, for Python.</p>
<p>One wrinkle with NumPy/SciPy that needs to be ironed out is the difficulty of installation on certain OSes, and particularly, architectures. The SciPy SuperPack has done a good job of taking care of this issue, but it has not yet been updated for 2.7.1 and manually hacking away at its script has not worked for me.</p>
<p>I cannot take credit for the instructions in this article. A brave warrior, Jeremy Conlin, somehow managed to figure out how to install 64-bit NumPy and SciPy, with 64-bit Python 2.7.1 on Snow Leopard; he posted the directions to the SciPy User mailing list on February 24. I followed the directions, and miraculously they worked. I am reproducing them here for Google bait.</p>
<p>Install Python 2.7.1</p>
<p>1. Download the universal Mac 2.7.1 installer here (Python 2.7.1 Mac OS X 64-bit/32-bit x86-64/i386 Installer). Typically, Python will be installed to /Library/Frameworks/Python.framework/Versions/2.7/, but may be in other locations.</p>
<p>2. Verify that your new version of Python is 64-bit enabled. Note: Python installations typically do not get toggled as the default Python, so find the location of the 2.7.1 Python executable. On my machine, it is /Library/Frameworks/Python.framework/Versions/2.7/bin/python. python2.7 should also work.</p>
<p>Load [...]]]></description>
			<content:encoded><![CDATA[<p>NumPy and SciPy are packages for numerical computation and scientific computing, for Python.</p>
<p>One wrinkle with NumPy/SciPy that needs to be ironed out is the difficulty of installation on certain OSes, and particularly, architectures. The <a href="http://stronginference.com/scipy-superpack/">SciPy SuperPack</a> has done a good job of taking care of this issue, but it has not yet been updated for 2.7.1 and manually hacking away at its script has not worked for me.</p>
<p>I cannot take credit for the instructions in this article. A brave warrior, Jeremy Conlin, somehow managed to figure out how to install 64-bit NumPy and SciPy, with 64-bit Python 2.7.1 on Snow Leopard; he posted the <a href="http://mail.scipy.org/pipermail/scipy-user/2011-February/028567.html">directions to the SciPy User mailing list on February 24</a>. I followed the directions, and miraculously they worked. I am reproducing them here for Google bait.</p>
<p><strong>Install Python 2.7.1</strong></p>
<p>1. Download the universal Mac 2.7.1 installer <a href="http://www.python.org/ftp/python/2.7.1/python-2.7.1-macosx10.6.dmg">here</a> (Python 2.7.1 Mac OS X 64-bit/32-bit x86-64/i386 Installer). Typically, Python will be installed to <tt>/Library/Frameworks/Python.framework/Versions/2.7/</tt>, but may be in other locations.</p>
<p>2. Verify that your new version of Python is 64-bit enabled. <strong>Note: Python installations typically do not get toggled as the default Python, so find the location of the 2.7.1 Python executable. </strong>On my machine, it is <tt>/Library/Frameworks/Python.framework/Versions/2.7/bin/python</tt>. <tt>python2.7</tt> should also work.</p>
<p>Load Python 2.7.1 and execute the code below. If you get <tt>64</tt>, then you are ready to proceed.</p>
<pre class="brush: python; title: ; notranslate">
 import sys
 from math import log
 log(sys.maxsize, 2) + 1
 </pre>
<p>Another way is to execute the following. If you get 1099511627776 (and NOT 1099511627776L) you are in good shape.</p>
<pre class="brush: python; title: ; notranslate">
 2**40
 </pre>
<p>(This tip comes from <a href="http://asmeurersympy.wordpress.com/2009/11/13/how-to-get-both-32-bit/">Aaron Meurer&#8217;s SymPy blog</a>)</p>
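<p>The same check can be scripted. The sketch below is one possible version and is not part of the original mailing list instructions: the helper names <tt>pointer_bits</tt> and <tt>is_64bit</tt> are made up for illustration, and it assumes Python 2.7 or later. It reads the pointer width from <tt>struct</tt> and cross-checks it against <tt>sys.maxsize</tt>.</p>
<pre class="brush: python; title: ; notranslate">
import struct
import sys


def pointer_bits():
    """Width of a C pointer in this interpreter, in bits."""
    return struct.calcsize("P") * 8


def is_64bit():
    """True when sys.maxsize is the largest signed 64-bit integer."""
    # bit_length() excludes the sign bit, hence the + 1
    return sys.maxsize.bit_length() + 1 == 64


print("%d-bit pointers, 64-bit build: %s" % (pointer_bits(), is_64bit()))
</pre>
<p>Run it with the interpreter you intend to build against, for example <tt>python2.7</tt>, since a different Python on the path may report a different width.</p>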
<p><strong>Install gfortran</strong></p>
<p>1. Download gfortran-4.2.3 <a href="http://r.research.att.com/gfortran-4.2.3.dmg">here</a> and double-click the file to mount the disk image.<strong>&nbsp;</strong></p>
<p>2. Create a temporary directory (I call it <tt>tmp</tt> here) and run the following commands.</p>
<pre class="brush: bash; title: ; notranslate">
 cd tmp
 mkdir gfortran
 cd gfortran
 pax -zrvf /Volumes/GNU\ Fortran\ 4.2.3/gfortran.pkg/Contents/Archive.pax.gz .
 cp -r usr/local/* /usr/local
 </pre>
<p><strong>Drop Down to the Root Shell</strong></p>
<p><strong>NOTE: </strong>From this point forward, I dropped to the root command prompt (<tt>sudo su -</tt>) so that I had full control over the environment. This is mainly due to the way the shell deals with <tt>PYTHONPATH</tt> and the complications that can stem from it when installing with root privileges.</p>
<p>Locate the Python 2.7.1 distribution and find a directory that ends in <tt>site-packages</tt>. For example, for me this is <tt>/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/</tt>.</p>
<p>Assuming Bash is your shell of choice, enter the following</p>
<p><tt>export PYTHONPATH=/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/</tt></p>
<p>Or, for tcsh,</p>
<p><tt>setenv PYTHONPATH /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/</tt></p>
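<p>Before moving on, it may be worth confirming that <tt>PYTHONPATH</tt> really points at the site-packages directory of the interpreter you are about to use. A minimal sketch, assuming it is run with the 2.7.1 interpreter; the function names are hypothetical and only the standard <tt>os</tt> and <tt>sysconfig</tt> modules are used:</p>
<pre class="brush: python; title: ; notranslate">
import os
import sysconfig


def site_packages_dir():
    """Directory where this interpreter installs pure-Python packages."""
    return sysconfig.get_paths()["purelib"]


def on_pythonpath(directory, environ=None):
    """True if directory is one of the entries listed in PYTHONPATH."""
    environ = os.environ if environ is None else environ
    entries = environ.get("PYTHONPATH", "").split(os.pathsep)
    target = os.path.normpath(directory)
    # normpath drops trailing slashes so both spellings compare equal
    return any(os.path.normpath(entry) == target for entry in entries if entry)


target = site_packages_dir()
print("%s on PYTHONPATH: %s" % (target, on_pythonpath(target)))
</pre>
<p>If the second value printed is <tt>False</tt>, the export or setenv line above probably did not take effect in the current shell.</p>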
<p><strong>Install distribute-0.6.15</strong></p>
<p>The next few steps follow the typical Python package installation method. I use <tt>python2.7</tt> rather than <tt>python</tt> just to make sure that my system is using 2.7.1 and not the built-in 2.6.1.</p>
<p>Distribute adds tools for installing Python modules, including <tt>easy_install</tt>.</p>
<pre class="brush: bash; title: ; notranslate">
 curl -O http://pypi.python.org/packages/source/d/distribute/distribute-0.6.15.tar.gz
 tar -xzvf distribute-0.6.15.tar.gz
 cd distribute-0.6.15
 python2.7 setup.py install
 </pre>
<p> <strong>Install nose</strong></p>
<p>Nose is an alternate test discovery and running process for unittest, similar to py.test.</p>
<p>Installation requires <tt>easy_install</tt>. I am not a fan of <tt>easy_install</tt> in the slightest. There are two important things to remember here:</p>
<ol>
<li>PYTHONPATH must be set correctly, or easy_install will complain.</li>
<li>If you have easy_install for another version of Python, <strong>make sure you use the version for 2.7.1 and not some previous version </strong>or easy_install will complain.</li>
</ol>
<p>On my system, easy_install for 2.7.1 is located at <tt>/Library/Frameworks/Python.framework/Versions/2.7/bin/easy_install</tt> so I use:</p>
<pre class="brush: bash; title: ; notranslate">
 /Library/Frameworks/Python.framework/Versions/2.7/bin/easy_install nose
 </pre>
<p> <strong>Install NumPy</strong></p>
<p>Finally, we can install NumPy. The 1.5 series is apparently the first to work with Python 3.</p>
<pre class="brush: bash; title: ; notranslate">
 curl -L -O http://downloads.sourceforge.net/project/numpy/NumPy/1.5.1/numpy-1.5.1.tar.gz
 tar xzvf numpy-1.5.1.tar.gz
 cd numpy-1.5.1
 sudo python2.7 setup.py install
 </pre>
<p> Then, open Python and execute the following to import the library, and test it. On my system, all tests passed.</p>
<pre class="brush: python; title: ; notranslate">
 import numpy
 numpy.test()
 </pre>
<p> <strong>Install SciPy</strong></p>
<p>Next, install SciPy. Version 0.9 is the first version that is compatible with Python 3.</p>
<pre class="brush: bash; title: ; notranslate">
 curl -L -O http://downloads.sourceforge.net/project/scipy/scipy/0.9.0/scipy-0.9.0.tar.gz
 tar xzvf scipy-0.9.0.tar.gz
 cd scipy-0.9.0
 sudo python2.7 setup.py install
 </pre>
<p> Then, open Python and execute the following to import the library, and test it. On my system, all but 14 tests passed.</p>
<pre class="brush: python; title: ; notranslate">
 import scipy
 scipy.test()
 </pre>
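<p>Running the full test suites takes a while. For a quicker sanity check that the packages landed in the right interpreter, something like the following may help. It is only a sketch: <tt>installed_version</tt> is a hypothetical helper, and it simply reports <tt>None</tt> for anything that cannot be imported.</p>
<pre class="brush: python; title: ; notranslate">
import importlib


def installed_version(name):
    """Return the package version string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    # not every module defines __version__, so fall back to a placeholder
    return getattr(module, "__version__", "unknown")


for name in ("numpy", "scipy"):
    print("%s: %s" % (name, installed_version(name)))
</pre>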
<p><strong>Install readline</strong></p>
<p>readline provides functions for completion and reading/writing of history files from the Python interpreter (up/down arrow).</p>
<pre class="brush: bash; title: ; notranslate">
 /Library/Frameworks/Python.framework/Versions/2.7/bin/easy_install readline
 </pre>
<p> <strong>Install IPython</strong></p>
<p>IPython provides an enhanced command line interface to the Python interpreter. It is much more pleasant to work with than the standard command line interface.</p>
<pre class="brush: bash; title: ; notranslate">
 /Library/Frameworks/Python.framework/Versions/2.7/bin/easy_install ipython
 </pre>
<p><strong>Install wxPython</strong></p>
<p>wxPython is a graphical user interface toolkit for Python. Up until recently, this was the blocker in the process &#8212; there was no 64-bit version of wxPython for Mac. Download the DMG <a href="http://downloads.sourceforge.net/project/wxpython/wxPython/2.9.1.1/wxPython2.9-osx-2.9.1.1-cocoa-py2.7.dmg">here</a> and double click the installer. Versions for Cocoa and Carbon are provided, but Python 2.7.1 apparently requires the Cocoa version.</p>
<p><strong>Install matplotlib<br />
 </strong></p>
<p>matplotlib provides the plotting interface which is crucial to SciPy. Installing matplotlib is always the most complicated part of the process.</p>
<pre class="brush: bash; title: ; notranslate">
 wget http://downloads.sourceforge.net/project/matplotlib/matplotlib/matplotlib-1.0.1/matplotlib-1.0.1.tar.gz
 tar xzvf matplotlib-1.0.1.tar.gz
 cd matplotlib-1.0.1
 </pre>
<p>Then, open the file <tt>make.osx</tt>, find the lines that look like the ones below, and set the versions as shown:</p>
<pre class="brush: bash; title: ; notranslate">
 PYVERSION=2.7
 ZLIBVERSION=1.2.5
 PNGVERSION=1.4.5
 FREETYPEVERSION=2.4.4
 </pre>
<p>Finally, delete line 63. Line 63 looks like:</p>
<pre class="brush: bash; title: ; notranslate">
 cp .libs/libpng.a . &amp;&amp;\
 </pre>
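<p>The version edits to <tt>make.osx</tt> can also be scripted rather than made by hand. The sketch below is an illustration only: <tt>set_make_vars</tt> is a hypothetical helper, the old values in <tt>sample</tt> are invented, and it does not attempt the line 63 deletion, which is safer to do by eye.</p>
<pre class="brush: python; title: ; notranslate">
import re


def set_make_vars(text, updates):
    """Return text with each NAME=value line rewritten from updates."""
    for name, value in updates.items():
        # match a whole line of the form NAME=anything
        pattern = re.compile(r"^%s\s*=.*$" % re.escape(name), re.MULTILINE)
        replacement = "%s=%s" % (name, value)
        text = pattern.sub(lambda match: replacement, text)
    return text


# invented stand-in for the top of make.osx
sample = "PYVERSION=2.6\nZLIBVERSION=1.2.3\nCFLAGS=-O2\n"
wanted = {"PYVERSION": "2.7", "ZLIBVERSION": "1.2.5"}
print(set_make_vars(sample, wanted))
</pre>
<p>To apply it for real, read <tt>make.osx</tt> into a string, pass in all four version variables, and write the result back after keeping a copy of the original file.</p>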
<p><strong>Within the <tt>matplotlib-1.0.1</tt> directory, download the following archives. Do not extract them!</strong></p>
<ol>
<li><a href="http://zlib.net/zlib-1.2.5.tar.gz">zlib 1.2.5</a></li>
<li><a href="http://downloads.sourceforge.net/project/libpng/libpng14/older-releases/1.4.5/libpng-1.4.5.tar.gz">libpng 1.4.5</a></li>
<li><a href="http://download.savannah.gnu.org/releases/freetype/freetype-2.4.4.tar.bz2">freetype 2.4.4</a></li>
</ol>
<p>Or execute the following commands <strong>within </strong>the matplotlib-1.0.1 directory.</p>
<pre class="brush: bash; title: ; notranslate">
 wget http://zlib.net/zlib-1.2.5.tar.gz
 wget http://downloads.sourceforge.net/project/libpng/libpng14/older-releases/1.4.5/libpng-1.4.5.tar.gz
 wget http://download.savannah.gnu.org/releases/freetype/freetype-2.4.4.tar.bz2
 </pre>
<p>Finally, install <tt>matplotlib-1.0.1</tt> using the following command</p>
<pre class="brush: bash; title: ; notranslate">
 PREFIX=$PYTHONROOT make -f make.osx deps mpl_install
 </pre>
<p>where PYTHONROOT is the directory containing the package tree. On my system, it is <tt>/Library/Frameworks/Python.framework/Versions/2.7/</tt>.</p>
<p>Finally, enter Python and test matplotlib.</p>
<pre class="brush: python; title: ; notranslate">
 import matplotlib.pyplot as pyplot
 pyplot.plot([1,2,3])
 </pre>
<p>If installation was successful, a plot like the following should appear.</p>
<p><img style="display: block; margin-left: auto; margin-right: auto;" src="http://www.bytemining.com/wp-content/uploads/2011/03/matplotlib.png" alt="" width="400" height="300" /></p>
<p>Hopefully this is helpful to someone, and here&#8217;s to hoping a SciPy Superpack for Python 2.7.1 will be released soon!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bytemining.com/2011/03/instructions-for-installing-64bit-scipy-python-2-7-1-on-macos-x-10-6/feed/</wfw:commentRss>
		<slash:comments>15</slash:comments>
		</item>
		<item>
		<title>My First Few Days with RStudio</title>
		<link>http://www.bytemining.com/2011/03/my-first-few-days-with-rstudio/</link>
		<comments>http://www.bytemining.com/2011/03/my-first-few-days-with-rstudio/#comments</comments>
		<pubDate>Wed, 09 Mar 2011 06:08:24 +0000</pubDate>
		<dc:creator>Ryan</dc:creator>
				<category><![CDATA[R]]></category>
		<category><![CDATA[Statistics and Statistical Computing]]></category>

		<guid isPermaLink="false">http://www.bytemining.com/?p=532</guid>
		<description><![CDATA[<p>As most readers are probably aware, the free IDE for R, called RStudio, was recently released for general use and it immediately made huge waves within the R community. IDE stands for Integrated Development Environment. IDEs typically provide a rich set of tools for developing in some target language. For standard programming languages like C++ (Visual Studio) and Java (Eclipse or NetBeans), IDEs contain:</p>

<ul>
<li>an editor tailored to the target language. The editor typically has tab/auto-complete for variable names, functions and class methods and properties and also features syntax highlighting.</li>
<li>a multiple document interface (MDI) where there may be several documents opened in different tabs.</li>
<li>a window that interacts with the compiler, or a panel containing the console to the language, a la MATLAB, and even vanilla R&#8217;s GUI.</li>
<li>a debugger</li>
<li>a file browser and language reference.</li>
</ul>

<p>RStudio plays to this analogy very well, and makes modifications where appropriate. RStudio provides many features that are lacking in the standard R GUI, and improves on features that do not work properly in the Windows R GUI. Over the past few days, I have been doing all of my R analysis within RStudio, briefly with the Desktop version, and mostly with the Server version. I will discuss mostly the server version [...]]]></description>
			<content:encoded><![CDATA[<p>As most readers are probably aware, the free IDE for <a href="http://www.r-project.org">R</a>, called <a href="http://www.rstudio.org">RStudio</a>, was recently released for general use and it immediately made huge waves within the R community. IDE stands for <a href="http://en.wikipedia.org/wiki/Integrated_development_environment">Integrated Development Environment</a>. IDEs typically provide a rich set of tools for developing in some target language. For standard programming languages like C++ (Visual Studio) and Java (Eclipse or NetBeans), IDEs contain:</p>
<ul>
<li>an editor tailored to the target language. The editor typically has tab/auto-complete for variable names, functions and class methods and properties and also features <a href="http://en.wikipedia.org/wiki/Syntax_highlighting">syntax highlighting</a>.</li>
<li>a <a href="http://en.wikipedia.org/wiki/Multiple_document_interface">multiple document interface (MDI)</a> where there may be several documents opened in different tabs.</li>
<li>a window that interacts with the compiler, or a panel containing the console to the language, a la MATLAB, and even vanilla R&#8217;s GUI.
</li>
<li>a <a href="http://en.wikipedia.org/wiki/Debugger">debugger</a></li>
<li>a file browser and language reference.</li>
</ul>
<p>RStudio plays to this analogy very well, and makes modifications where appropriate. RStudio provides many features that are lacking in the standard R GUI, and improves on features that do not work properly in the Windows R GUI. Over the past few days, I have been doing all of my R analysis within RStudio, briefly with the <a href="http://www.rstudio.org/download/desktop">Desktop version</a>, and mostly with the <a href="http://www.rstudio.org/download/server">Server version</a>. I will discuss mostly the server version since that is what I have been using. It is identical (AFAIK) to the desktop version, so you are not missing anything by using either version.</p>
<h3>RStudio Server</h3>
<div class="smallbq">The biggest win for me with RStudio is the Server edition. I can  access my work on any system that can communicate with  the server. The  interface always looks the same, and all I need is a  web browser to  access it. Before RStudio Server Edition, I had to run  two versions of R: R GUI on my local machine for graphics and  presentation, and a headless R on a research server for processing,  where the server contained my data and the rest of my workflow. <strong>I no longer need to run multiple versions of R in my workflow.</strong></div>
<p></p>
<p>First, installation is miraculously easy<em>. </em>I only had a few very minor glitches to deal with. Armed with <tt>sudo</tt> access to a machine on a research cluster at work, I was able to simply download the RPM and install it using the <a href="http://www.rstudio.org/download/server">instructions provided on the web site</a>. Then, all I had to do was fire up a browser and go to</p>
<p><tt>http://servername.com:8787</tt></p>
<p>and I was asked for my login credentials. But I couldn&#8217;t get in. This server authenticates using LDAP, but all I had to do was replace the contents of <tt>/etc/pam.d/rstudio</tt> with the contents of <tt>/etc/pam.d/login</tt> and I was able to log in. But then there was an &#8220;unknown error.&#8221; Oh, the version of R that was installed was too old (2.8). I just did a <tt>yum upgrade R</tt>, and RStudio logged me in with no problems. What showed up on my browser screen was beautiful! It looked <em>identical </em>to the desktop version of RStudio.</p>
<div align="center"><b><img src="http://www.bytemining.com/wp-content/uploads/2011/03/rstudio.png" alt="" height="65%" width="65%" /><br />
</b></div>
<p>
Once logged in, I somehow have access to ALL of my files on the <em>remote server</em>. I can load my data (typically produced by <a href="http://hadoop.apache.org">Hadoop</a>) already residing on the server, and I can save output, graphs, data and even the R session itself <em>on the remote server! </em>All while just clicking buttons. No commands to remember, no screwed up PDF files, and most importantly&#8230;. no <tt>scp</tt>ing files back and forth from the server just to create a plot (X worked well, but had limitations)!</p>
<p><b>Things I Love about RStudio</p>
<p></b>I will have to go panel by panel, but even then I will have missed cool features. I also will not discuss features that are already present in the MacOS X R GUI and are repeated and beautified in RStudio:<br />
<em>The R command prompt still looks the same. </em>At first, my reaction was &#8220;Damn, what am I supposed to do?&#8221; But when the GUI finished loading, the familiar R command prompt appeared in all its 1970ish glory. I immediately started typing commands and seeing fields in the other panes populate and change to display different usages. It left me with an &#8220;oh, I see&#8221; feeling. </p>
<p><em>Saves R sessions correctly</em>, and when I return to RStudio, ALL of my work is there! I could never get the save session/image function to work in R GUI. I gave up several years ago. In RStudio, it works properly, but you don&#8217;t even need it because&#8230; when you leave RStudio and then return, everything is there! The workspace (variables, functions, data, etc), the scripts you were working on, the plots, even the last dang help screen you looked at!</p>
<p><em>The Stop Execution button in the console actually works. </em>When executing a long running computation in R GUI (that&#8217;s the first mistake), it is sometimes necessary to cancel the computation either because I made an error, or because the computation is killing my system&#8217;s performance. In R GUI, particularly on MacOS X, the Stop Execution button did absolutely nothing, because there was typically a spinning beachball preventing me from clicking it. Hitting ESC also did not work. In RStudio, clicking Stop actually seems to break out of the madness.</p>
<div style="vertical-align: middle;">
<img src="http://www.bytemining.com/wp-content/uploads/2011/03/workspace.png" alt="" class="rfloat" height="137" width="371" /><em>Workspace panel.</em> The workspace panel displays the variables, functions, data frames and other objects that reside in the current workspace, a la MATLAB. From this panel, one can also switch or save workspaces. The user can also import a dataset from a text file using a trivial wizard (a la SPSS, etc.), or from a web URL. The user can also clear the workspace. A frequently overlooked command to do the same from the command line is rm(list=ls()), but that is no longer necessary to remember! </p>
<p>Clicking on a data frame object in the workspace pane, causes it to be displayed in a nice tabular format. It can also be printed to a local printer, or opened in a new window.</p>
<p>Clicking on a numerical value allows the user to change it by opening an in-place edit box. Clicking on other objects like lists, vectors and functions opens an edit window displaying the definition that created it.
</div>
<p><b></p>
<p><br style="clear: both;" /></p>
<p></b>
<div style="vertical-align: middle;">
<img src="http://www.bytemining.com/wp-content/uploads/2011/03/files.png" alt="" class="lfloat" height="140" width="459" /><br />
<em>Files panel.</em> There is nothing really exciting to see here, <strong>except that by clicking the Upload button, I can upload files directly to the <em>remote server </em>just by selecting the file, <em>without </em>having to SCP!</strong>
</div>
<p><br style="clear: both;" /></p>
<div style="vertical-align: middle;">
<img src="http://www.bytemining.com/wp-content/uploads/2011/03/script.png" alt="" class="rfloat" height="201" width="456" /><em>Scripting panel.</em> This is the second best feature of RStudio and has the same feeling as the stock script editor that ships with R. The largest difference is that the editor in RStudio is stable. On MacOS X, the R GUI editor tends to garble 2-3 rows of code together on every single scroll. The RStudio editor also does a better job of indentation than R GUI. When opening a function, R GUI tends to indent the body properly, but insert a closing } prematurely. RStudio&#8217;s editor also features auto-completion, a feature present in the command-line of R GUI and R, but not in the editor of R GUI. The user can also save their script <em>on the remote server</em>, print code to a local printer and search. Similar to MATLAB, the user can select one or more lines of code and run them by clicking the &#8220;Run Line(s)&#8221; button, rather than having to copy and paste lines. &#8220;Run All&#8221; is a point-and-click replacement for source. </p>
<p>The &#8220;Source on Save&#8221; function is interesting. If enabled, RStudio will run/source the script each time the script is saved. Honestly, I do not find this feature to be all that useful unless in the middle of debugging, and dangerous if not debugging. Suppose after a long 10-fold-cross-validation computation there is an error that we want to fix. We fix the error and save the script. Do we really want to run the computation again? If R were a compiled language, then yes. Since R is not a compiled language, this feature is not entirely useful in concept.</p>
<p>The &#8220;magic wand&#8221; icon contains what I suspect to be a growing collection of coding tools. Currently, the user can comment and uncomment a bunch of lines at once. This is particularly useful since, for some reason, there is no multiline comment flag in R. The user can also select a series of lines and wrap a function around them. This feature could be dangerous for those not familiar with coding but provides a very nice way to put a bunch of code into a function as an afterthought.
</div>
<p><br style="clear: both;" /></p>
<div style="vertical-align: middle;">
<img src="http://www.bytemining.com/wp-content/uploads/2011/03/plots.png" alt="" class="lfloat" height="300px" width="433px" /><br />
<em>Plot panel. </em>By far my favorite part of RStudio is the plot panel! All plots are saved in this panel, and the user can move back and forth among plots that were already constructed. <strong>The Export button allows exporting a plot at user-defined dimensions and saving it to the local machine as a PNG, or even copying it to the local machine&#8217;s clipboard! </strong>Of course, the PDF button produces a PDF file of the plot that can be saved on the local machine. If the plots are all too much, we can click &#8220;Clear All&#8221; and start again with a clean slate.</p>
<p>But is it possible to create plots of larger size? I am sure it is, but I did not spend much time looking.
</p></div>
<p><br style="clear: both;" /></p>
<p><em>LaTeX and Sweave documents.</em> From the File menu the user can create new documents including <a href="http://www.latex-project.org/">LaTeX</a> and <a href="http://www.statistik.lmu.de/%7Eleisch/Sweave/">Sweave</a>. Unfortunately, I cannot experiment more with these features because there is something amiss in my configuration. For students and researchers, having Sweave and LaTeX integrated with RStudio is a huge, huge, huge advantage. No longer must we copy/paste among different programs. <strong>To make the integration complete, <a href="http://www.bibtex.org/">BibTeX</a>, <a href="http://asymptote.sourceforge.net/">Asymptote</a>/<a href="http://www.texample.net/tikz/">TikZ</a>/<a href="http://www.gnuplot.info/">gnuplot</a> whatever should be easily included by the user.</strong><br />
<i><br />
At any point if the user interface shows stale data, there is a  Reload button to help you out by refreshing the entire RStudio interface.</i><br />
<b><br />
Things that Need Improvement</p>
<p></b>I do not really have any complaints about RStudio, quite the opposite actually. However, there are some things that do not seem to work. I should note however, that I have not spent much (well, any) time debugging them. The developers are probably already working on some of them. Some of them are probably problems in my configuration and others are probably settings that I need to tweak.<br />
<b><br />
</b><em>No auto-completion of parentheses or quotation marks. </em>This is a bummer, but not a deal breaker. On the other hand, as you type closing marks, RStudio highlights the matching mark.<br />
<b><br />
</b><em>The dataset view needs work. </em>Columns can&#8217;t be resized. Other natural functionality that seems to be missing: column renaming (a call to names), sorting or ordering values by a column, and data manipulation (I didn&#8217;t say that). These missing features are a tad disappointing, but a hell of a lot better than displaying in the terminal.</p>
<p><em>Install packages in the packages panel does not work </em>on our server&#8217;s configuration.<br />
<b><br />
</b><em>LaTeX cannot be found</em>. Upon attempting to create a new LaTeX or Sweave document, I got a friendly notice (instead of a bizarre error message) saying that LaTeX is not installed. The problem is, it is installed and there does not seem to be anywhere in the GUI to configure its location. <strong>Additionally, some LaTeX templates would be useful.</strong></p>
<p><b><br />
</b></p>
<h3>In Conclusion&#8230;</h3>
<p><strong>My Workflow Before and After RStudio</strong></p>
<div>
Before RStudio<br />
<img src="http://www.bytemining.com/wp-content/uploads/2011/03/before.png" alt="" class="lfloat" /><br />
<br style="clear: both;" /><br />
After RStudio<br />
<br style="clear: both;" /><br />
<img src="http://www.bytemining.com/wp-content/uploads/2011/03/after.png" alt="" class="lfloat" />
</div>
<p><br style="clear: both;" /></p>
<div class="smallbq">All in all, the biggest win for me with RStudio is the Server edition. I can access my work on any system that can communicate with the server. The interface always looks the same, and all I need is a web browser to access it. I no longer need to run multiple versions of R in my workflow.</div>
<p>The developers of this open source project seemed to get it right on the first try. How the hell is that possible??? So has anyone switched from the big R to the big blue ball?</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bytemining.com/2011/03/my-first-few-days-with-rstudio/feed/</wfw:commentRss>
		<slash:comments>28</slash:comments>
		</item>
		<item>
		<title>Web Mining Pitfalls</title>
		<link>http://www.bytemining.com/2011/02/web-mining-pitfalls/</link>
		<comments>http://www.bytemining.com/2011/02/web-mining-pitfalls/#comments</comments>
		<pubDate>Thu, 03 Feb 2011 18:00:15 +0000</pubDate>
		<dc:creator>Ryan</dc:creator>
				<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Web Mining]]></category>

		<guid isPermaLink="false">http://www.bytemining.com/?p=510</guid>
		<description><![CDATA[<p>Programming defensively requires knowing the input that your code should be able to handle. Typically, the programmer may be intimately familiar with the type of data that his/her code will encounter and can perform checks and catch exceptions with respect to the format of the data.</p>
<p>Web mining requires a lot more sophistication. The programmer in many cases does not know the full formatting of the data published on a web site. Additionally, this format may change over time. There are certain standards that do apply to certain types of data on the web, but one cannot rely on web developers to follow these standards. For example, the RSS Advisory Board developed a convention for the formatting of web pages so that browsers can automatically discover the links to the site&#8217;s RSS feeds. I have found in my research that approximately 95% of my sample actually implemented this convention. Not bad, but not perfect.</p>
<p>Always Have a Plan B, C, D, &#8230;</p>
<p>One might say that 95% is good enough. I am a bit obsessive when it comes to data quality, so I wanted to extract a feed for 99% of the sites I had on my list. Also, I am always leery [...]]]></description>
			<content:encoded><![CDATA[<p>Programming defensively requires knowing the input that your code should be able to handle. Typically, the programmer may be intimately familiar with the type of data that his/her code will encounter and can perform checks and catch exceptions with respect to the format of the data.</p>
<p>Web mining requires a lot more sophistication. The programmer in many cases does not know the full formatting of the data published on a web site. Additionally, this format may change over time. There are certain standards that do apply to certain types of data on the web, but one cannot rely on web developers to follow these standards. For example, the <a href="http://www.rssboard.org">RSS Advisory Board</a> developed a <a href="http://www.rssboard.org/rss-autodiscovery">convention for the formatting of web pages so that browsers can automatically discover the links to the site&#8217;s RSS feeds</a>. I have found in my research that approximately 95% of my sample actually implemented this convention. Not bad, but not perfect.</p>
<p><strong>Always Have a Plan B, C, D, &#8230;</strong></p>
<p>One might say that 95% is good enough. I am a bit obsessive when it comes to data quality, so I wanted to extract a feed for 99% of the sites I had on my list. Also, I am always leery of bias. Could there be something special about these sites that do not implement RSS autodiscovery? Clearly, there are exceptions to my Plan A. So time to move to Plan B. I found that some of the sites in this 5% used FeedBurner to index their feeds, so Plan B was to use a regular expression to extract FeedBurner URLs. This only added another 1% (actually less than that) to my coverage.</p>
<pre class="brush: python; title: ; notranslate">((?:http|feed)://feeds\.feedburner\.com/[^&quot;'\s]+)</pre>
<p>Next, Plan C took the domain name, simply slapped /feed onto it and hoped it stuck. I called this process &#8220;feed probing&#8221; and it added the remaining 3% that I was looking for. If Plans A, B and C all failed to find a suitable RSS feed, all hope was lost and I just skipped the site (1% error).</p>
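<p>Put together, the three plans form a simple fallback chain. The code below is a sketch of that idea and not the code used in my research: it is written for Python 3 (<tt>html.parser</tt> and <tt>urllib.parse</tt>), the names <tt>discover_feed</tt> and <tt>probe</tt> are hypothetical, and the actual network check for Plan C is left to a callable supplied by the caller.</p>
<pre class="brush: python; title: ; notranslate">
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# \x3c and \x3e are the angle brackets; a URL stops at quotes, spaces or tags
FEEDBURNER = re.compile(r"(?:http|feed)://feeds\.feedburner\.com/[^\"'\s\x3c\x3e]+")


class AutodiscoveryParser(HTMLParser):
    """Collect href values of link elements that advertise an RSS feed."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").lower().split()
        kind = (attrs.get("type") or "").lower()
        href = attrs.get("href")
        if tag == "link" and "alternate" in rel and kind == "application/rss+xml" and href:
            self.feeds.append(href)


def discover_feed(html, base_url, probe=None):
    """Return a feed URL found by Plan A, B or C, or None if all fail."""
    parser = AutodiscoveryParser()
    parser.feed(html)
    if parser.feeds:  # Plan A: RSS autodiscovery
        return urljoin(base_url, parser.feeds[0])
    match = FEEDBURNER.search(html)  # Plan B: FeedBurner URL in the page
    if match:
        return match.group(0)
    candidate = urljoin(base_url, "/feed")  # Plan C: feed probing
    if probe is not None and probe(candidate):
        return candidate
    return None
</pre>
<p>Here <tt>probe</tt> would be a function that fetches the candidate URL and returns <tt>True</tt> if the response looks like a feed. Keeping it separate means the parsing logic can be tested without touching the network.</p>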
<p><em>On the other hand, there are times when it is the HTTP client or server that cannot be trusted&#8230;</em></p>
<p><strong>Common Python Exceptions in Web Mining</strong></p>
<p>It is all too common to encounter an exception while web mining or crawling. Code must handle these errors gracefully by catching exceptions or failing without aborting. One method that works well is to provide a resume mechanism that restarts execution where the code left off, rather than having to start a multiple hour/day/week job over again! Below is a taxonomy of common problems (and their Python exceptions):</p>
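<p>Before the taxonomy, here is a rough sketch of the resume mechanism just mentioned. It is one simple way to do it, under the assumption that each URL can be processed independently: finished URLs are appended to a checkpoint file, and a restarted job skips anything already listed there. The names <tt>load_done</tt>, <tt>crawl</tt> and <tt>checkpoint_path</tt> are made up for this example.</p>
<pre class="brush: python; title: ; notranslate">
import os


def load_done(path):
    """Read the set of URLs finished by an earlier run, if any."""
    if not os.path.exists(path):
        return set()
    with open(path) as handle:
        return set(line.strip() for line in handle if line.strip())


def crawl(urls, process, checkpoint_path):
    """Call process(url) for each URL not yet recorded in the checkpoint."""
    done = load_done(checkpoint_path)
    with open(checkpoint_path, "a") as log:
        for url in urls:
            if url in done:
                continue  # finished in a previous run
            try:
                process(url)
            except Exception as error:
                # fail without aborting; the URL is retried on the next run
                print("skipping %s: %s" % (url, error))
                continue
            log.write(url + "\n")
            log.flush()  # keep the checkpoint current if the job dies
            done.add(url)
    return done
</pre>
<p>Failed URLs are deliberately left out of the checkpoint so that a later run picks them up again. Whether that is the right choice depends on the error, which is what the taxonomy is for.</p>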
<p><em>HTTP Errors. </em>These occur frequently. Some are recoverable; for others, it is best to just throw out the record. The most common ones are below, but for more information, refer to <a href="http://www.w3.org/Protocols/rfc2616/rfc2616.html">RFC 2616</a>.</p>
<ul>
<li>404 Not Found: the web page you tried to download could not be found. Skip the record.</li>
<li>400 Bad Request: the server has deemed the client&#8217;s HTTP request malformed. Either retry, or double check your code!
</li>
<li>401 Unauthorized: authentication is required before proceeding. Either skip the record, or add authentication to your code.</li>
<li>403 Forbidden: you are trying to access something you are not allowed to access. This is a common HTTP error thrown when your program is being <a href="http://en.wikipedia.org/wiki/Rate_limiting"><em>rate limited</em></a>.</li>
<li>500 Internal Server Error: something on the server end is wrong. Either try again immediately, or wait a while and retry the request.</li>
<li>3xx Redirect: the web page you are trying to access has moved somewhere else. These were rare in my research, but they are common in practice. I choose to skip these sites. You may wish not to.
</li>
</ul>
<p>In Python, these can be caught as <tt>urllib2.HTTPError</tt>. It is also possible to specify actions based on the specific HTTP error code returned:</p>
<pre class="brush: python; title: ; notranslate">
import urllib2

try:
    content = urllib2.urlopen(url).read()
except urllib2.HTTPError, e:
    # e.code holds the numeric HTTP status returned by the server
    if e.code == 404:
        print &quot;Not Found&quot;
    elif e.code == 500:
        print &quot;Internal Server Error&quot;
</pre>
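<p>The list of status codes above can be condensed into a small policy function. This is only a sketch of one possible policy, assuming Python 3; the names are hypothetical, and treating 502, 503 and 504 like 500 is my own assumption.</p>

```python
RETRY, SKIP, FIX_CLIENT = "retry", "skip", "fix client"


def action_for_status(code):
    """Map an HTTP status code to a crawler action (one possible policy)."""
    if code in (500, 502, 503, 504):
        return RETRY        # server-side trouble, often transient
    if code == 400:
        return FIX_CLIENT   # malformed request: double check the code
    if code in (401, 403, 404):
        return SKIP         # needs credentials, forbidden, or missing
    if code // 100 == 3:
        return SKIP         # redirect that was not followed
    return SKIP             # anything unexpected: drop the record
```

<p>Since a 403 may mean rate limiting, a gentler policy could wait and retry instead of skipping.</p>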
<p><em>Server Errors &#8220;URLError&#8221;. </em>These occur frequently as well and denote some sort of server or connection trouble, such as &#8220;Connection refused&#8221; or a site that does not exist. Usually, these are resolved by retrying the fetch. <i>In Python, it is very important to note that <tt>HTTPError</tt> is a subclass of <tt>URLError</tt>, so when handling both exceptions distinctly, <tt>HTTPError</tt> must be caught first.</i></p>
<pre class="brush: python; title: ; notranslate">
try:
    content = urllib2.urlopen(url).read()
except urllib2.HTTPError, e:
    ...
except urllib2.URLError, f:
    print f.reason
</pre>
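<p>The ordering rule is easy to check without touching the network. The sketch below assumes Python 3, where these exceptions live in <tt>urllib.error</tt> rather than <tt>urllib2</tt>; the helper <tt>classify</tt> and the two hand-built exceptions are purely illustrative.</p>

```python
import io
import urllib.error


def classify(exc):
    """Name the handler that runs for exc, listing the subclass first."""
    try:
        raise exc
    except urllib.error.HTTPError as e:
        return "http error %d" % e.code
    except urllib.error.URLError as e:
        return "url error: %s" % e.reason


# Hand-built exceptions standing in for real failed fetches.
not_found = urllib.error.HTTPError(
    "http://example.com/", 404, "Not Found", None, io.BytesIO())
refused = urllib.error.URLError("Connection refused")
```

<p>If the two <tt>except</tt> clauses were swapped, the <tt>URLError</tt> handler would swallow the 404 and the <tt>HTTPError</tt> branch would never run.</p>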
<p><em>Other Bizarreness. </em>The web is very chaotic. Sometimes weird stuff happens. The rare, elusive <tt>httplib.BadStatusLine</tt> exception technically means that the server returned an error code that the client does not understand, but it can also be thrown <a href="http://www.voidspace.org.uk/python/articles/urllib2.shtml#badstatusline-and-httpexception">when the page being fetched is <em>blank</em></a>. On a recent project, I ran into a new one: <tt>httplib.IncompleteRead</tt> which has little documentation. <em>Both of these issues can usually be resolved by retrying the fetch.</em> Both of these pesky errors (and more) can be handled by simply catching their parent exception: <tt>httplib.HTTPException</tt>.</p>
<pre class="brush: python; title: ; notranslate">
import httplib
import urllib2

try:
    content = urllib2.urlopen(url).read()
except httplib.HTTPException:
    # you've encountered a rare beast. You win a prize!
    pass  # e.g. log it and retry the fetch
</pre>
<p><b>Everything Deserves a Second Chance</b></p>
<p>One common reaction to any error is to just throw the record out. <tt>URLError</tt>s are so common that it is probably unwise to do that if you are using the data for something. Typically, these errors go away if you try again. I use the following loop to catch errors and react appropriately.</p>
<pre class="brush: python; title: ; notranslate">
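import httplib
import urllib2

for url in urls:  # urls: the list of sites being crawled
    temp = None
    attempt = 0
    while attempt &lt; HTTP_RETRIES:
        attempt += 1
        try:
            temp = urllib2.urlopen(url).read()
            break      # success
        except urllib2.HTTPError:
            break      # HTTP error (404, 403, ...): give up on this URL
        except urllib2.URLError:
            continue   # connection trouble: try again
        except httplib.HTTPException:
            continue   # BadStatusLine, IncompleteRead, ...: try again
    if temp is None:
        continue       # nothing was fetched: move on to the next URL
    # ... process temp here ...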
</pre>
<p>This code attempts to fetch the URL at most <tt>HTTP_RETRIES</tt> times. If the fetch is successful, Python breaks out of the loop. If a <tt>URLError</tt> or <tt>HTTPException</tt> occurs, we move on to another attempt of the fetch. If we encounter an HTTP error (not found, restricted, etc.), we give up. Depending on the error, we can modify the code to retry on certain errors, and abort on others, but for my purposes, I do not care.</p>
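<p>For readers who do care, here is a sketch of a variant that retries on 5xx responses and gives up on everything else. It assumes Python 3 (<tt>urllib.request</tt>, <tt>urllib.error</tt> and <tt>http.client</tt> in place of <tt>urllib2</tt> and <tt>httplib</tt>). The function name, the <tt>delay</tt> parameter and the injectable <tt>opener</tt> argument are my own additions so that the logic can be exercised without a network.</p>

```python
import http.client
import time
import urllib.error
import urllib.request


def fetch_with_retries(url, retries=3, delay=0.0, opener=urllib.request.urlopen):
    """Return the body as bytes, or None if the URL should be skipped."""
    for attempt in range(retries):
        try:
            return opener(url).read()
        except urllib.error.HTTPError as e:
            if e.code // 100 != 5:
                return None   # 4xx and friends: retrying will not help
            # 5xx: fall through and try again
        except (urllib.error.URLError, http.client.HTTPException):
            pass              # connection trouble: worth another try
        time.sleep(delay)     # optional pause between attempts
    return None
```

<p>As before, <tt>HTTPError</tt> is listed ahead of <tt>URLError</tt> so that status codes are inspected before the generic handler gets a chance.</p>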
<p><b>The Comatose Crawler</b></p>
<p>If you have ever done a large scale crawl on a web site, you are bound to encounter a state where your crawler becomes comatose &#8211; it is running, maybe using system resources, but is not outputting anything or reporting progress. It looks like an infinite no-op loop. I have encountered this problem since I started doing web mining in 2006 and did not, until just this past weekend, realize exactly why it was happening and how to prevent it.</p>
<p>Your crawler has sunk in a swamp, and is essentially trapped. For whatever reason, the HTTP server your code is communicating with maintains an open connection, but sends no data. I suppose this could be a deadlock-type situation where the HTTP server is waiting for an additional request (?), and the crawler is waiting for output from the HTTP server. <em>It was my misunderstanding that the HTTP protocol had a built-in timeout, and I was relying on it.</em> This is apparently not the case. There is a simple way to avoid this swamp, by setting a timeout on the socket sending the HTTP request:</p>
<pre class="brush: python; title: ; notranslate">
import socket
...
HTTP_TIMEOUT = 5
socket.setdefaulttimeout(HTTP_TIMEOUT)
...
handle = urllib2.urlopen(&quot;http://www.google.com&quot;)
content = handle.read()
...
</pre>
<p>If a request to the socket goes unanswered after <tt>HTTP_TIMEOUT</tt> seconds, Python throws a <tt>urllib2.URLError</tt> exception that can be caught. In my code, I just skip these troublemakers.</p>
<pre>Traceback (most recent call last):
  File "&lt;stdin&gt;", line 1, in ?
  File "/usr/lib64/python2.4/urllib2.py", line 130, in urlopen
    return _opener.open(url, data)
  File "/usr/lib64/python2.4/urllib2.py", line 358, in open
    response = self._open(req, data)
  File "/usr/lib64/python2.4/urllib2.py", line 376, in _open
    '_open', req)
  File "/usr/lib64/python2.4/urllib2.py", line 337, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.4/urllib2.py", line 1021, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib64/python2.4/urllib2.py", line 996, in do_open
    raise URLError(err)
urllib2.URLError: &lt;urlopen error timed out&gt;
</pre>
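<p>An alternative to a process-wide default is a per-request timeout. The sketch below assumes Python 3, where <tt>urlopen</tt> accepts a <tt>timeout</tt> argument; the function name <tt>fetch_or_none</tt> is hypothetical. In Python 3 a server that goes silent after the connection is made can surface as a bare <tt>socket.timeout</tt> rather than a <tt>URLError</tt>, so both are caught.</p>

```python
import socket
import urllib.error
import urllib.request


def fetch_or_none(url, timeout_seconds=5.0):
    """Return the response body as bytes, or None if the fetch fails or stalls."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as handle:
            return handle.read()
    except (urllib.error.URLError, socket.timeout):
        # URLError covers refused connections, HTTP errors and connect timeouts;
        # socket.timeout covers a server that accepts and then says nothing.
        return None
```

<p>Returning <tt>None</tt> for every failure matches my habit of skipping the troublemakers; a more careful crawler would keep the exception around for logging.</p>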
<p>With enough experience, dedication, blood, sweat, tears, and caffeine, data mining the jungle known as the World Wide Web becomes both simple and fun. Happy web mining!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bytemining.com/2011/02/web-mining-pitfalls/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>40 Fascinating Blogs for the Ultimate Statistics Geek!</title>
		<link>http://www.bytemining.com/2011/01/40-fascinating-blogs-for-the-ultimate-statistics-geek/</link>
		<comments>http://www.bytemining.com/2011/01/40-fascinating-blogs-for-the-ultimate-statistics-geek/#comments</comments>
		<pubDate>Thu, 20 Jan 2011 08:00:28 +0000</pubDate>
		<dc:creator>Ryan</dc:creator>
				<category><![CDATA[R]]></category>
		<category><![CDATA[Statistics and Statistical Computing]]></category>

		<guid isPermaLink="false">http://www.bytemining.com/?p=503</guid>
		<description><![CDATA[<p>I am happy to report that ByteMining is listed on &#8220;40 Fascinating Blogs for the Ultimate Statistics Geek&#8221;!</p>
<p>Some of the ones that I frequently read, or that are written by Twitter friends/followers (in no particular order):</p>

R-bloggers, an aggregate site containing blog posts tagged as posts about R. High quality content. 
Statistical modeling, causal inference and social science. This one is a no-brainer, as it is the blog for Andrew Gelman&#8217;s group.
FlowingData by Nathan Yau (@flowingdata), fellow Statistics Ph.D. student at UCLA. Focuses on the data and information visualization side of Data Science.
dataists by Hilary Mason (@hmason, bit.ly), Vince Buffalo (@vsbuffalo, UC Davis),
Drew Conway (@drewconway, NYU), Mike Dewar (@mikedewar, Columbia),
John Myles White (@johnmyleswhite, Princeton) and others.
A new blog on several aspects of Data Science including Data Mining, visualization and uses of Statistics in current events. Heavy use of R and ggplot2.
Revolutions by Revolution Analytics provides a variety of content around R, Data Science and Statistics in general.
FiveThirtyEight by Nate Silver shares sophisticated modeling and analysis of elections and government happenings. It is in a different realm, as it attracts political news junkies (and the occasional extremist) rather than just Statisticians.
LoveStats by Annie Pettit, Ph.D. (@LoveStats) discusses Statistics as used in Social [...]]]></description>
			<content:encoded><![CDATA[<p>I am happy to report that ByteMining is listed on &#8220;<a href="http://www.bschool.com/blog/2011/40-fascinating-blogs-for-the-ultimate-statistics-geek/">40 Fascinating Blogs for the Ultimate Statistics Geek</a>&#8221;!</p>
<p>Some of the ones that I frequently read, or that are written by Twitter friends/followers (in no particular order):</p>
<ul>
<li><a href="http://www.r-bloggers.com/">R-bloggers</a>, an aggregate site containing blog posts tagged as posts about R. High quality content. </li>
<li><a href="http://www.stat.columbia.edu/%7Egelman/blog/">Statistical modeling, causal inference and social science</a>. This one is a no-brainer, as it is the blog for <a href="http://www.stat.columbia.edu/%7Egelman/">Andrew Gelman</a>&#8217;s group.</li>
<li><a href="http://www.flowingdata.com">FlowingData</a> by Nathan Yau (<a href="http://twitter.com/flowingdata">@flowingdata</a>), fellow Statistics Ph.D. student at UCLA. Focuses on the data and information visualization side of Data Science.</li>
<li><a href="http://www.dataists.com">dataists</a> by Hilary Mason (<a href="http://twitter.com/hmason">@hmason</a>, <a href="http://bit.ly">bit.ly</a>), <a href="http://www.vincebuffalo.com/">Vince Buffalo</a> (<a href="http://twitter.com/vsbuffalo">@vsbuffalo</a>, UC Davis),<br />
<a href="http://www.drewconway.com/zia/">Drew Conway</a> (<a href="http://twitter.com/drewconway">@drewconway</a>, NYU), <a href="http://www.mikedewar.org">Mike Dewar</a> (<a href="http://twitter.com/mikedewar">@mikedewar</a>, Columbia),<br />
<a href="http://www.johnmyleswhite.com/">John Myles White</a> (<a href="http://twitter.com/johnmyleswhite">@johnmyleswhite</a>, Princeton) and others.<br />
A new blog on several aspects of Data Science including Data Mining, visualization and uses of Statistics in current events. Heavy use of R and <a href="http://had.co.nz/ggplot2/">ggplot2</a>.</li>
<li><a href="http://blog.revolutionanalytics.com/">Revolutions</a> by <a href="http://www.revolutionanalytics.com/">Revolution Analytics</a> provides a variety of content around R, Data Science and Statistics in general.</li>
<li><a href="http://fivethirtyeight.blogs.nytimes.com/">FiveThirtyEight</a> by <a href="http://en.wikipedia.org/wiki/Nate_Silver">Nate Silver</a> shares sophisticated modeling and analysis of elections and government happenings. It is in a different realm, as it attracts political news junkies (and the occasional extremist) rather than just Statisticians.</li>
<li><a href="http://lovestats.wordpress.com/">LoveStats</a> by Annie Pettit, Ph.D. (<a href="http://twitter.com/LoveStats">@LoveStats</a>) discusses Statistics as used in Social Media and Market Research.</li>
<li><a href="http://www.johndcook.com/blog/">The Endeavor</a>, by <a href="http://www.johndcook.com/">John D. Cook</a> (<a href="http://twitter.com/johndcook">@johndcook</a>) is more about Mathematics than Statistics, but he posts great stuff including math trivia, hobbyist Mathematics, and philosophy.</li>
</ul>
<p>You can see the full list of all 40 blogs <a href="http://www.bschool.com/blog/2011/40-fascinating-blogs-for-the-ultimate-statistics-geek/">here</a>!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bytemining.com/2011/01/40-fascinating-blogs-for-the-ultimate-statistics-geek/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Merry Christmas from Byte Mining!</title>
		<link>http://www.bytemining.com/2010/12/merry-christmas-from-byte-mining/</link>
		<comments>http://www.bytemining.com/2010/12/merry-christmas-from-byte-mining/#comments</comments>
		<pubDate>Fri, 24 Dec 2010 19:57:03 +0000</pubDate>
		<dc:creator>Ryan</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.bytemining.com/2010/12/merry-christmas-from-byte-mining/</guid>
		<description><![CDATA[
To all of my readers and followers, I wish you a very Merry Christmas and a very joyous and safe Happy New Year! I will be spending the holidays coding on my new Motorola Droid X (goodbye AT&#38;T!).


]]></description>
			<content:encoded><![CDATA[<div align="center"><img src="http://www.bytemining.com/wp-content/uploads/2010/12/merry_christmas.jpg" alt="" />
<div align="left">To all of my readers and followers, I wish you a very Merry Christmas and a very joyous and safe Happy New Year! I will be spending the holidays coding on my new Motorola Droid X (goodbye AT&amp;T!).
</div>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.bytemining.com/2010/12/merry-christmas-from-byte-mining/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Some Lessons in Production Development (Hadoop) &#8211; Part 1</title>
		<link>http://www.bytemining.com/2010/12/some-lessons-in-production-development-hadoop-part-1/</link>
		<comments>http://www.bytemining.com/2010/12/some-lessons-in-production-development-hadoop-part-1/#comments</comments>
		<pubDate>Fri, 17 Dec 2010 22:42:40 +0000</pubDate>
		<dc:creator>Ryan</dc:creator>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Programming Languages]]></category>

		<guid isPermaLink="false">http://www.bytemining.com/?p=476</guid>
		<description><![CDATA[<p>Wow. I can&#8217;t believe it has been a month since I have posted. On December 1, I started a new chapter in my life, working full time as a Data Scientist at the Rubicon Project. Needless to say, that has been keeping me occupied, as well as thinking about working on my dissertation. For the time being, I am getting settled in here.</p>
<p>When I accepted this position, one of my hopes/expectations was to become professionally competent and confident in C, Java, Python, Hadoop, and the software development process rather than relying on hobby and academic knowledge. That is something a degree cannot help with. It has been a great experience, although very frustrating, but that is expected when jumping into development professionally.
&#160;&#160;&#160;&#160; </p>
<p>I am writing this post to chronicle what I have learned about using Hadoop in production and how it majorly differs from its use in my research and personal analysis.
To start, I was asked to check out a huge stack of code from a Subversion repository. But then what?</p>
<p>But you&#8217;re a Computer Scientist! This should be easy!</p>
<p>The first part is true, but there is a stark difference between a garden variety computer scientist and one that converts from [...]]]></description>
			<content:encoded><![CDATA[<p>Wow. I can&#8217;t believe it has been a month since I have posted. On December 1, I started a new chapter in my life, working full time as a Data Scientist at <a href="http://www.rubiconproject.com">the Rubicon Project</a>. Needless to say, that has been keeping me occupied, as well as thinking about working on my dissertation. For the time being, I am getting settled in here.</p>
<p>When I accepted this position, one of my hopes/expectations was to become professionally competent and confident in C, Java, Python, Hadoop, and the software development process rather than relying on hobby and academic knowledge. That is something a degree cannot help with. It has been a great experience, although very frustrating, but that is expected when jumping into development professionally.<br />
<img src="http://www.bytemining.com/wp-content/uploads/2010/12/java.png" alt="" height="91" width="50" />&nbsp;&nbsp;&nbsp;&nbsp; <img src="http://www.bytemining.com/wp-content/uploads/2010/12/hadoop.jpg" alt="" /></p>
<p>I am writing this post to chronicle what I have learned about using Hadoop in production and how it majorly differs from its use in my research and personal analysis.<br />
To start, I was asked to check out a huge stack of code from a Subversion repository. But then what?</p>
<p><strong>But you&#8217;re a Computer Scientist! This should be easy!</strong></p>
<p>The first part is true, but there is a stark difference between a garden variety computer scientist and one that converts from another field. Homegrown developers typically have an undergraduate degree in Computer Science (mine is close, but not purely CS). They have a strict and challenging curriculum of coursework, and their lives are peppered with summer internships. In these experiences, they are seasoned with professional work experience that they were not expected to have beforehand. Once they graduate, they have the skills necessary to work as a software engineer.</p>
<p>&#8220;Converts&#8221; typically have a love for Computer Science, but hold undergraduate degrees in related fields. Mine was Mathematics of Computation and Statistics. The Statistics B.S. was totally irrelevant. The Math of Computation B.S. was relevant in discovering my true interest; however, my life was consumed with writing proofs and solving puzzles. I only programmed for fun and to solve problems that I faced in daily life. Shortly after starting a Ph.D. in Statistics, I discovered that I wanted a career change to Computer Science and software engineering. Although I was subjected to the same curriculum as the undergraduates, it was rushed and did not provide the full experience the college CS majors received. Although I am book smart in the field, I come to the development world either being expected to know the ins and outs of engineering, or expected to pick it up quickly without the pampering. Learning engineering this way is fun, and not so mundane, but more frustrating.</p>
<p><strong>But I thought you already knew Java and Hadoop?</strong></p>
<p>Well, that is all relatively speaking! The Java code I have written was not for Hadoop. The code that I write for my own research and hobby is much different from the code I am expected to write in production. In my research I always used <a href="http://www.python.org">Python</a> (except in a few instances where I used Java) and the <a href="http://hadoop.apache.org/common/docs/r0.15.2/streaming.html">Hadoop Streaming</a> package. Although Streaming is a great package, I feel it lacks the customization that a standard Hadoop job written in Java has. It also lacks the &#8220;meat&#8221; that the vanilla Hadoop distribution enjoys in the professional world. There are also performance gains from using vanilla Hadoop, as both the code and the framework are written in the same language and there is no interface among languages to deal with.</p>
<p>Production code, however, comes with a different set of demands:</p>
<ol>
<li>My code must execute successfully on tens and hundreds of thousands of files, not just some manageable subset that I have created.
</li>
<li>I did not write the code that created these files, so there are bugs and intricacies that I must defensively program against.</li>
<li>These files are created in real time, on the fly, as real events are occurring. Stuff happens. Sometimes it&#8217;s not good stuff, and code must be able to work with (or toss out) that data. Data for my own research has been massaged and curated by moi before running in Hadoop. In production, this is not realistic.</li>
<li>The code must be efficient. These jobs must not take forever to run, because time can equal money: either wasted CPU cycles that others could use, or time in the cloud.</li>
<li>Development time must be used to integrate existing code rather than reinventing the wheel. My coworkers have already written a lot of code. I have been learning how to integrate this code into my own.
</li>
<li>The developer must not go rogue with code and must code carefully. Crashing the cluster or maxing out the hard drive can have dire consequences.
</li>
</ol>
<p><strong>Tip 0: Nobody&#8217;s Perfect</strong></p>
<p><em>0.1: Prepare to be Frustrated</em>. I cannot stress this enough. We all like to start new things thinking that we know close to everything we need to know to get started. Not true. Unless you literally have somebody sitting next to you whose job it is to train you, expect to do a lot of self learning and to be very frustrated. So, don&#8217;t get down on yourself, get plenty of sleep, eat, and drink plenty (after hours) if necessary.</p>
<p><em>0.2: Java Development can be Intimidating at First</em>. To use somebody else&#8217;s Java code, just point the <tt>CLASSPATH</tt> to the <tt>src</tt> directory containing the package. If the source code is packaged correctly, this should work fine. The <tt>CLASSPATH</tt> can be set in your <tt>.bashrc</tt> file, or you can pass it using the <tt>-cp</tt> flag. Note that you must pass the <tt>CLASSPATH</tt> both to the compiler <tt>javac</tt> as well as the runtime <tt>java</tt>. This gets old really fast&#8230;more on this later.</p>
<p>When using a JAR file containing an archive of a package, note that each source file starts with a line beginning with <tt>package</tt>. This is essentially an &#8220;address&#8221;: together with the class name, it forms the fully qualified name used to refer to that class, and it mirrors the directory layout inside the JAR file, which makes it a good way to index the JAR&#8217;s content.</p>
<p><em>0.3: Know your Limitations</em>. A coworker would frequently chime in &#8220;use Eclipse!&#8221; Time and time again I have tried to use Eclipse and it makes me want to cry. Eclipse introduces its own level of complexity and customization to the build process that feels icky to me.</p>
<p>I have found that <tt>vim</tt> and <tt>ant</tt> have proven valuable for the time being. Adding an IDE just potentially adds another layer of misery to the learning process. The only IDE I ever recommend is Visual C++, and I am not even a Microsoft guy.</p>
<p><em>0.4: Make Subversion Play Nice</em>. I did not realize that it is not necessary to check out the entire project. There is nothing wrong with checking out only the trunk directory. You can also give it a new name, so instead of having some meaningless directory called <tt>trunk</tt>, I can call it <tt>ryans_first_project</tt>:</p>
<p><tt>svn co http://svn.mydomain.com/svn/project/trunk ryans_first_project</tt></p>
<p><strong>Tip 1: Partition your Data</strong></p>
<p>Suppose we have <em>n</em> mappers and more than one of these mappers sees a key <tt>cat</tt>. We want to make sure that each instance of <tt>cat</tt> gets sent to the same reducer. One way to do this is to use a <tt>Partitioner</tt>. An ad-hoc way to do this is to just hash the key and get the remainder after dividing by some number <em>k</em>. This <em>might</em> also provide a way of controlling how many reduce tasks are spawned (I am not positive about that though).</p>
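<p>The hash-and-remainder idea fits in a few lines. The sketch below is in Python rather than Java, and the function name <tt>partition</tt> is hypothetical. As far as I know, Hadoop&#8217;s default <tt>HashPartitioner</tt> does essentially the same thing with the key&#8217;s <tt>hashCode()</tt> and the job&#8217;s configured number of reduce tasks, so the reducer count comes from the job configuration rather than from the partitioner itself.</p>

```python
import zlib


def partition(key, num_reducers):
    """Pick a reducer index for key; equal keys always get the same index."""
    # crc32 is stable across processes, unlike Python's built-in hash() on
    # strings, which is randomized per interpreter run.
    return zlib.crc32(key.encode("utf-8")) % num_reducers
```

<p>Because the index depends only on the key, every mapper that sees <tt>cat</tt> sends it to the same reducer without any coordination.</p>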
<p><strong>Tip 2: The Key in the Mapper is Useless</strong></p>
<p>If you use <tt>TextInputFormat</tt>, know that lines of text come into the mapper in key/value pairs and also leave the mapper in key/value pairs. The key coming INTO the mapper will be of this weird type <tt>LongWritable</tt>. <strong>It is useless.</strong> It is just the byte offset of the line in the file. What we really want to parse is the <strong>value</strong> coming into the mapper, and we emit the key and value to the next phase.</p>
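<p>To see why the incoming key carries so little information, here is a toy imitation in Python of what <tt>TextInputFormat</tt> hands to a mapper. Both function names are hypothetical and this is not Hadoop code, just a sketch of the (byte offset, line) pairs and a mapper that ignores the offset.</p>

```python
def text_input_records(data):
    """Mimic TextInputFormat: yield (byte offset, line) for each line of bytes."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)


def word_count_mapper(offset, line):
    """Ignore the incoming key (the offset); emit (word, 1) from the value."""
    for word in line.split():
        yield word, 1
```

<p>The offset only says where the line started in the file; everything worth parsing is in the value.</p>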
<p><strong>Tip 3: Use ant (or maven) to Configure and Build your Jobs</strong></p>
<p>As I said, passing <tt>CLASSPATH</tt>s around <tt>.bashrc</tt> and on the command-line is a clusterf*ck. In my case, the code stack I am working with has a <tt>build.xml</tt> file. When I write source code, I don&#8217;t need to do anything, <tt>ant</tt> knows to compile the file because the <tt>build.xml</tt> file contains instructions to compile it.</p>
<p>Also, all and any libraries I need I just dump into a <tt>lib</tt> directory, and by simply adding the name of the library file to <tt>build.xml</tt> (it is obvious where to put it), it is automatically added to the <tt>CLASSPATH</tt> at compile time. To build the project, all I type is</p>
<p><tt>ant some_target</tt></p>
<p>and it spits out a cute little JAR file in the target directory, ready for me to use with Hadoop. Of course, this process actually builds the <em>entire </em>project, not just my code, but it only takes 10 seconds or so to build.</p>
<p><strong>Tip 4: <tt>input_dir</tt> and <tt>output_dir</tt> are just Parameters</strong></p>
<p>The Hadoop command given in tutorials usually has the following form</p>
<p><tt>hadoop jar somejarfile.jar a.class.name input output</tt></p>
<p><tt>input</tt> and <tt>output</tt> are not &#8220;set&#8221; parameters; they are just plain old parameters that you can read in your Java program from the <tt>String[]</tt> array passed to <tt>main</tt> (Java&#8217;s counterpart to C&#8217;s <tt>argv</tt>). If you want to pass in 10 directories, you can do that!</p>
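<p>As a Python analogy for what a Java driver could do with its arguments, the sketch below treats the last argument as the output directory and everything before it as inputs. That convention and the name <tt>split_paths</tt> are my own assumptions, not anything Hadoop imposes.</p>

```python
def split_paths(args):
    """Treat the last argument as the output directory and the rest as inputs."""
    if len(args) < 2:
        raise ValueError("need at least one input directory and an output directory")
    return list(args[:-1]), args[-1]
```

<p>Called with <tt>sys.argv[1:]</tt>, this accepts one input directory or ten without any change to the code.</p>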
<p><strong>Tip 5: The JAR file is the Key to Success</strong></p>
<p>Once you have a JAR file built by hand or by ant, all you need to do is move that baby around to wherever you want to run the Hadoop job. Of course, this assumes that Hadoop is installed on the machines you want to use. Then, with one file, running the job is as simple as:</p>
<p><tt>hadoop jar myjarfile.jar com.bytemining.my.package.name input_dir output_dir</tt></p>
<p><strong>Tip 6: When it Gets to be Stressful, It&#8217;s Nothing a Little Ping Pong Can&#8217;t Fix</strong></p>
<p>I&#8217;ve only been at this for 2 weeks&#8230; there will undoubtedly be a part 2.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bytemining.com/2010/12/some-lessons-in-production-development-hadoop-part-1/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>My Day at ACM Data Mining Camp III</title>
		<link>http://www.bytemining.com/2010/11/my-day-at-acm-data-mining-camp-iii-dmcamp/</link>
		<comments>http://www.bytemining.com/2010/11/my-day-at-acm-data-mining-camp-iii-dmcamp/#comments</comments>
		<pubDate>Sat, 13 Nov 2010 19:16:44 +0000</pubDate>
		<dc:creator>Ryan</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.bytemining.com/?p=458</guid>
		<description><![CDATA[<p>My first time at ACM Data Mining Camp was so awesome that I was thrilled to make the trip up to San Jose for the November 2010 version. In July, I gave a talk at the Emerging Technologies for Online Learning Symposium conference with a faculty member in the Department of Statistics, at the Fairmont. The place was amazing, and I told myself I would save up to stay there. This trip gave me an opportunity to check it out, and pretend that I am posh for a weekend  . The night I arrived I had the best dinner and drinks at this place called Gordon Biersch. I had the best garlic fries and BBQ burger I have ever had. I ate it with a Dragonfruit Strawberry Mojito, the Barbados Rum Runner, and finished off with a Long Island Iced Tea, so the drinks were awesome as well. Anyway, to the point of this post&#8230;</p>
<p>The next morning I made the short trek to the PayPal headquarters for a very long 9am-8pm day. Since I came up here for the camp, I wanted to make the most of it and paid the $30 for the morning session, even though I [...]]]></description>
			<content:encoded><![CDATA[<p><img style="float: left; margin: 20px" src="http://www.bytemining.com/wp-content/uploads/2010/11/ACM.jpg" alt="" height="100" width="100" /><a href="http://www.bytemining.com/2010/03/acm-data-mining-camp-dmcamp/">My first time at ACM Data Mining Camp</a> was so awesome that I was thrilled to make the trip up to San Jose for the <a href="http://www.sfbayacm.org/?p=1854">November 2010 version</a>. In July, I gave a talk at the <a href="http://sloanconsortium.org/et4online">Emerging Technologies for Online Learning Symposium</a> conference with a faculty member in the Department of Statistics, at the <a href="http://www.fairmont.com/sanjose">Fairmont</a>. The place was amazing, and I told myself I would save up to stay there. This trip gave me an opportunity to check it out, and pretend that I am posh for a weekend <img src='http://www.bytemining.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> . The night I arrived I had the best dinner and drinks at this place called <a href="http://www.gordonbiersch.com/">Gordon Biersch</a>. I had the best garlic fries and BBQ burger I have ever had. I ate it with a <a href="http://miamismpix.com/2010/07/17/dragon-fruit-strawberry-mojito-from-gordon-biersch/">Dragonfruit Strawberry Mojito</a>, the Barbados Rum Runner, and finished off with a Long Island Iced Tea, so the drinks were awesome as well. Anyway, to the point of this post&#8230;</p>
<p>The next morning I made the short trek to the <a href="http://www.paypal.com">PayPal</a> headquarters for a very long 9am-8pm day. Since I came up here for the camp, I wanted to make the most of it and paid the $30 for the <a href="http://www.kdnuggets.com/2010/10/acm-data-mining-camp-nov13-san-jose.html">morning session</a>, even though I had not intended on going originally.</p>
<p><b>Overview of Data Mining Algorithms with Dean Abbott</b></p>
<p>The paid morning session from 9-11:30 was led by Dean Abbott (<a href="http://twitter.com/deanabb">@deanabb</a>), the president of <a href="http://www.abbottanalytics.com/data-mining-consulting-services-about.php">Abbott Analytics</a>. It was an excellent overview of the basic data mining algorithms, but obviously 2 hours is not enough time to cover the algorithms in detail. When I first scanned through the slides I was concerned that I would be bored, but I actually learned a few things that made it worth it.</p>
<p>One of the first concepts I learned about was <a href="http://en.wikipedia.org/wiki/CHAID">CHAID</a>, (CHi-squared Automated Interaction Detector) a decision tree algorithm that can build wide n-ary trees rather than just binary trees like in CART. CHAID can also output a p-value, making diagnostic analysis more practical. I also did not know that decision trees could be used as a pre-analysis step to find interactions among variables. The output from this step can be used to construct better regression models including the proper interaction terms.</p>
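<p>To make the chi-squared idea a bit more concrete, here is a minimal sketch of the statistic CHAID relies on when it compares candidate predictors. This is not a CHAID implementation: the function name <tt>chi_squared</tt> and the two tables are made up for illustration, the table is assumed to have no empty row or column, and a real CHAID run would go on to convert the statistic to an adjusted p-value and merge categories.</p>

```python
def chi_squared(table):
    """Pearson chi-squared statistic for a contingency table (list of rows).

    Assumes every row total and column total is nonzero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count we would expect in this cell if predictor and target were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical data. Rows: categories of a predictor; columns: counts per target class.
by_region = [[30, 10], [20, 20], [10, 30]]
by_gender = [[31, 29], [29, 31]]

region_stat = chi_squared(by_region)
gender_stat = chi_squared(by_gender)
print(region_stat, gender_stat)
```

<p>The predictor with the much larger statistic (region, in this made-up data) is the more promising one to split on, and because it has three categories the resulting node could have three children rather than two.</p>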
<p>We moved on to <a href="http://en.wikipedia.org/wiki/Linear_regression">linear regression</a> and <a href="http://en.wikipedia.org/wiki/Logistic_regression">logistic regression</a> which were obviously very basic. Next, we spent some time discussing <a href="http://en.wikipedia.org/wiki/Neural_networks">neural networks</a>. It is no secret that I detest neural networks. I don&#8217;t know what it is, but they annoy me to no end. It seems like there is very little science behind how to choose quantities such as the parameters, number of neurons or number of hidden layers. Maybe it is just me, but neural networks <em>feel </em>like a hack. Besides, anything that can be done with a <a href="http://projecteuclid.org/DPubS/Repository/1.0/Disseminate?view=body&amp;id=pdf_1&amp;handle=euclid.ss/1177010638">neural network can be done using plain old statistics</a>.</p>
<p>Dean also discussed other methods including <a href="http://en.wikipedia.org/wiki/Nearest_neighbor_search">nearest neighbors</a>, <a href="http://en.wikipedia.org/wiki/Radial_basis_functions">Radial Basis Functions</a>, <a href="http://en.wikipedia.org/wiki/Naive_Bayes_classifier">Bayes Classifiers and Naive Bayes</a>, <a href="http://en.wikipedia.org/wiki/Support_vector_machine">Support Vector Machines (SVM)</a>. </p>
<p>At this point, we had to start rushing which was too bad. We briefly discussed <a href="http://en.wikipedia.org/wiki/Machine_learning_ensemble">ensemble methods</a> including <a href="http://en.wikipedia.org/wiki/Bootstrap_aggregating">bagging</a>, <a href="http://en.wikipedia.org/wiki/Boosting">boosting</a> (<a href="http://en.wikipedia.org/wiki/Adaboost">AdaBoost</a>), and <a href="http://en.wikipedia.org/wiki/Random_forest">Random Forests</a>. We spent about 5 minutes on <a href="http://en.wikipedia.org/wiki/Unsupervised_learning">unsupervised</a> methods as well including <a href="http://en.wikipedia.org/wiki/Kmeans">k-means</a>, <a href="http://en.wikipedia.org/wiki/Kohonen_map">Kohonen maps<br />
(self-organizing maps)</a>. I am not sure what happened to <a href="http://en.wikipedia.org/wiki/Principal_component_analysis">principal components analysis (PCA)</a>, <a href="http://en.wikipedia.org/wiki/Multidimensional_scaling">multidimensional scaling (MDS)</a><br />
or <a href="http://en.wikipedia.org/wiki/Independent_component_analysis">independent components analysis (ICA)</a>. As I mentioned to a friend, unsupervised learning always gets the shaft. We had slides for <a href="http://en.wikipedia.org/wiki/Association_rules">association rules</a> (<a href="http://en.wikipedia.org/wiki/Apriori_algorithm">Apriori algorithm</a>), but we did not have time to discuss it. I was hoping <a href="http://en.wikipedia.org/wiki/Semi-supervised_learning">semi-supervised learning</a>, <a href="http://en.wikipedia.org/wiki/Reinforcement_learning">reinforcement learning</a> and<br />
<a href="http://en.wikipedia.org/wiki/Recommendation_system">recommendation systems</a> would be mentioned, but there was not enough time even for what was on the agenda.</p>
<p>I wish we had more time. Unfortunately, there were way too many questions and a few individuals that wished to waste minutes debating and challenging the speaker.</p>
<p><i>Dean Abbott teaches a full, two-day course (not free) in data mining that may be of interest. <a href="http://www.abbottanalytics.com/data-mining-courses-and-seminars.php">Click here</a> for more information. I usually would not post something like this, but he is an excellent and practical speaker.<br />
</i><br />
What I found a bit surprising was that this session was at a Data Mining event. I would hope that most of the people in attendance had familiarity with a good amount of the material. The Netflix session at the previous ACM Data Mining Camp seemed to align better with the target audience of the day&#8217;s events. On the other hand, there were a ton of people in the session. Perhaps this was a good money-maker for the Bay Area ACM: some people may have gotten their training on in the morning and then left after lunch.</p>
<p>The eBay sponsored lunch was phenomenal, just like last time. I got a smoked ham sandwich and my little box also contained a bag of potato chips, an apple, and an oatmeal-raisin muffin looking thing (it was supposed to be a cookie but the baker got carried away).</p>
<p><b>Main Session</b></p>
<p>Next up was the main session, which mainly consisted of a Q&amp;A session with some experts in the field and some job announcements from companies that sponsored the event.</p>
<ul>
<li>Somebody from <a href="http://www.sigkdd.org/">SIGKDD</a> announced the <a href="http://kdd.org/kdd/2011/">SIGKDD 2011</a> conference to be held in San Diego, CA in August 2011.</li>
<li>A research engineer from <a href="http://labs.ebay.com/">eBay</a> discussed the fact that many equate data mining with text mining and search. He drove home the point that at eBay, researchers are interested in other things such as social network analysis and valuing links.</li>
<li>The <a href="http://en.wikipedia.org/wiki/Bayesian_network">Bayesian networks</a> analysis tool <a href="http://www.bayesia.com/">BayesiaLab</a> from <a href="http://www.bayesia.com/">Bayesia</a> was introduced and the developers gave a shout-out to <a href="http://bayes.cs.ucla.edu/jp_home.html">Judea Pearl</a> over at UCLA. Dr. Pearl said about Bayesia, &#8220;This is good stuff!&#8221;</li>
<li><a href="http://www.linkedin.com">LinkedIn</a> talked about some of its new projects including <a href="http://blog.linkedin.com/2010/10/04/linkedin-career-explorer/">CareerExplorer</a>, that takes the professional graph and shows what a college student&#8217;s future career could potentially be. LinkedIn&#8217;s product team has engineers that specialize in machine learning, statistics, and data mining. They also host an &#8220;InDay&#8221; each month which is essentially its version of a hackday. They also mentioned that LinkedIn is investing very heavily in Hadoop, and they just tripled the size of their Hadoop cluster.</li>
<li><a href="http://www.netflix.com">Netflix</a> is hiring &#8220;like crazy&#8221; and expanding internationally. Data mining engineers work on Cinematch technology and other projects.</li>
<li>Joseph Rickert from <a href="http://www.revolutionanalytics.com/">Revolution Analytics</a> introduced the crowd to its commercial version of R.</li>
<li><a href="http://salford-systems.com/">Salford Systems</a> talked a bit about its products including CART and Random Forests.</li>
<li><a href="http://www.sas.com">SAS</a> was also present and mentioned that it is looking for people that want to publish their books on data mining with them.
</li>
</ul>
<p><b>Large Data with R<br />
<img style="float: left; margin: 20px" src="http://www.bytemining.com/wp-content/uploads/2010/11/logo_revolutionanalytics.gif" alt="" /><br />
</b>Given that I gave a talk to the <a href="http://www.meetup.com/LAarea-R-usergroup/">Los Angeles R Users&#8217; Group</a> on <a href="http://www.bytemining.com/2010/08/taking-r-to-the-limit-part-ii-large-datasets-in-r/">working with large datasets in R</a>, I figured this would be an enlightening session. Unfortunately, the R skills that were covered were very basic, and it was little more than a commercial for Revolution Analytics&#8217; version of R. The takeaway from the session was basically just that the Revolution version has optimized methods that read the data into memory in chunks and operate on each chunk (perhaps) independently. This is nothing that a nice integration with Hadoop could not provide. No mention was made of the free open-source solutions for large datasets in R: <tt>bigmemory</tt> and <tt>ff</tt>. </p>
<p>If I had a time machine, I would have instead attended <a href="http://www.zinkov.com/">Rob Zinkov</a>&#8216;s talk on <a href="http://en.wikipedia.org/wiki/Sentiment_analysis">Sentiment Analysis</a>. Rob is a second-year Ph.D. student in Computer Science at University of Southern California&#8217;s Information Sciences Institute and a member of the Los Angeles R Users&#8217; Group.</p>
<p><b>Mahout</b></p>
<p><img style="float: left; margin: 20px" src="http://www.bytemining.com/wp-content/uploads/2010/03/mahout-logo-100-tm.jpg" alt="" />Next up was <a href="http://www.deepdyve.com/corp/about/ted_dunning">Ted Dunning</a> discussing <a href="http://mahout.apache.org/">Mahout</a>. I was elated to see practically every hand in the room shoot up when we were asked to vote on which sessions we wanted to attend. Mahout is a Java framework that provides scalable machine learning and data mining algorithms. Mahout code interacts with <a href="http://hadoop.apache.org/">Hadoop</a> to provide map-reduce functionality for algorithms. The purpose of Mahout is to provide early production quality scalable data mining. Some classification methods currently in Mahout include <a href="http://en.wikipedia.org/wiki/Mixture_model">mixture modeling</a>, <a href="http://en.wikipedia.org/wiki/Latent_Dirichlet_Allocation">Latent Dirichlet Allocation (LDA)</a>, logistic regression, naive Bayes, <a href="http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf">Complementary Naive Bayes</a>, latent factor loglinear algorithms, <a href="http://en.wikipedia.org/wiki/Stochastic_gradient_descent">stochastic gradient descent SVM</a>, and random forests. Some of these methods are parallel, and some are sequential. Large scale SVD is currently being worked on, and still has some rough edges.</p>
<p>The biggest news in this talk was how well Mahout has been snapped up by industry. AOL uses Mahout&#8217;s methods for its product recommendation services. &#8220;A large scale electronics company&#8221; (name was secret) uses Mahout for music recommendations. Other uses of Mahout in industry include frequent itemset mining, and spam website detection.</p>
<p>Dunning mentioned that Mahout works well with sparse matrices, under the assumption that any unspecified element of the matrix is equal to 0. If I understood his statement correctly, this means that Mahout works well with most sparse matrices. A few more technical gems I learned: Mahout can do stochastic gradient descent (although it is sequential), and its implementation uses per-term annealing, which can then be used for supervised learning with logistic regression. These implementations are optimized for high-dimensional sparse data, possibly with interactions. These methods are scalable and fast to train. Ted mentioned that for a particular test case, the optimization converged in fewer than 10,000 examples. For large datasets, it is possible that the method will converge before seeing all of the data. With that said, in the &#8220;best&#8221; case, an algorithm using stochastic gradient descent can be sublinear in the number of training examples.</p>
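<p>Here is a rough sketch, in Python rather than Java, of what sequential stochastic gradient descent for logistic regression on sparse data can look like. It is not Mahout&#8217;s code. The class name <tt>SparseSGDLogistic</tt>, the toy examples, and the particular annealing schedule (each term&#8217;s learning rate shrinks with the square root of how often that term has been seen) are all my own assumptions, chosen only to illustrate the &#8220;per-term&#8221; idea and the &#8220;unspecified means 0&#8221; convention.</p>

```python
import math

def sigmoid(z):
    # Clamp z so math.exp cannot overflow on extreme scores.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

class SparseSGDLogistic:
    """Sequential SGD for logistic regression over sparse {term: value} dicts."""

    def __init__(self, base_rate=0.5):
        self.base_rate = base_rate
        self.weights = {}   # a missing weight is treated as 0
        self.seen = {}      # how many updates each term has received

    def predict(self, features):
        # Only the terms present in the example contribute; all others are 0.
        z = sum(self.weights.get(t, 0.0) * v for t, v in features.items())
        return sigmoid(z)

    def update(self, features, label):
        error = label - self.predict(features)
        for term, value in features.items():
            self.seen[term] = self.seen.get(term, 0) + 1
            # Assumed schedule: anneal each term's rate by its own update count.
            rate = self.base_rate / math.sqrt(self.seen[term])
            self.weights[term] = self.weights.get(term, 0.0) + rate * error * value

# Hypothetical two-example "corpus", streamed through the model repeatedly.
model = SparseSGDLogistic()
examples = [({"good": 1.0, "fries": 1.0}, 1), ({"bad": 1.0, "fries": 1.0}, 0)]
for _ in range(500):
    for features, label in examples:
        model.update(features, label)

p_good = model.predict({"good": 1.0, "fries": 1.0})
p_bad = model.predict({"bad": 1.0, "fries": 1.0})
p_unknown = model.predict({"never_seen": 1.0})
```

<p>Each update touches only the terms present in one example, which is why this style of training is cheap on high-dimensional sparse data, and rare terms keep a larger learning rate than common ones.</p>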
<p>Towards the end of the session, Ted answered some questions personally, and it gave me some insight into data mining methods. He is not a fan of &#8220;most common&#8221; itemset algorithms (Apriori, Eclat, etc.) because they are difficult to parallelize due to their quadratic nature. Instead, he prefers co-occurrence analysis methods. He also prefers <a href="http://www.r-project.com">R</a> to <a href="http://www.cs.waikato.ac.nz/ml/weka/">Weka</a>, and he loves <a href="http://www.python.org">Python</a>. I also prefer R to Weka, and love Python <img src='http://www.bytemining.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> .</p>
<p><b>Large Scale Supervised Learning</b></p>
<p>The next talk I attended was rehearsed with slides etc. and was presented by <a href="http://www.junlinghu.com/">Junling Hu</a>, from eBay. Junling has a Ph.D. in Computer Science from University of Michigan, Ann Arbor. Although the talk began with (another) quick review of data mining algorithms, the meat of the talk was on how to parallelize some of these algorithms. The challenge of parallelization is that we must maintain a global function, and messages must be passed to update this global function. One basic way to do this is how map-reduce does it: split the data into subsets, perform some function on each subset and reduce the computations into one result. Each method has its own way it can be parallelized.</p>
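<p>The split, apply, reduce pattern described above can be sketched on a single machine in a few lines of Python. The helper names (<tt>split</tt>, <tt>map_chunk</tt>, <tt>combine</tt>) and the data are made up for illustration; in a real map-reduce job each chunk would be handled by a different worker.</p>

```python
from functools import reduce

def split(data, n_chunks):
    """Deal the records into n_chunks roughly equal subsets."""
    return [data[i::n_chunks] for i in range(n_chunks)]

def map_chunk(chunk):
    """Per-subset work: partial (count, sum, sum of squares)."""
    return (len(chunk), sum(chunk), sum(x * x for x in chunk))

def combine(a, b):
    """Reduce step: merge two partial results into one."""
    return tuple(x + y for x, y in zip(a, b))

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Each call to map_chunk only sees its own subset of the data.
partials = [map_chunk(chunk) for chunk in split(data, 3)]

# The "global" result is rebuilt purely from the partial results.
n, total, total_sq = reduce(combine, partials)
mean = total / n
variance = total_sq / n - mean ** 2
print(n, mean, variance)
```

<p>This works because the count, sum and sum of squares can be merged in any order. Algorithms whose global state cannot be decomposed this way are the ones that are hard to parallelize.</p>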
<p><em>Decision Trees. </em>One type of parallelization for decision trees is based on the tree nodes. One can write a map-reduce function to recursively split nodes, like the <a href="http://www.vldb.org/pvldb/2/vldb09-537.pdf">PLANET method proposed by Google</a>. We have some sequence of nodes in a tree and we maintain a model. With the proposed framework, we start with an empty set of nodes. We maintain a map-reduce queue for all the nodes we are going to divide and we also maintain an in-memory queue of nodes. The goal is to find the best splits based on some measure, in PLANET&#8217;s case, variance. We run some controller that controls map-reduce jobs. Each map-reduce job sends back data and we update the global variables: the model, the map-reduce queue and the in-memory queue. Then, new map-reduce jobs are constructed and the process continues. Due to time constraints, I had a difficult time following all of what was going on, but more information about the algorithm discussed can be found in the PLANET paper linked above.</p>
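<p>To illustrate just the &#8220;best split by variance&#8221; criterion, here is a small single-machine sketch for one numeric feature. It is not PLANET: there is no controller, no queues and no map-reduce, and the function names and data are my own. PLANET&#8217;s contribution is computing this kind of score for many nodes and features at once across a cluster.</p>

```python
def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def best_split(xs, ys):
    """Return (threshold, weighted child variance) for the best binary split."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        # Size-weighted average of the variance in the two child nodes.
        score = (len(left) * variance(left) + len(right) * variance(right)) / len(pairs)
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or best[1] > score:
            best = (threshold, score)
    return best

# Hypothetical data: the target jumps when the feature goes from 3 to 10.
xs = [1, 2, 3, 10, 11, 12]
ys = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
threshold, score = best_split(xs, ys)
print(threshold, score)
```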
<p><em>Support Vector Machines. </em>Hu mentioned two types of SVMs: primal SVM and parallel SVM. The idea behind parallelizing SVMs is to use block-based optimization by dividing either the data, or the features, into blocks. Stochastic gradient descent can be used for block minimization for the primal SVM. Some other ways mentioned included randomly splitting the data (bootstrapping perhaps), or using data compression (dimension reduction, perhaps). One resource for parallel SVM is the <a href="http://code.google.com/p/psvm/">psvm project on Google Code</a> which provides distributed gradient computation for maximum entropy, and parallel logistic regression.</p>
<p>Junling listed a few resources:</p>
<ul>
<li><a href="http://code.google.com/p/ml-mapreduce/">ML-MapReduce</a>, routines for machine learning using map-reduce (currently only logistic regression).</li>
<li><a href="http://mahout.apache.org">Mahout</a></li>
<li><a href="http://www.alphaworks.ibm.com/tech/pml">IBM Parallel Machine Learning Toolbox</a>, which is blackbox and not open-source.</li>
</ul>
<p><b>Monetizing Images</p>
<p></b>The final time slot was slim pickings for me, and the second time slot (when I attended Mahout) hosted 4 sessions I wanted to attend that all conflicted with each other. The discussion about the association between tweets and stock prices sounded interesting, except for the stock prices part. So, I attended the Monetizing Images session. This session was more of a discussion about data mining with images in general.</p>
<ul>
<li>Apparently, Facebook can identify brands in images and use these brands for advertising. I have not seen this but I do not really doubt that it is true.</li>
<li><a href="http://www.photosynth.net">Microsoft Photosynth</a> crawls freely available images on the web and uses these images to create an entire scene from them, essentially allowing someone to tour Rome using just images on the web.</li>
<li>The <a href="http://www.usps.gov">United States Postal Service</a> uses k-nearest neighbors for intelligent character recognition (ICR) used to read addresses on envelopes.</li>
<li>Google Goggles</li>
<li><a href="http://www.zunavision.com">ZunaVision</a> allows advertisers to embed logos and ads into a video with more flexibility than with things like green screens used on football fields etc.
</li>
</ul>
<p>We also discussed <a href="http://en.wikipedia.org/wiki/Forensic_photography">forensic photography</a> and the ability to detect if an image has been doctored. We also discussed some techniques for measuring image similarity. <a href="http://www.cs.ubc.ca/%7Elowe/">David Lowe</a> from the University of British Columbia maintains a <a href="http://www.cs.ubc.ca/%7Elowe/vision.html">list of uses and companies regarding computer vision on his website</a>.</p>
<p>It is interesting to note that after the fact, Ken Weiner (<a href="http://twitter.com/kweiner">@kweiner</a>) from <a href="http://gumgum.com/">GumGum</a> in Los Angeles indicated that monetizing images is exactly what they are doing. Sounds interesting!</p>
<div align="center"><img src="http://www.bytemining.com/wp-content/uploads/2010/11/24591_963690848676_2507506_52878373_4891738_n.jpg" alt="" height="176" width="240" />
</div>
<p>At this point I was exhausted. I like meeting Twitter friends and followers, but people were very quiet! It was a pleasure to meet Scott Waterman (<a href="http://twitter.com/tswaterman">@tswaterman</a>) and Tommy Chheng (<a href="http://twitter.com/tommychheng">@tommychheng</a>). I also got to reconnect with my friend Shaun Ahmadian (<a href="http://twitter.com/ssahmadian">@ssahmadian</a>) from the UCLA Department of Computer Science as well as Rob Zinkov (<a href="http://twitter.com/zaxtax">@zaxtax</a>) who also made the trek from Los Angeles to San Jose. </p>
]]></content:encoded>
			<wfw:commentRss>http://www.bytemining.com/2010/11/my-day-at-acm-data-mining-camp-iii-dmcamp/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Exciting Tools for Big Data: S4, Sawzall and mrjob!</title>
		<link>http://www.bytemining.com/2010/11/exciting-tools-for-big-data-s4-sawzall-and-mrjob/</link>
		<comments>http://www.bytemining.com/2010/11/exciting-tools-for-big-data-s4-sawzall-and-mrjob/#comments</comments>
		<pubDate>Sun, 07 Nov 2010 07:00:52 +0000</pubDate>
		<dc:creator>Ryan</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Amazon EC2]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Web Mining]]></category>

		<guid isPermaLink="false">http://www.bytemining.com/?p=412</guid>
		<description><![CDATA[<p>This week, a few different big data processing tools were released to the open-source community. I know, I know, this is probably the 1000th blog post about this, and perhaps the train has left the station without me, but here I am.</p>
<p>Yahoo&#8217;s S4: Distributed Stream Computing Platform</p>
<p></p>
<p>First off, it must be said. S4 is NOT real-time map-reduce! This is the meme that has been floating around the Internets lately. </p>
<p>S4 is a distributed, scalable, partially fault-tolerant, pluggable platform that allows users to create applications that process unbounded streaming data. It is not a Hadoop project. As a matter of fact, it is not even a form of map-reduce. S4 was developed at Yahoo for personalization of search advertising products. Map-reduce, so far, is not a great platform for dealing with streaming/non-stored data.</p>
<p>Pieces of data, apparently called events, are sent and consumed by a Processing Element (yes, PE, but not the kind that requires you to sweat). The PEs can do one of two things:</p>

emit another event that will be consumed by another PE, or
publish some result


<p>Streaming data is different from non-streaming data in that the user does not know how much data will be transmitted, and at what rate. Analysis on [...]]]></description>
			<content:encoded><![CDATA[<p>This week, a few different big data processing tools were released to the open-source community. I know, I know, this is probably the 1000th blog post about this, and perhaps the train has left the station without me, but here I am.</p>
<p><b>Yahoo&#8217;s S4: Distributed Stream Computing Platform</b></p>
<p><img src="http://www.bytemining.com/wp-content/uploads/2010/11/s4.png" alt="" /></p>
<p>First off, it must be said. <em>S4 is NOT real-time map-reduce!</em> This is the meme that has been floating around the Internets lately. </p>
<p><a href="http://s4.io/">S4</a> is a distributed, scalable, partially fault-tolerant, pluggable platform that allows users to create applications that process unbounded streaming data. It is <em>not </em>a <a href="http://hadoop.apache.org/">Hadoop</a> project. As a matter of fact, it is not even a form of <a href="http://en.wikipedia.org/wiki/MapReduce">map-reduce</a>. S4 was developed at Yahoo for personalization of search advertising products. Map-reduce, so far, is not a great platform for dealing with streaming/non-stored data.</p>
<p>Pieces of data, apparently called <em>events</em>, are sent and consumed by a Processing Element (yes, PE, but not the kind that requires you to sweat). The PEs can do one of two things:</p>
<ol>
<li>emit another event that will be consumed by another PE, or</li>
<li>publish some result
</li>
</ol>
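<p>The two things a PE can do are easy to mimic in a toy Python loop. To be clear, this is not the S4 API (S4 is a Java platform), and every name here (<tt>WordSplitterPE</tt>, <tt>WordCounterPE</tt>, <tt>run</tt>, the event kinds) is hypothetical. The sketch only shows one PE emitting events for another PE to consume, and a second PE publishing a result.</p>

```python
from collections import defaultdict, deque

class WordSplitterPE:
    """Consumes 'line' events and emits one 'word' event per word."""
    def process(self, payload):
        return [("word", w.lower()) for w in payload.split()]

class WordCounterPE:
    """Consumes 'word' events, keeps running counts, and emits nothing."""
    def __init__(self):
        self.counts = defaultdict(int)

    def process(self, payload):
        self.counts[payload] += 1
        return []

    def publish(self):
        # The second kind of PE behavior: publish some result.
        return dict(self.counts)

def run(stream, routes):
    """Deliver each event to the PE registered for its kind."""
    queue = deque(stream)
    while queue:
        kind, payload = queue.popleft()
        # Whatever the PE emits goes back on the queue for other PEs.
        queue.extend(routes[kind].process(payload))

counter = WordCounterPE()
routes = {"line": WordSplitterPE(), "word": counter}
run([("line", "S4 is not Hadoop"), ("line", "S4 is streaming")], routes)
result = counter.publish()
print(result)
```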
<p>Streaming data is different from non-streaming data in that the user does not know how much data will be transmitted, and at what rate. Analysis on streaming data should not rely on storing the data, as the amount of required disk space is unknown. Additionally, the processing of the data is likely to take longer than the rate of transmission would allow. Since the data is not stored, special algorithms must be developed for aggregating and analyzing data. <a href="http://aicoder.blogspot.com">Neal Richter</a> (<a href="http://twitter.com/nealrichter">@nealrichter</a>) has an <a href="http://aicoder.blogspot.com/2009/09/references-for-mining-from-streaming_18.html">excellent list of resources on research on the management and analysis of streaming data</a>.</p>
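<p>As a small example of the kind of &#8220;special algorithm&#8221; this calls for, the sketch below keeps a running mean and variance using Welford&#8217;s online update, so each value can be thrown away as soon as it has been seen. The class name <tt>RunningStats</tt> and the sample values are mine, not anything from S4.</p>

```python
class RunningStats:
    """Running mean and variance in constant memory (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0   # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # Uses the old mean (in delta) and the new mean together.
        self.m2 += delta * (x - self.mean)

    def variance(self):
        """Population variance of everything seen so far."""
        return self.m2 / self.n if self.n else 0.0

stats = RunningStats()
# iter() stands in for a stream: values arrive one at a time and are never stored.
for value in iter([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]):
    stats.update(value)

print(stats.n, stats.mean, stats.variance())
```

<p>The state is three numbers no matter how long the stream runs, which is the property that matters when the amount of incoming data is unknown.</p>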
<p>More information can be found at the <a href="http://wiki.s4.io/">S4 Wiki</a> and <a href="http://s4.io/">S4 main site</a>, that contains <a href="http://wiki.s4.io/Tutorials/Tutorials">tutorials</a>, a <a href="http://wiki.s4.io/Manual/Manual">manual</a>, a <a href="http://wiki.s4.io/Cookbook/Cookbook">cookbook</a> as well as API documentation. The Yahoo project page, which contains very little information, can be found <a href="http://labs.yahoo.com/event/99">here</a>. The <a href="https://github.com/s4">source code</a> is on everybody&#8217;s favorite site, <a href="https://github.com/s4">GitHub</a>.</p>
<p>S4 is released under the Open Source Apache 2.0 license. It must also be said that <em>S4 is not to be confused with <a href="http://aws.amazon.com/s3/">S3</a>!</em><strong> </strong>They are two totally different technologies!<br />
<b><br />
</b>Remember, S4 is not a Hadoop project. As a matter of fact, Bill McColl over at Gigaom has pondered a &#8220;<a href="http://gigaom.com/cloud/beyond-hadoop-next-generation-big-data-architectures/">NoHadoop</a>&#8221; movement&#8230;that parallels (see what I did there?) our favorite NoSQL movement.<br />
<b><br />
Google&#8217;s Sawzall: Programming Language for Big Data</p>
<p><img src="http://www.bytemining.com/wp-content/uploads/2010/11/about_logo.gif" alt="" /><br />
</b>Google made a contribution of its own. <a href="http://en.wikipedia.org/wiki/Sawzall_%28programming_language%29">Sawzall</a> is an interpreted, procedural <a href="http://en.wikipedia.org/wiki/Domain-specific_language">DSL</a> for working with huge amounts of data.&nbsp; <a href="http://glinden.blogspot.com/">Greg Linden</a> (<a href="http://twitter.com/greglinden">@greglinden</a>) made an interesting comparison, <a href="http://glinden.blogspot.com/2007/04/yahoo-pig-and-google-sawzall.html">suggesting that Yahoo&#8217;s Pig project is similar to Google&#8217;s Sawzall project</a>. At Google, it is used on top of existing systems including <a href="http://en.wikipedia.org/wiki/Protocol_Buffers">Protocol Buffers</a>, the <a href="http://en.wikipedia.org/wiki/Google_File_System">Google File System</a> and MapReduce. Sawzall reads one line of data at a time, and does not preserve state between reads so it is useful in the map phase of a map-reduce job. There are also routines for statistical aggregation that can be used in a reduce phase. Users compile Sawzall source using the <a href="http://code.google.com/p/szl/">szl</a> compiler that can be found <a href="http://code.google.com/p/szl/">here</a>.</p>
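<p>I have not written any Sawzall myself, so rather than guess at its syntax, here is the same shape of computation in plain Python: a function that sees exactly one record and keeps no state, plus an aggregator that everything is emitted into. The names <tt>SumTable</tt> and <tt>process_record</tt> and the log format are made up for this sketch.</p>

```python
from collections import defaultdict

class SumTable:
    """Aggregator: sums every value emitted under each key."""
    def __init__(self):
        self.totals = defaultdict(float)

    def emit(self, key, value):
        self.totals[key] += value

def process_record(record, table):
    """Per-record logic: sees exactly one record and remembers nothing."""
    method, path, size = record.split()
    table.emit(method, float(size))

bytes_by_method = SumTable()

# Hypothetical log lines: HTTP method, path, response size in bytes.
log = ["GET /index.html 512", "POST /login 128", "GET /about.html 256"]
for record in log:
    # Records are independent, so this loop could be spread over many machines.
    process_record(record, bytes_by_method)

totals = dict(bytes_by_method.totals)
print(totals)
```

<p>Because <tt>process_record</tt> carries nothing over from one record to the next, it fits naturally in a map phase, while the aggregator plays the role of the reduce phase.</p>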
<p>Much more detailed information can be found on the <a href="http://code.google.com/p/szl/wiki/Overview">Google Code overview site for the szl project</a>. For programming language buffs, the language specification can be found <a href="http://szl.googlecode.com/svn/doc/sawzall-spec.html">here</a>.</p>
<p>The research publication discussing this project in more detail is <a href="http://research.google.com/archive/sawzall.html">here</a>.<br />
<b><br />
Yelp&#8217;s mrjob: Distributed Computing for Everybody<br />
<img src="http://www.bytemining.com/wp-content/uploads/2010/11/yelp2.jpg" alt="" height="85" width="108" /><br />
</b>Ok, I&#8217;ve ignored Hadoop long enough&#8230;</p>
<p>Every time you write a review complaining about the terrible gas the burrito at El Torasco&#8217;s gave you, or the amazing buzz you got from their margaritas, <a href="http://www.yelp.com">Yelp</a> processes it and extracts some type of information from it. Yelp accumulates about 100GB of data per day! Naturally, Yelp analyzes this data using map-reduce, <a href="http://aws.amazon.com/elasticmapreduce/"><em>Amazon Elastic </em>MapReduce</a> to be exact. </p>
<p>You see, most companies are building <em>up</em> their Hadoop clusters but Yelp decided to tear theirs down. In May 2010, Yelp engineering moved its data processing to Amazon. <i><strong>mrjob is Yelp&#8217;s Python framework for writing map-reduce jobs and interacting with Amazon EMR!</strong></i></p>
<p>Below is an example from their engineering blog. It is so simple it is beautiful!</p>
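<pre class="brush: python; title: ; notranslate">
from mrjob.job import MRJob
import re

# Match runs of word characters and apostrophes (e.g. don't).
WORD_RE = re.compile(r&quot;[\w']+&quot;)

class MRWordFreqCount(MRJob):
    def mapper(self, _, line):
        # The input key is unused; emit (word, 1) for each word on the line.
        for word in WORD_RE.findall(line):
            yield (word.lower(), 1)

    def reducer(self, word, counts):
        # counts holds all the 1s emitted for this word; add them up.
        yield (word, sum(counts))

if __name__ == '__main__':
    MRWordFreqCount().run()
</pre>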
<p>
The mrjob code is available on <a href="http://github.com/Yelp/mrjob">GitHub</a> as is the <a href="http://packages.python.org/mrjob/">Python documentation</a>.<br />
<i><br />
Oh, and El Torasco&#8217;s is meant to be a fictional name I use in this post.</i></p>
]]></content:encoded>
			<wfw:commentRss>http://www.bytemining.com/2010/11/exciting-tools-for-big-data-s4-sawzall-and-mrjob/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Accessing R from Python using RPy2</title>
		<link>http://www.bytemining.com/2010/10/accessing-r-from-python-using-rpy2/</link>
		<comments>http://www.bytemining.com/2010/10/accessing-r-from-python-using-rpy2/#comments</comments>
		<pubDate>Mon, 25 Oct 2010 04:27:03 +0000</pubDate>
		<dc:creator>Ryan</dc:creator>
				<category><![CDATA[Algorithms]]></category>

		<guid isPermaLink="false">http://www.bytemining.com/?p=397</guid>
		<description><![CDATA[<p>This past Tuesday I had the opportunity to present a short talk (a bit long) related to text mining at the Los Angeles R Users&#8217; Group. Since I do most of my text mining in Python, I took this opportunity to discuss RPy2, an interface to R from Python.    My slides are below:
</p>
Accessing R from Python using RPy2
View more presentations from Ryan Rosario.
<p>    
Download/view slides here.    Topics include

Using Python with R with an example using web mining.
Web mining using pure R rather than Python. 

<p>  Code for demonstration is here:

offtopic_demo.py is a pure Python script that extracts data from a web forum and dumps it to disk. To actually use it, you will need to register for an account.
RPy2_demo.py reads the data from the forum from disk and calls R from Python to perform some basic analysis. 
curljson_demo.R grabs some JSON data from the Twitter Search API using RCurl and converts it to R lists using rjson.

<p>Video:


  Running the code requires some packages that you need to install.

twill package for web browsing, that installs a Python package for you. Requires the mechanize package as well. twill is a wrapper [...]]]></description>
			<content:encoded><![CDATA[<p>This past Tuesday I had the opportunity to present a short talk (a bit long) related to text mining at the <a href="http://www.meetup.com/LAarea-R-usergroup/">Los Angeles R Users&#8217; Group</a>. Since I do most of my text mining in Python, I took this opportunity to discuss <a href="http://rpy.sourceforge.net/rpy2_download.html">RPy2, an interface to R from Python</a>.    My slides are below:<br />
<center></p>
<div style="width: 425px;" id="__ss_5548926"><strong style="display: block; margin: 12px 0pt 4px;"><a href="http://www.slideshare.net/bytemining/rpy2" title="Rpy2">Accessing R from Python using RPy2</a></strong><object id="__sse5548926" height="355" width="425"><param name="movie" value="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=rpy2-101024223800-phpapp02&amp;stripped_title=rpy2&amp;userName=bytemining" /></object><br />
View more <a href="http://www.slideshare.net/">presentations</a> from <a href="http://www.slideshare.net/bytemining">Ryan Rosario</a>.</div>
<p></center>    <br />
Download/view slides <a href="http://www.bytemining.com/wp-content/uploads/2010/10/rpy2.pdf">here</a>. Topics include:
<ul>
<li>Using Python with R with an example using web mining.</li>
<li>Web mining using pure R rather than Python. </li>
</ul>
<p>  <strong>Code</strong> for demonstration is here:
<ol>
<li><a href="http://www.bytemining.com/wp-content/uploads/2010/10/offtopic_demo.py">offtopic_demo.py</a> is a pure Python script that extracts data from a web forum and dumps it to disk. <strong>To actually use it, you will need to register for an account.</strong></li>
<li><a href="http://www.bytemining.com/wp-content/uploads/2010/10/RPy2_demo.py">RPy2_demo.py</a> reads the data from the forum from disk and calls R from Python to perform some basic analysis. </li>
<li><a href="http://www.bytemining.com/wp-content/uploads/2010/10/curljson_demo.r">curljson_demo.R</a> grabs some JSON data from the Twitter Search API using RCurl and converts it to R lists using rjson.</li>
</ol>
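<p>As a rough illustration of the second step (read data from disk, then hand it to R), here is a minimal sketch. It is not the code from <tt>RPy2_demo.py</tt>. The JSON-lines file layout, the <tt>body</tt> field, and the function names <tt>load_post_lengths</tt> and <tt>summarize_in_r</tt> are assumptions made for this example. The R side uses only <tt>rpy2.robjects</tt>, and it needs rpy2 plus an R build that rpy2 can load.</p>

```python
import json


def load_post_lengths(path):
    """Return the word count of each post in a JSON-lines dump.

    Assumed layout (hypothetical): one JSON object per line, with the
    post text stored under the key "body".
    """
    lengths = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            post = json.loads(line)
            lengths.append(len(post["body"].split()))
    return lengths


def summarize_in_r(values):
    """Compute the mean and standard deviation of values in R via rpy2."""
    # Imported lazily so that load_post_lengths works without rpy2 or R.
    import rpy2.robjects as robjects

    vector = robjects.FloatVector(values)  # Python list to R numeric vector
    r_mean = robjects.r["mean"]            # look up R's mean() function
    r_sd = robjects.r["sd"]                # look up R's sd() function
    # R returns length-one vectors; index 0 extracts the Python float.
    return {"mean": r_mean(vector)[0], "sd": r_sd(vector)[0]}
```

<p>With a dump on disk, <tt>summarize_in_r(load_post_lengths("posts.jsonl"))</tt> would return the two statistics as a Python dictionary; the file name here is a placeholder.</p>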
<p><strong>Video</strong>:<br />
<center><br />
<embed src="http://blip.tv/play/hoYTgob6UQA%2Em4v" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" height="480" width="640"></embed></center><br />
  Running the code requires a few packages that you will need to install:
<ul>
<li><a href="http://twill.idyll.org/">twill</a>, a web browsing tool that also installs a Python package. It requires the <a href="http://wwwsearch.sourceforge.net/mechanize/">mechanize</a> package as well, because twill is a wrapper around mechanize.</li>
<li><a href="http://www.crummy.com/software/BeautifulSoup/">BeautifulSoup</a> package for Python for HTML parsing.</li>
<li>R must be built as a shared library by configuring it with <tt>--enable-R-shlib</tt>; otherwise Python cannot call it.</li>
<li><a href="http://rpy.sourceforge.net/rpy2_download.html">RPy2</a>, the Python interface to R.</li>
</ul>
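<p>Before running the demos, it can help to check which of the Python packages listed above are importable. The sketch below uses only the standard library. The module names in <tt>REQUIRED_MODULES</tt> are assumptions about how each package is imported (for example, BeautifulSoup 3 is imported as <tt>BeautifulSoup</tt>, while version 4 uses <tt>bs4</tt>), and <tt>missing_modules</tt> is a hypothetical helper, not part of any of these packages. It does not check whether R itself was built as a shared library.</p>

```python
import importlib.util

# Top-level module names assumed for the packages listed above.
REQUIRED_MODULES = ["twill", "mechanize", "BeautifulSoup", "rpy2"]


def missing_modules(names):
    """Return the module names that cannot be found by this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

<p>Calling <tt>missing_modules(REQUIRED_MODULES)</tt> returns a list of the names that still need to be installed; an empty list means all of them were found.</p>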
<p>  To see the main talk of the evening, click here. </p>
<p><b>Some Recommended Books<br />
</b><br />
Natural Language Processing
<ul>
<li><a href="http://www.amazon.com/Foundations-Statistical-Natural-Language-Processing/dp/0262133601/ref=sr_1_2?ie=UTF8&amp;qid=1288050429&amp;sr=8-2">Foundations of Statistical Natural Language Processing</a>, Manning and Schuetze.</li>
<li><a href="http://www.amazon.com/Speech-Language-Processing-Daniel-Jurafsky/dp/0131873210/ref=sr_1_3?ie=UTF8&amp;qid=1288050480&amp;sr=8-3">Speech and Language Processing</a>, Jurafsky and Martin.</li>
<li><a href="http://www.amazon.com/Natural-Language-Processing-Text-Mining/dp/184628175X/ref=sr_1_7?ie=UTF8&amp;qid=1288050529&amp;sr=8-7">Natural Language Processing and Text Mining</a>, Kao and Poteet.
</li>
</ul>
<p>Text Mining</p>
<ul>
<li><a href="http://www.amazon.com/Practical-Mining-Wiley-Methods-Applications/dp/0470176431/ref=sr_1_1?ie=UTF8&amp;s=books&amp;qid=1288051476&amp;sr=8-1">Practical Text Mining with Perl</a>, Bilisoly. <a href="http://www.jstatsoft.org/v29/b09/paper">See my review of this book in the Journal of Statistical Software, which is also excerpted on Amazon</a>!
</li>
<li><a href="http://www.amazon.com/Text-Mining-Applications-Michael-Berry/dp/0470749822/ref=sr_1_2?ie=UTF8&amp;qid=1288050832&amp;sr=8-2">Text Mining: Applications and Theory</a>, Berry and Kogan (NEW).</li>
<li><a href="http://www.amazon.com/Text-Mining-Handbook-Approaches-Unstructured/dp/0521836573/ref=pd_sim_b_1">The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data</a>, Feldman and Sanger.</li>
<li><a href="http://www.amazon.com/Mastering-Regular-Expressions-Jeffrey-Friedl/dp/0596528124/ref=sr_1_1?ie=UTF8&amp;s=books&amp;qid=1288051437&amp;sr=8-1">Mastering Regular Expressions</a>, Friedl.
</li>
</ul>
<p>Data Mining  </p>
<ul>
<li><a href="http://www.amazon.com/Elements-Statistical-Learning-Prediction-Statistics/dp/0387848576/ref=sr_1_1?ie=UTF8&amp;s=books&amp;qid=1288051520&amp;sr=8-1">Elements of Statistical Learning: Data Mining, Inference and Prediction</a>. Hastie, Tibshirani and Friedman.</li>
<li><a href="http://www.amazon.com/Data-Mining-Concepts-Techniques-Management/dp/1558609016/ref=sr_1_4?s=books&amp;ie=UTF8&amp;qid=1288051506&amp;sr=1-4">Data Mining: Concepts and Techniques</a> (recommended by <a href="http://www.twitter.com/nealrichter">@nealrichter</a>). Han, Kamber and Pei.&nbsp;</li>
<li><a href="http://www.amazon.com/Data-Mining-Practical-Techniques-Management/dp/0120884070/ref=sr_1_1?s=books&amp;ie=UTF8&amp;qid=1288051506&amp;sr=1-1">Data Mining: Practical Machine Learning Tools and Techniques</a> [the fern book]. Witten and Frank.</li>
<li><a href="http://www.amazon.com/Introduction-Data-Mining-Pang-Ning-Tan/dp/0321321367/ref=sr_1_3?s=books&amp;ie=UTF8&amp;qid=1288051506&amp;sr=1-3">Introduction to Data Mining</a> [the rock book]. Tan, Steinbach, Kumar.
</li>
</ul>
<p>Web Mining</p>
<ul>
<li><a href="http://www.amazon.com/Web-Data-Mining-Data-Centric-Applications/dp/3642072372/ref=sr_1_2?ie=UTF8&amp;qid=1288050588&amp;sr=8-2">Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data</a>, Liu.</li>
<li><a href="http://www.amazon.com/Mining-Web-Discovering-Knowledge-Hypertext/dp/1558607544/ref=sr_1_1?ie=UTF8&amp;qid=1288050634&amp;sr=8-1">Mining the Web: Discovering Knowledge from Hypertext Data</a>, Chakrabarti.</li>
<li><a href="http://www.amazon.com/Mining-Graph-Data-Diane-Cook/dp/0471731900/ref=sr_1_1?ie=UTF8&amp;qid=1288050692&amp;sr=8-1">Mining Graph Data</a>, Cook and Holder.</li>
<li><a href="http://www.amazon.com/Managing-Mining-Advances-Database-Systems/dp/1441960449/ref=sr_1_3?ie=UTF8&amp;qid=1288050728&amp;sr=8-3">Managing and Mining Graph Data</a>, Aggarwal and Wang.</li>
<li><a href="http://www.amazon.com/Social-Network-Analysis-Applications-Structural/dp/0521387078/ref=sr_1_2?ie=UTF8&amp;qid=1288050766&amp;sr=8-2">Social Network Analysis: Methods and Applications</a>, Wasserman and Faust.
</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.bytemining.com/2010/10/accessing-r-from-python-using-rpy2/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>

