This past Tuesday I had the opportunity to give a short talk (which ran a bit long) on text mining at the Los Angeles R Users’ Group. Since I do most of my text mining in Python, I used the occasion to discuss RPy2, an interface to R from Python. My slides are below:
Accessing R from Python using RPy2
View more presentations from Ryan Rosario.
Download/view slides here. Topics include
Using Python with R with an example using web mining.
Web mining using pure R rather than Python.
Code for demonstration is here:
offtopic_demo.py is a pure Python script that extracts data from a web forum and dumps it to disk. To actually use it, you will need to register for an account.
RPy2_demo.py reads the data from the forum from disk and calls R from Python to perform some basic analysis.
curljson_demo.R grabs some JSON data from the Twitter Search API using RCurl and converts it to R lists using rjson.
Running the code requires some packages that you need to install.
twill package for web browsing, that installs a Python package for you. Requires the mechanize package as well. twill is a wrapper [...]
A couple of weeks ago, Bradford Cross of FlightCaster posted in Measuring Measures that transactions are the next big data category. I argue that they already are. From reading his post, he seems to suggest this as well, though I admit I may have missed his point. There are some clear examples of transactions and their importance:
Itemset Mining. Cross discusses this in his article. Financial transactions on sites like Amazon contain items (merchandise). Using these transactions, Amazon built a recommendation engine that suggests new items to customers on its website and even customizes deals for them via email and on the site.
Wireless Localization. Fantasyland at The Magic Kingdom in Walt Disney World was to undergo a big overhaul to provide a personalized experience based on transactions throughout the park. An RFID chip would be included in a ticket (or some type of document) and the visitor’s information from a survey would be transmitted to the attraction’s intelligent system. Such a system would also provide Disney a wealth of information about what attractions certain audiences visit, when, how often, and even what items a visitor may purchase during the day.
Website Conversion Path Optimization. A visit to a website [...]
When I was a kid, I went through an 80s music phase…well, some things never change. “People just love to play with words…” Know that song? Anyway…
One of the biggest pains of text mining and NLP is colloquialism: language that is appropriate only in casual conversation and not in formal speech or writing. Informal contractions (“gonna”, “wanna”, “whatcha”, “ain’t”, “y’all”) are colloquialisms and are everywhere on the Web. There is also a great deal of slang common on the Web, including acronyms (“LOL”, “WTF”) and emoticons or smilies that add sentiment to text. There is also a less commonly used slang called leetspeak that replaces letters with numbers or deliberately misspells words (“n00b” rather than “noob”, “pwned” instead of “owned” and “pr0n” instead of “porn”).
There are also regionalisms, which are a pain for semantic analysis but not so much for probabilistic analysis. Some examples are pancakes (“flapjacks”, “griddlecakes”) or carbonated beverages (“soda”, “pop”, “Coke”). Or, little did I know, “maple bars” vs. “Long Johns”. Now I am hungry. There are also words that have a formal and an informal meaning, such as “kid” (a young goat, or a child…same thing).
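To make the pain a bit more concrete, here is a minimal sketch of how one might normalize the informal contractions and the digit-for-letter flavor of leetspeak mentioned above before doing any real analysis. The lookup table, the digit map, and the `normalize` function are all hypothetical and written for illustration only; a real system would need far larger dictionaries and some context awareness (for example, “ain’t” can stand for “am not”, “are not” or “is not”).

```python
import re

# Hypothetical lookup table for illustration; a real one would be much larger.
# "ain't" is mapped to "is not" as a simplification.
CONTRACTIONS = {
    "gonna": "going to",
    "wanna": "want to",
    "whatcha": "what are you",
    "ain't": "is not",
    "y'all": "you all",
}

# Assumed digit-for-letter substitutions; not an exhaustive leetspeak map.
LEET = str.maketrans({"0": "o", "1": "l", "3": "e", "4": "a", "5": "s", "7": "t"})

# A token is a run of letters/digits, optionally followed by an apostrophe suffix.
TOKEN = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")


def normalize(text):
    """Lowercase the text, expand known contractions and undo simple leetspeak."""
    # Treat curly apostrophes like straight ones so "y’all" matches the table.
    tokens = TOKEN.findall(text.lower().replace("’", "'"))
    out = []
    for tok in tokens:
        # Only translate tokens that mix letters and digits, so "2010" is left alone.
        if any(c.isdigit() for c in tok) and any(c.isalpha() for c in tok):
            tok = tok.translate(LEET)
        out.append(CONTRACTIONS.get(tok, tok))
    return " ".join(out)


if __name__ == "__main__":
    print(normalize("Y’all ain’t gonna like this n00b"))
```

Note that a character-level map like this only handles substitutions such as “n00b”. Words like “pwned” or “pr0n” are misspellings or rearrangements, so they would need their own dictionary entries.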
Linguists consider colloquialisms different from slang. Slang is informal language used by a specific [...]