Project Topic

with No Comments

For my Senior Research, my topic will be a data mining project using data collected from Twitter. Twitter’s API offers 1% of a spatial bandwidth (in my case, the continental U.S.A.) for users to collect. This data has been collected for over 3 years, and represents well over one billion tweets. Of these, a significant percentage of tweets contains at least one hashtag, which is one kind of data I will be looking at. The other datatype I have an interest in is geo-tags, which are an optional GPS coordinate which users may choose to include. Using machine learning algorithms, I hope to identify regular hashtags, in order to classify different kinds of signals based on hashtag frequency. The purpose of this is to see if I can predict hashtag occurrence, or whether hashtags are too noisy to classify or group into reliable frequencies.

My goal is to then study the noise, and to give that noise a geo-spatial context in which to understand the events which contributed to that noise.

Here’s a simple example:
Given that the State of Indiana tests tornado sirens on the first Tuesday of each month, it is likely that hashtags similar to #tornado or #siren appear in greater numbers on the same days as tests. This is a regular signal which could be reduced to a variability of +- 6 hours. This signal can be ignored. However, should a tornado strike on a different day, the sirens will go off, and #tornado or #siren might appear on an irregular day. The siren creates a spatial event which only affects the region which hears it, which might distinguish it from the more regular signals.

At a larger scale, looking at the noisy hashtags might give insights into real time, less predictable events. This can help de-obfuscate growing stories or events in real time, allowing us to find the meaningful information which hides under layers of signals.

I will be doing this research with David Barbella (Dave). Dave and I will be working with resources hosted by NCSA, including the CyberGIS Supercomputer ROGER (an XSEDE resource, for others that are interested).

Leave a Reply