Currently working on tagging the individual data fields in each message entry, and saving the newly tagged tweets to a new directory.
current approach: GeoBurst method
The GeoBurst algorithm detects local news events by looking for spatiotemporal ‘bursts’ of activity. This cluster analysis uses methods which look at geo-tag clusters of phrases.
Phrase network analysis has been able to historically link user clouds, however the use of GPS in mobile devices has led many users of social media to indicate their wherabouts on a reliable basis. Clusters appear not only in the spatial proximity of phrases, but also in their temporal proximity. This is being compared to a recent history which is sampled from a ‘sliding frame’ of historic phrases.
Possible changes may emerge as I rework the sampling process, in order to account for larger historic contextualization from previous years of data, in order to compare seasonal events, such as famous weather systems or sports. In the case of my research, the events are sports (specifically Football). This is because sports are temporal events on Twitter which happen in a simultaneous manner in the USA, giving me lots of clusters to look at. Though politics would be a fun topic, it is not resolved well in my dataset which dates to 2013.
The pursuit of GeoBurst is eventually to work towards disaster relief, however the behaviour of humans may arguably not be directed to social media in some disasters. The objective being that existing cyberGIS infrastructure may benefit from social media and be used to inform disaster response decision making.
In the mean time, it’s time to get GeoBurst running and looking at the Twitter API.
Project Topic
For my Senior Research, my topic will be a data mining project using data collected from Twitter. Twitter’s API offers 1% of a spatial bandwidth (in my case, the continental U.S.A.) for users to collect. This data has been collected for over 3 years, and represents well over one billion tweets. Of these, a significant percentage of tweets contains at least one hashtag, which is one kind of data I will be looking at. The other datatype I have an interest in is geo-tags, which are an optional GPS coordinate which users may choose to include. Using machine learning algorithms, I hope to identify regular hashtags, in order to classify different kinds of signals based on hashtag frequency. The purpose of this is to see if I can predict hashtag occurrence, or whether hashtags are too noisy to classify or group into reliable frequencies.
My goal is to then study the noise, and to give that noise a geo-spatial context in which to understand the events which contributed to that noise.
Here’s a simple example:
Given that the State of Indiana tests tornado sirens on the first Tuesday of each month, it is likely that hashtags similar to #tornado or #siren appear in greater numbers on the same days as tests. This is a regular signal which could be reduced to a variability of +- 6 hours. This signal can be ignored. However, should a tornado strike on a different day, the sirens will go off, and #tornado or #siren might appear on an irregular day. The siren creates a spatial event which only affects the region which hears it, which might distinguish it from the more regular signals.
At a larger scale, looking at the noisy hashtags might give insights into real time, less predictable events. This can help de-obfuscate growing stories or events in real time, allowing us to find the meaningful information which hides under layers of signals.
I will be doing this research with David Barbella (Dave). Dave and I will be working with resources hosted by NCSA, including the CyberGIS Supercomputer ROGER (an XSEDE resource, for others that are interested).
Research project ideas
Possible research: Spatial computational resource allocation
see also: CyberGIS’16 panel
Data structures are fundamental to the efficiency of algorithms pertaining to transfer and storage, computation, and visualization. Parallel and distributed computing comes in many implementations whose purposes vary greatly. Using centralized computing networks, new resources are available to more institutions, however the bridge between onsite spatial data collection and offsite computing is uncertain, even in terms of data structuring. The changes in resolution and computational needs have brought bitmap and vector closer than ever, however the software resources rely on centralized resources, for which there are few designed for LiDAR terrain mapping.
Research topics:
1: Study data structures to store spatial information. Do aspects of existing structures resolve any problems faced by users?
2: Study whether spatial data compression could be implemented to improve computability and
3: Study methods for data browsing and distributed storage solutions. Big data systems may limit the filesizes remote end users can personally compute with, however some data must be represented by the remote end user.
Panel overview
Panel: Future Directions of CyberGIS and Geospatial Data Science (Chair: Shaowen Wang)
Panelists: Budhendra Bhaduri, Mike Goodchild, Daniel S. Katz, Mansour Raad, Tapani Sarjakoski, and Judy —
Selected topics by Ben Liebersohn
Michael:
- 3D domains are limited, more GIS integration with 3D rendition and simulation be well received.
- Support for different types of data, which is sometimes more proprietary or otherwise have limited longevity.
- Can we do analysis of data which we need 3D representation in order to compute simulations with it. Not everything is just landscapes (possibly meaning >3 dimensions? -B).
- Decision support systems need more types of data. We need the integration with the applications as well.
- Real time data streams and distributed loads which serve local decisions on broader, better networked scales.
Judy:
- Integration needs quantification of size, needs What do we envision as the problem, and the scope? What technology (hardware, network) is needed?
- What does all this data mean? What do we do about it? This gets you closer to the science policy area.
Paul:
“As an outsider, when I see what’s going on in this community I ask: what unique problems is this community facing versus common problems? I presented networking and cloud stuff you may not have seen before. The application can drive the network and the compute resources. Flexible and scalable networks. Maybe both sides can help one another.”