Senior Capstone, Looking Back

Abstract

An ever-present problem in the world of politics and governance in the United States is unfairly partisan congressional redistricting, often referred to as gerrymandering. One proposed method for removing gerrymandering is to use software to create nonpartisan, unbiased congressional district maps, and some researchers have already done work along these lines. This project is a tool for creating congressional maps while adjusting the weights of the various factors the tool takes into account, and for evaluating those maps with the Monte Carlo method, simulating thousands of elections to see how 'fair' each map is.

Software Architecture Diagram

As shown in the figure above, this software creates a congressional district map from pre-existing datasets (census data and voting history) together with user-defined factor weights; the resulting map then undergoes a Monte Carlo simulation of thousands of elections to evaluate its fairness. The census data feeds both the user-defined factor weighting and the estimate of each voter's likelihood of voting for either party (Republican or Democrat), based on race/ethnicity, income, age, gender, geographical location (urban, suburban, or rural), and educational attainment. The voting history consists of precinct-by-precinct results in congressional races, and it carries a heavy weight in the election simulation.
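
To make the evaluation stage concrete, here is a minimal sketch of the Monte Carlo step in Python. The precinct data, the Gaussian swing model, and all function names are assumptions for illustration only; the actual tool weights many more census-derived factors.

```python
import random

# Hypothetical input: each district is a list of precincts, each precinct a
# (dem_share, turnout) pair drawn from historical congressional results.
districts = {
    "D1": [(0.62, 4000), (0.45, 5200), (0.51, 3100)],
    "D2": [(0.38, 6100), (0.41, 2900), (0.55, 4800)],
}

def simulate_election(precincts, swing_sd=0.03):
    """Simulate one election: perturb each precinct's historical Democratic
    share with random swing, then tally the district-wide two-party vote."""
    dem_votes = total_votes = 0.0
    swing = random.gauss(0.0, swing_sd)          # uniform district-wide swing
    for dem_share, turnout in precincts:
        local = random.gauss(0.0, swing_sd / 2)  # precinct-level noise
        share = min(1.0, max(0.0, dem_share + swing + local))
        dem_votes += share * turnout
        total_votes += turnout
    return dem_votes / total_votes

def monte_carlo(districts, trials=10_000):
    """Run many simulated elections and report how often the Democratic
    candidate wins each district."""
    wins = {name: 0 for name in districts}
    for _ in range(trials):
        for name, precincts in districts.items():
            if simulate_election(precincts) > 0.5:
                wins[name] += 1
    return {name: wins[name] / trials for name in districts}

print(monte_carlo(districts))   # e.g. {'D1': 0.73, 'D2': 0.21}
```

Repeated over many trials, the per-district win rates give a seat-share distribution that can be compared against the statewide vote to judge how fair the map is.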

Research Paper

The current version of my research paper can be found here.

Software Demonstration Video

A demonstration of my software can be found here.

Current approach: GeoBurst method

The GeoBurst algorithm detects local news events by looking for spatiotemporal 'bursts' of activity: clusters of geo-tagged phrases that sit close together in both space and time.

Phrase network analysis has historically been able to link clouds of users; more recently, GPS in mobile devices has led many social media users to indicate their whereabouts on a reliable basis. Clusters appear not only in the spatial proximity of phrases but also in their temporal proximity, and current activity is compared against a recent history sampled from a 'sliding frame' of historic phrases.
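
As a rough illustration of that sliding-frame comparison, here is a much-simplified sketch; it is not the GeoBurst algorithm itself, and the frame length, burst ratio, and grid-cell binning are all assumed values.

```python
from collections import Counter, deque

# Each observation is reduced to (phrase, grid_cell), where grid_cell is a
# coarse spatial bin of the tweet's geo-tag.
WINDOW_COUNT = 24                      # past frames kept as 'recent history'
history = deque(maxlen=WINDOW_COUNT)   # sliding frame of past phrase counts

def detect_bursts(current_observations, ratio=3.0, min_count=10):
    """Flag (phrase, cell) pairs whose activity in the current frame is
    several times higher than their average over the sliding history."""
    current = Counter(current_observations)
    bursts = []
    for key, count in current.items():
        baseline = sum(frame.get(key, 0) for frame in history)
        baseline /= max(len(history), 1)
        if count >= min_count and count > ratio * max(baseline, 1.0):
            bursts.append((key, count, baseline))
    history.append(current)            # slide the frame forward
    return bursts

# Hypothetical usage: one frame of (phrase, cell) observations per hour.
frame = [("tornado", (40, -86))] * 25 + [("lunch", (40, -86))] * 5
print(detect_bursts(frame))
```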

Possible changes may emerge as I rework the sampling process to account for larger historical context from previous years of data, so that seasonal events, such as notable weather systems or sports, can be compared. In my research, the events are sports (specifically football), because sporting events on Twitter are temporal events that happen simultaneously across the USA, giving me many clusters to look at. Though politics would be a fun topic, it is not well resolved in my dataset, which dates to 2013.

The eventual aim of GeoBurst is to work toward disaster relief, though human behavior during some disasters arguably may not be directed toward social media. The objective is for existing cyberGIS infrastructure to benefit from social media and to inform disaster-response decision making.

In the meantime, it's time to get GeoBurst running and to start looking at the Twitter API.

Project Topic

For my Senior Research, my topic will be a data mining project using data collected from Twitter. Twitter's API offers a 1% sample of tweets within a spatial region (in my case, the continental U.S.A.) for users to collect. This data has been collected for over three years and represents well over one billion tweets. A significant percentage of these tweets contain at least one hashtag, which is one kind of data I will be looking at. The other datatype I am interested in is the geo-tag, an optional GPS coordinate that users may choose to include. Using machine learning algorithms, I hope to identify regular hashtags, in order to classify different kinds of signals based on hashtag frequency. The purpose is to see whether I can predict hashtag occurrence, or whether hashtags are too noisy to classify or group into reliable frequencies.
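
As a sketch of the extraction step, the following assumes Twitter's classic JSON layout (the `entities.hashtags` list, the optional `coordinates` field, and the `created_at` timestamp format); the helper names are hypothetical, not part of any existing pipeline.

```python
import json
from collections import Counter, defaultdict
from datetime import datetime

def parse_tweet(raw_line):
    """Pull out what this project cares about: hashtags and the optional
    geo-tag. Tweets without coordinates are kept, with location None."""
    tweet = json.loads(raw_line)
    tags = [h["text"].lower() for h in tweet["entities"]["hashtags"]]
    coords = tweet.get("coordinates")
    point = tuple(coords["coordinates"]) if coords else None  # (lon, lat)
    day = datetime.strptime(
        tweet["created_at"], "%a %b %d %H:%M:%S %z %Y"
    ).date()
    return day, tags, point

def daily_hashtag_counts(lines):
    """Build per-day hashtag frequencies -- the raw signal whose regular
    and irregular components the project aims to separate."""
    counts = defaultdict(Counter)
    for line in lines:
        day, tags, _ = parse_tweet(line)
        counts[day].update(tags)
    return counts
```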

My goal is to then study the noise, and to give that noise a geo-spatial context in which to understand the events which contributed to that noise.

Here’s a simple example:
Given that the State of Indiana tests tornado sirens on the first Tuesday of each month, hashtags similar to #tornado or #siren likely appear in greater numbers on the same days as the tests. This is a regular signal, predictable to within roughly ±6 hours, and it can be ignored. However, should a tornado strike on a different day, the sirens will go off, and #tornado or #siren might appear on an irregular day. The siren creates a spatial event that affects only the region that hears it, which might distinguish it from the more regular signals.
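
A minimal sketch of how a regular signal like the siren test might be separated from an irregular one, assuming hypothetical counts and a simple z-score test against comparable past days:

```python
from statistics import mean, stdev

def is_irregular(count_today, baseline_counts, z_threshold=3.0):
    """Compare today's count for a hashtag against its historical baseline
    for comparable days; a large z-score marks an irregular event."""
    mu = mean(baseline_counts)
    sigma = stdev(baseline_counts) or 1.0
    return (count_today - mu) / sigma > z_threshold

# Hypothetical counts of '#siren' on past first Tuesdays (the regular test).
first_tuesdays = [42, 55, 38, 61, 47, 50]
print(is_irregular(52, first_tuesdays))    # False: within the usual range
print(is_irregular(400, first_tuesdays))   # True: likely a real event
```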

At a larger scale, looking at the noisy hashtags might give insight into real-time, less predictable events. This can help de-obfuscate growing stories or events in real time, allowing us to find the meaningful information hiding under layers of signals.

I will be doing this research with David Barbella (Dave). Dave and I will be working with resources hosted by NCSA, including the CyberGIS supercomputer ROGER (an XSEDE resource, for those who are interested).

Research project ideas

Possible research: Spatial computational resource allocation

see also: CyberGIS’16 panel

Data structures are fundamental to the efficiency of algorithms for data transfer and storage, computation, and visualization. Parallel and distributed computing comes in many implementations whose purposes vary greatly. Centralized computing networks make new resources available to more institutions; however, the bridge between onsite spatial data collection and offsite computing is uncertain, even in terms of data structuring. Changes in resolution and computational needs have brought bitmap and vector formats closer together than ever, yet the software relies on centralized resources, few of which are designed for LiDAR terrain mapping.
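
One classic structure for storing spatial information is the point quadtree; the sketch below is a minimal illustration of the idea, not tied to any particular cyberGIS system or to LiDAR-scale data.

```python
class QuadTree:
    """Minimal point quadtree: region queries can skip whole quadrants
    instead of scanning every stored point."""

    def __init__(self, x0, y0, x1, y1, capacity=4):
        self.bounds = (x0, y0, x1, y1)
        self.capacity = capacity
        self.points = []
        self.children = None            # four sub-quadrants once split

    def insert(self, x, y):
        x0, y0, x1, y1 = self.bounds
        if not (x0 <= x < x1 and y0 <= y < y1):
            return False                # point lies outside this node
        if self.children is None:
            if len(self.points) < self.capacity:
                self.points.append((x, y))
                return True
            self._split()
        return any(child.insert(x, y) for child in self.children)

    def _split(self):
        x0, y0, x1, y1 = self.bounds
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        self.children = [
            QuadTree(x0, y0, mx, my, self.capacity),
            QuadTree(mx, y0, x1, my, self.capacity),
            QuadTree(x0, my, mx, y1, self.capacity),
            QuadTree(mx, my, x1, y1, self.capacity),
        ]
        for px, py in self.points:      # push existing points down
            any(child.insert(px, py) for child in self.children)
        self.points = []
```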

Research topics:

1: Study data structures for storing spatial information. Do aspects of existing structures resolve any of the problems users face?

2: Study whether spatial data compression could be implemented to improve computability and transfer (a toy sketch of one such scheme follows this list).

3: Study methods for data browsing and distributed storage solutions. Big data systems may limit the file sizes that remote end users can personally compute with; however, some data must still be represented on the remote end user's side.
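
As a toy illustration of topic 2, here is run-length encoding applied to a single raster row; real spatial compression schemes are considerably more sophisticated, and this is only a sketch of the basic idea.

```python
def rle_encode(row):
    """Run-length encode one raster row as (value, run_length) pairs.
    Elevation and land-cover rasters often have long constant runs,
    which is why RLE-style schemes are a common first step."""
    runs = []
    for value in row:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs

def rle_decode(runs):
    """Invert the encoding to recover the original row."""
    return [value for value, length in runs for _ in range(length)]

row = [0, 0, 0, 7, 7, 7, 7, 1, 0, 0]
encoded = rle_encode(row)
print(encoded)                     # [[0, 3], [7, 4], [1, 1], [0, 2]]
assert rle_decode(encoded) == row
```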

Panel overview

Panel: Future Directions of CyberGIS and Geospatial Data Science (Chair: Shaowen Wang)
Panelists: Budhendra Bhaduri, Mike Goodchild, Daniel S. Katz, Mansour Raad, Tapani Sarjakoski, and Judy —

Selected topics by Ben Liebersohn

Michael:

  • 3D domains are limited; more GIS integration with 3D rendering and simulation would be well received.
  • Support for different types of data, some of which are proprietary or otherwise have limited longevity.
  • Can we do analysis of data that needs 3D representation in order to compute simulations with it? Not everything is just landscapes (possibly meaning >3 dimensions? -B).
  • Decision support systems need more types of data. We need the integration with the applications as well.
  • Real-time data streams and distributed loads that serve local decisions at broader, better-networked scales.

Judy:

  • Integration needs quantification: what do we envision as the problem, and what is its scope? What technology (hardware, network) is needed?
  • What does all this data mean? What do we do about it? This gets you closer to the science policy area.


Paul:

“As an outsider, when I see what’s going on in this community I ask: what unique problems is this community facing versus common problems? I presented networking and cloud stuff you may not have seen before. The application can drive the network and the compute resources. Flexible and scalable networks. Maybe both sides can help one another.”