Senior Capstone



Welcome everyone! My name is Evan Griswold. I am a collegiate athlete majoring in Computer Science at Earlham College, with a strong interest in Cybersecurity & Networking/IT.


This project focuses on developing an AI-powered music recommendation system that draws on diverse music data sources. The main dataset is the Million Song Dataset (MSD), together with its sub-datasets covering cover songs, lyrics, user data, genre labels, and similarity. Combining content-based filtering with deep learning is expected to give the best results, providing users with personalized and diverse music recommendations. The objective is to create an efficient music recommendation system that enhances user music discovery and satisfaction.
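
As a rough illustration of the content-based filtering idea, the sketch below ranks songs by cosine similarity between feature vectors. The song names and the three-number feature vectors are hypothetical placeholders, not values taken from MSD, and a real system would use far richer features.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical per-song content features (for example tempo, energy, acousticness).
songs = {
    "song_a": [0.9, 0.1, 0.3],
    "song_b": [0.8, 0.2, 0.4],
    "song_c": [0.1, 0.9, 0.7],
}

def recommend(liked, catalog, k=1):
    """Return the k songs whose features are most similar to the liked song."""
    others = [name for name in catalog if name != liked]
    others.sort(key=lambda name: cosine(catalog[liked], catalog[name]), reverse=True)
    return others[:k]

top = recommend("song_a", songs)
print(top)
```

Here the recommendation depends only on the songs' own features, which is what distinguishes content-based filtering from approaches built on other users' listening histories.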

Karolayne Gaona


Pet identification is important for veterinary care, pet ownership,
and animal welfare and control. This proposal presents a solution
for identifying dog breeds using dog images. The proposed method
applies a deep learning approach to identify the dog breeds. The
starting point for this method is transfer learning: retraining
pre-trained convolutional neural networks on the Stanford Dogs
dataset. Three classification architectures will be used. These
classifiers will take images as input and generate feature matrices
based on their architecture. To create feature vectors, each
classifier applies two stages: 1) convolution, which generates
feature maps, and 2) max pooling, which extracts the most prominent
features from those maps. Data augmentation is applied to the
dataset to improve classification performance.
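
The two stages named above can be sketched in plain Python on a toy example. This is only an illustration under simplifying assumptions: the 5x5 "image" and the 2x2 kernel are made up, and a real pre-trained network learns many kernels and applies them to full-size color images.

```python
def conv2d(image, kernel):
    """Valid 2-D cross-correlation, the operation CNN layers call convolution."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; leftover rows or columns are dropped."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            max(feature_map[i * size + di][j * size + dj]
                for di in range(size) for dj in range(size))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# Toy image: dark on the left, bright on the right (a vertical edge).
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hypothetical kernel that responds where brightness increases left to right.
kernel = [[-1, 1], [-1, 1]]

feature_map = conv2d(image, kernel)  # stage 1: feature map
pooled = max_pool(feature_map)       # stage 2: keep the strongest responses
print(feature_map)
print(pooled)
```

The feature map is nonzero only where the edge sits, and pooling keeps that response while shrinking the map.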

Building a textual dataset for the Generative Design in Minecraft Competition



This paper presents the structure and design of a novel open-source textual database named the Brick Database for Minecraft settlements. Aimed at addressing the current lack of comprehensive datasets for model training in the Generative Design in Minecraft Competition (GDMC), this study explores the methodologies and potential challenges in creating a textual database and automatically detecting buildings and rooms.


Software demonstration:


Data diagram:

Final Three Pitches

  1. Machine learning for automated music composition

This project aims to build a system that uses machine learning algorithms to generate original music compositions based on existing ones. It will study machine learning techniques for pattern recognition.

The feedback I received suggests this pitch is more approachable than my first one, but I haven’t been able to get more detailed feedback because nobody I asked has experience in this area.

However, Doug asked whether there are machine learning packages that I could use to build my project, since I aim to produce a new product from existing ones. So, are there packages that can generate something new? It is possible to configure training so that a model produces something new.

I am currently getting in contact with the music department because there is a professor who teaches a class, “Making Music with Computer.” 

I learned about music composition software called “Max,” which is a visual programming language for music and multimedia. I also learned about authors such as David Cope, who has published a lot of work on artificial intelligence in music. I also investigated different types of music algorithms for composition and am currently looking for a specific problem or field to explore in this area.

Given all the complexity of this topic, I need to figure out how to keep things simple.

  • translational models
  • mathematical models
  • knowledge-based systems
  • grammars
  • optimization approaches
  • evolutionary methods
  • systems which learn
  • hybrid systems
  1. Detecting non-child-appropriate content in videos. 

Nowadays, it is very easy to access a lot of content online, including music, articles, games, videos, and movies. In the last couple of years, video platforms such as YouTube have become very important for education, entertainment, hobbies, etc. However, YouTube is a platform where anyone can upload content, and not all content is appropriate for everyone: videos may include sensitive content or bad language. This project will use deep learning and machine learning for video analysis. It will require image processing to handle video clips and can also draw on pattern recognition. The aim is to detect inappropriate imagery and audio in videos.

Challenge: video processing.

Dataset: video files tagged when they are not appropriate for children.

Potential dataset: (Restricted) (adult content detection)

YouTube Data API (metadata)

Good Resources
  1. Real-time sign language translator to text.

This project aims to use machine learning to accurately translate sign language to text in real time by capturing sign language gestures with a camera (most likely a computer’s webcam). The system should be able to recognize various signs. I would like to study computer vision techniques.

According to the input I received from Doug, this is a very ambitious project. We discussed which direction would be more feasible: translating from gestures to text or from text to gestures. The first option seems more complicated because computer vision is a vast area. Beyond understanding computer vision techniques, the system would also need to recognize sign language and its context. To train a machine learning model, I will need a lot of data that may not be easy to obtain.

This project will aim to have a user-friendly interface so that it is easy for the user to navigate the translator.
  1. Machine learning for the prediction of stroke diseases

Stroke is a cerebrovascular disease and a significant cause of death. It places significant health and financial burdens on the individual and the health system. There are many machine-learning models built to predict the risk of stroke or to automatically diagnose stroke using predictors such as lifestyle factors or images. However, there has not been an algorithm that can predict stroke using lab tests.

CS 388 Momo Hirose Pitches


Idea #1 

My research question: 

What algorithms can be used to simulate policy making?

Public policy should obviously be for all citizens, but people often feel that their voices and real needs are not reflected in policy. In my home country of Japan, I have witnessed many voices on Twitter (now called X) about the policies they want from the government. I would like to create software that can use government data to simulate and predict what the impact would be if the government actually implemented those policies. Such a tool would help in better evidence-based policy making.

Idea #2

My research question: 

How can we combine speech recognition, AI, and data extraction to automatically and instantly display the data mentioned in public discussions?

In Japan, when public policies are made, they have to be discussed in the Council or the Diet (Japanese national congress) in order to be approved. However, these discussions are often not easy to understand for the citizens for the following two reasons. 

  1. Politicians tend to criticise each other for trivial words or actions, so it is hard to understand the points of what is being discussed.
  2. What we, the citizens, see on television is only politicians arguing. Without visual aids it is even harder to follow the discussions. 

If speech recognition were used to automatically display the data (graphs and numbers) that politicians refer to, policy-making would be more evidence-based and discussions would be easier for citizens to understand.

  • An example of a specific issue I would like to look at is Japan’s declining birth rate – this is often discussed in the National Diet, but because it is such a broad topic, the discussions sometimes get lost without data.
    • * This link above is the website where you can access all the archives of live broadcasts of the National Diet.
    • * The videos cannot be downloaded (only available for watching in the web browser) 
      • This government official website shows text data of the questions and answers that were discussed in the National Diet. They are listed by topics and available in PDF and HTML format. (* Only in Japanese, but I can translate for the purpose of this research) 

Idea #3  Computational Modeling of National Budget

My research question: 

What algorithms can be used to better allocate the national budget?

What I realized while working with the government agencies in Japan last summer was that the policy-making process focuses heavily on the rationale of policies, and after they are approved, the evaluation is not as important as it should be. This could lead to a situation where the national budget continues to be allocated to policies that are not effective. If we can create a system to quantify the impact of each policy and also get comments from citizens, and use an algorithm to allocate the national budget, I think we could have a better cycle of public policy.

  • Since examining the entire national budget could be huge, here’s a specific topic I’m interested in investigating: 
    • The budget for developing the startup ecosystem in Japan
      • How is it currently allocated to each policy?
      • How effective are these policies?
      • How should the budget be allocated in the future, using computer modelling?
      • * Data and information on the development of the startup ecosystem and its budget can be found at the Japanese Ministry of Economy, Trade and Industry.

6 Annotated bibliography – Tien


Facial Expression Recognition: This pitch will focus on facial expression recognition (FER) using convolutional neural networks (CNNs), which can help people who have difficulty communicating, analyze human emotions, and support the development of AI that assists humans in their daily lives. Dataset: CK+: 48×48 pixel grayscale images with cropped faces; emotions include anger (45 samples), disgust (59 samples), fear (25 samples), happiness (69 samples), sadness (28 samples), surprise (83 samples), neutral (593 samples), and contempt (18 samples). Tufts Face Database: multi-modal face images, with more than 100,000 images of 74 females and 38 males from different age groups.
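
The CK+ sample counts listed above are heavily skewed toward the neutral class. The sketch below uses only those counts to show the skew and to compute inverse-frequency class weights. Weighting the classes this way is one common option and is an assumption here, not something the pitch commits to.

```python
# Per-class sample counts as listed in the pitch for CK+.
counts = {
    "anger": 45, "disgust": 59, "fear": 25, "happiness": 69,
    "sadness": 28, "surprise": 83, "neutral": 593, "contempt": 18,
}

total = sum(counts.values())
neutral_share = counts["neutral"] / total

# Inverse-frequency weights: rare classes get weights above 1,
# frequent classes get weights below 1.
weights = {label: total / (len(counts) * n) for label, n in counts.items()}

print(f"total samples: {total}")
print(f"neutral share: {neutral_share:.1%}")
for label, w in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{label:>10}: {w:.2f}")
```

With these counts the neutral class makes up roughly two thirds of the data, so a classifier could look accurate while rarely predicting the rarer emotions.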

Suppressing uncertainties for large-scale facial expression recognition

Kai Wang, Xiaojiang Peng, Jianfei Yang, Shijian Lu, and Yu Qiao. 2020. Suppressing uncertainties for large-scale facial expression recognition. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2020), 6897–6906.

This paper addresses a specific problem in large-scale FER, namely the “uncertainties caused by ambiguous facial expressions, low-quality facial images, and the subjectiveness of annotators,” by using the Self-Cure Network. Because it focuses on a problem in FER, it is a good example if we want our proposal to address a problem in the domain. Its citations also point to many good FER datasets and to prior FER algorithms, which are good resources for our proposal. Related work in the field is described in detail to showcase the problem that the paper focuses on. The proposed methods are based on the observation that CNNs can be uncertain about their predictions.

Deep-emotion: Facial expression recognition using attentional convolutional network

Shervin Minaee, Mehdi Minaei, and Amirali Abdolrashidi. 2021. Deep-Emotion: Facial Expression Recognition Using Attentional Convolutional Network. Sensors 21, 9, 3046.

This paper is about FER using attentional CNNs. It discusses the challenges of FER and how attentional CNNs can be used to address them. The authors propose a new attentional CNN architecture that is able to focus on important facial regions for emotion detection. Because I decided to use CNNs as the main method for processing the data in my paper, I can consider using the attentional CNN proposed here. The architecture consists of two main components: a feature extraction network and an attention network. The feature extraction network extracts features from the input image, while the attention network learns to focus on the most important facial regions for emotion detection.

Facial expression recognition: A survey

Yunxin Huang, Fei Chen, Shaohe Lv, and Xiaodong Wang. 2019. Facial expression recognition: A survey. Symmetry 11, 10, 1189. MDPI.

This paper is a survey of FER for visible facial expressions, which provides a lot of necessary background knowledge, such as terminology and difficulties in the field. It also provides a thorough beginning-to-end FER approach, including image processing, feature extraction, Gabor feature extraction, and expression classification. CNNs, Deep Belief Networks, Long Short-Term Memory, and Generative Adversarial Networks are introduced and cited with current works related to them.

Facial emotion recognition using transfer learning in the deep CNN

M. A. H. Akhand, Shuvendu Roy, Nazmul Siddique, Md Abdus Samad Kamal, and Tetsuya Shimamura. 2021. Facial emotion recognition using transfer learning in the deep CNN. Electronics 10, 9, 1036. MDPI.

This paper focuses on Deep CNNs and Transfer Learning (TL). CNNs are a popular technique for FER and one that I’m considering moving forward with. The paper also focuses on using these techniques to reduce development effort, which is a problem that none of the previous papers touched on. FER systems need to be able to handle occlusion, noise, and other challenges. Deep CNNs have been shown to be effective for FER tasks because they can learn complex features from images, which helps in identifying facial expressions. Transfer learning is a technique where a pre-trained model is used as a starting point for a new model. This can be useful for tasks where limited training data is available. The paper introduces the technique of adopting a pre-trained Deep CNN model, replacing its dense upper layer(s) with layers compatible with FER, and then fine-tuning the model with facial emotion data. This approach has been shown to achieve remarkable accuracy on both the KDEF and JAFFE facial image datasets.

Local multi-head channel self-attention for facial expression recognition

Roberto Pecoraro, Valerio Basile, and Viviana Bono. 2022. Local multi-head channel self-attention for facial expression recognition. Information 13, 9, 419. MDPI.

This paper proposed Local multi-head Channel self-attention (LHC) in the context of computer vision and in facial expression recognition. LHC is a very new approach in the field of FER. This paper will be useful because the LHC module is a type of self-attention module that can be integrated into CNNs, and it has been shown to improve the performance of CNNs on FER tasks. In a CNN, each layer learns to extract different features from the input image. The LHC module allows the CNN to learn long-range dependencies between features, which can be helpful for FER. 

Guided open vocabulary image captioning with constrained beam search

Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016. Guided open vocabulary image captioning with constrained beam search. arXiv preprint arXiv:1612.00576.

I read this paper for another class, and I think it would be interesting to incorporate it into my paper. The paper proposes a method for improving the performance of open vocabulary image captioning models, which are able to generate captions containing words that are not present in the training data. The authors introduce a technique called constrained beam search to guide the generation of captions. Constrained beam search forces the generated captions to include certain words. Since FER uses the words “happy”, “sad”, “angry”, and “neutral”, constrained beam search could be used to force the system to include at least one of these words in each prediction.

Sentiment analysis of online food reviews: When people buy products online, the one thing that they tend to look closely at is the reviews of the product. Having good reviews and understanding the needs of the customers through the reviews can help the business grow tremendously. A review usually comes in two parts: the rating and the review text. While text reviews make it easy to interpret customers’ overall experiences without bias, star ratings tend to be less informative, and it is up to the viewer to interpret them. This project aims to perform sentiment analysis on the reviews dataset so that it provides more accurate feedback on the products. The project could also analyze the negative language in recent reviews to let businesses know what they should focus on improving. Dataset: the Amazon Fine Food Reviews dataset, which contains reviews spanning more than 10 years (1999 to 2012). A fallback plan is to replace this dataset by using the DoorDash API to collect reviews of restaurants on DoorDash.

Comparative study of deep learning models for analyzing online restaurant reviews in the era of the COVID-19 pandemic

Yi Luo, and Xiaowei Xu. 2021. Comparative study of deep learning models for analyzing online restaurant reviews in the era of the COVID-19 pandemic. International Journal of Hospitality Management 94, 102849. Elsevier.

The paper analyzes four features of 112,412 restaurants on Yelp and compares the outcomes of several algorithms. The data are collected using a web scraper, which is a method that we proposed for our paper if we can’t find a more recent dataset. The authors also describe the data cleaning process, which includes two procedures: tokenization and stopword removal. They evaluate two machine learning algorithms (a gradient boosting decision tree classifier and a random forest classifier) and two deep learning algorithms (a bidirectional LSTM and a simple word-embedding model). They also discuss both theoretical and practical implications for future work, which is a good place to find motivation for our paper.
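
The two cleaning procedures mentioned above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the regular expression and the tiny stopword set are assumptions chosen to keep the example self-contained.

```python
import re

# Small hypothetical stopword list; real pipelines use much longer ones.
STOPWORDS = {"the", "a", "an", "and", "was", "is", "to", "of", "but", "it"}

def clean(review):
    """Lowercase the review, split it into word tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z']+", review.lower())  # tokenization
    return [t for t in tokens if t not in STOPWORDS]  # stopword removal

tokens = clean("The pasta was great, but the service is slow.")
print(tokens)
```

The cleaned token list is what a downstream classifier or embedding model would consume.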

A survey on sentiment analysis methods, applications, and challenges

Mayur Wankhade, Annavarapu Chandra Sekhara Rao, and Chaitanya Kulkarni. 2022. A survey on sentiment analysis methods, applications, and challenges. Artificial Intelligence Review 55, 7, 5731-5780.

This paper provides background on sentiment analysis, much like the survey for FER, and gives us a good start for understanding how to approach the task. It lays out a thorough beginning-to-end sentiment analysis process. I found it useful to read about the three main approaches: the Lexicon-Based Approach, the Machine Learning Approach, and the Hybrid Approach. It also introduces some common machine learning algorithms for sentiment analysis: Naive Bayes, support vector machines, and deep learning. Naive Bayes is easy to implement and can be trained on a relatively small dataset. Support vector machines can be trained to achieve high accuracy, but they can be more difficult to implement and require a larger dataset than Naive Bayes. Deep learning algorithms like recurrent neural networks and CNNs are more difficult to implement and require a large dataset to train.
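
To make the lexicon-based approach concrete, here is a minimal sketch that scores a review by counting words from positive and negative word lists. The two word lists are hypothetical and tiny. Real lexicons are much larger and usually assign graded scores.

```python
import re

# Hypothetical sentiment lexicons for illustration only.
POSITIVE = {"good", "great", "delicious", "friendly"}
NEGATIVE = {"bad", "slow", "cold", "rude"}

def lexicon_label(review):
    """Label a review by the balance of positive and negative lexicon words."""
    words = re.findall(r"[a-z']+", review.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(lexicon_label("Great food, but slow and rude service."))
```

Unlike the machine learning approaches, this needs no training data, but it misses negation and context, which is one motivation for hybrid methods.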

Sentiment analysis of restaurant reviews using hybrid classification method

M. Govindarajan. 2014. Sentiment analysis of restaurant reviews using hybrid classification method. Sentiment analysis of restaurant reviews using hybrid classification method 2, 1, 17-23.

This paper compares the effectiveness of different methods for classifying restaurant reviews and examines whether it is beneficial to use ensemble techniques. It includes methods like Naive Bayes, Support Vector Machine, and Genetic Algorithm. I think it gives a broad look at which classification methods are available, which can be helpful, but I am not sure how useful it will be for our paper, since we are not planning to explore classification of restaurant reviews so much as sentiment analysis of them. However, the number of papers available on the topic I proposed is quite limited. If we can get access to one of the papers I listed below, it could be a good resource, since those papers are more closely related to the direction in which I want to develop my proposal.

Sentiment analysis of customer reviews of food delivery services using deep learning and explainable artificial intelligence: Systematic review

Anirban Adak, Biswajeet Pradhan, and Nagesh Shukla. 2022. Sentiment analysis of customer reviews of food delivery services using deep learning and explainable artificial intelligence: Systematic review. Foods 11, 10, 1500. MDPI.

This paper focuses on AI and DL for sentiment analysis. The authors explain more about how AI, ML, and DL build on one another. I think this paper acts as an overall guide to the techniques that I should use for my paper. It includes information about different AI methods used in sentiment analysis of customer reviews for food delivery services, as well as the challenges of using DL techniques on customer reviews.

“His lack of a mask ruined everything.” Restaurant customer satisfaction during the COVID-19 outbreak: An analysis of Yelp review texts and star-ratings

Maria Kostromitina, Daniel Keller, Muhittin Cavusoglu, and Kyle Beloin. 2021. “His lack of a mask ruined everything.” Restaurant customer satisfaction during the COVID-19 outbreak: An analysis of Yelp review texts and star-ratings. International journal of hospitality management 98, 103048. Elsevier.

This paper is similar to what I might want to do for my pitch: it includes background information about the review text in relation to the choice of star-ratings. It also provides an interesting aspect of how Covid-19 affected the reviews of the customers, and how, depending on the situations, the reviews that the customers read might not help them make the right decision of choosing a good restaurant.

Sentiment analysis of customers’ reviews using a hybrid evolutionary svm-based approach in an imbalanced data distribution

Ruba Obiedat, Raneem Qaddoura, Al-Zoubi Ala’M, Laila Al-Qaisi, Osama Harfoushi, Mo’ath Alrefai, and Hossam Faris. 2022. Sentiment analysis of customers’ reviews using a hybrid evolutionary svm-based approach in an imbalanced data distribution. IEEE Access 10, 22260–22273. IEEE.

This paper, besides proposing new techniques for sentiment analysis, addresses the problem of imbalanced datasets, which is common in data analysis. Even though the paper doesn’t use an English-language dataset, it is useful to see how the authors deal with imbalanced data. They apply Naive Bayes, SVM, and a Genetic Algorithm to the dataset, then compare the results with their proposed hybrid model built with all three classification methods.

3 Pitches – Polished version


Pitch 1: Reducing encounters with local minima in heuristic search space. In the search space of heuristic algorithms, there are areas where the nodes appear to be closer to the goal state, but when the algorithm enters these areas, it actually spends more resources to reach the goal. These areas are called local minima. Local minima tend to arise more often when distance-to-go rather than cost-to-go is used as the heuristic value, which is an aspect that this project plans to explore. Avoiding local minima and understanding their behavior can help improve the performance of heuristic search algorithms. Overall, this project will explore the problem of local minima with beam search.

Risk: There is limited information on this topic and a lot of knowledge gaps that will need to be filled. 
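
To illustrate how a misleading heuristic region can hurt beam search, here is a small sketch on a made-up graph. The graph, the heuristic values, and the function name are all hypothetical, and the dead end is a deliberately extreme stand-in for a local minimum: node A looks closer to the goal than node B, yet only B leads there.

```python
def beam_search(graph, h, start, goal, beam_width):
    """Return a path from start to goal, or None if the beam runs dry."""
    if start == goal:
        return [start]
    beam = [[start]]
    visited = {start}
    while beam:
        candidates = []
        for path in beam:
            for nxt in graph[path[-1]]:
                if nxt == goal:
                    return path + [nxt]
                if nxt not in visited:
                    visited.add(nxt)
                    candidates.append(path + [nxt])
        # Keep only the paths whose last node looks closest to the goal.
        candidates.sort(key=lambda p: h[p[-1]])
        beam = candidates[:beam_width]
    return None

# Hypothetical graph: A and A2 have low heuristic values but lead nowhere.
graph = {"S": ["A", "B"], "A": ["A2"], "A2": [], "B": ["C"], "C": ["G"], "G": []}
h = {"S": 3, "A": 1, "A2": 1, "B": 3, "C": 2, "G": 0}

narrow = beam_search(graph, h, "S", "G", beam_width=1)
wide = beam_search(graph, h, "S", "G", beam_width=2)
print(narrow, wide)
```

With a beam width of 1 the search commits to the attractive-looking branch and fails, while a width of 2 keeps the other branch alive and reaches the goal.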

Pitch 2: Facial expression recognition: classify facial expressions using machine learning. Problem: help people communicate, analyze human emotions, and support the development of AI that assists humans in their daily lives. Dataset: CK+: 48×48 pixel grayscale images with cropped faces; emotions include anger (45 samples), disgust (59 samples), fear (25 samples), happiness (69 samples), sadness (28 samples), surprise (83 samples), neutral (593 samples), and contempt (18 samples). Tufts Face Database: multi-modal face images, with more than 100,000 images of 74 females and 38 males from different age groups.

Risk: Finding a suitable machine learning algorithm and processing image data.

Pitch 3: Sentiment analysis of online food reviews. When people buy products online, the one thing that they tend to look closely at is the reviews of the product. Having good reviews and understanding the needs of the customers through the reviews can help the business grow tremendously. A review usually comes in two parts: the rating and the review text. Users often mistakenly choose the wrong rating for a product, so the review text is more reliable in most situations. This project aims to perform sentiment analysis on the reviews dataset so that it provides more accurate feedback on the products. The project could also analyze the negative language in recent reviews to let businesses know what they should focus on improving. Dataset: the Amazon Fine Food Reviews dataset, which contains reviews spanning more than 10 years (1999 to 2012). A fallback plan is to replace this dataset by using the DoorDash API to collect reviews of restaurants on DoorDash.

Risk: The dataset is not recent, so I will keep looking for a more recent one.

Annotated Bibliography

  1. Pitch #1: “Online Child Safety: Using deep learning to detect inappropriate video content for children.”: 

In today’s digital world, where online content is easily accessible, my project aims to use deep learning and machine learning techniques, such as Neural Networks, to tackle the challenge of inappropriate content on video platforms like YouTube. Using advanced image processing and pattern recognition, my goal is to detect and flag unsuitable imagery and audio within videos. With a focus on creating a child-safe environment, I aim to build a comprehensive dataset of video file tags, making online platforms more secure for users of all ages and fostering a responsible and enjoyable digital experience. 

Research question: To improve the accuracy and efficiency of content classification on YouTube, how do different deep learning architectures, such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), perform compared with human judgment in detecting inappropriate content? What are the trade-offs in terms of computational resources and real-time processing?

Datasets: (Restricted) (adult content detection)

YouTube Data API (metadata)

Good Resources:
  1. Bringing the Kid back into YouTube Kids: Detecting Inappropriate Content on Video Streaming Platforms
[1] Tahir, Rashid, Faizan Ahmed, Hammas Saeed, Shiza Ali, Fareed Zaffar, and Christo Wilson. 2019. “Bringing the Kid Back into YouTube Kids: Detecting Inappropriate Content on Video Streaming Platforms.” In Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, 464-469.

  • Nowadays, YouTube (YT) has become a very popular platform used by a lot of people, including children. YT has a section for children called YT Kids, but content that is inappropriate for children sometimes leaks onto the platform. In this research, the authors collected data manually from a range of videos and used deep learning as a filtering mechanism.
  • The researchers designed and implemented a deep learning model specifically for the task of content classification. They used a variety of deep learning models, such as Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN), to analyze elements of the content of the videos, including visual, audio, and motion. These models were chosen for their ability to learn complex patterns from raw data.
  • Managing video data is more complex than managing photos, for example, because there are more elements to evaluate, so they used a pre-trained CNN (VGG19) to extract relevant visual features from individual frames in a video. They used a type of RNN, a bidirectional long short-term memory (LSTM) network, to capture temporal relations in the motion features. Finally, to process the audio, they used spectrograms to extract relevant audio features, which deep learning models can then analyze to detect inappropriate language.
  • For processing and classification, the extracted motion, audio, and visual elements of each scene were combined into a single feature vector for that scene. This feature vector is then fed into a fully connected neural network layer, followed by a softmax layer for classification. Deep learning models can effectively process and combine all these features to make decisions about the content of each scene.
  • This paper is relevant to my pitch because my project also aims to use deep learning and machine learning for video analysis, and the paper deals with the same issue of content filtering and detection.
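
The fusion step described in the bullets above (concatenate per-scene features, apply a fully connected layer, then softmax) can be sketched in plain Python. All numbers here are made-up placeholders: the feature values, the layer weights, and the two-class output are assumptions for illustration, not the paper's trained model or its real feature dimensions.

```python
import math

def dense(vector, weights, bias):
    """Fully connected layer: one weighted sum per output class."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical per-scene features from the three branches.
visual = [0.2, 0.9]
motion = [0.1]
audio = [0.7, 0.3]
scene = visual + motion + audio  # single concatenated feature vector

# Hypothetical weights for two classes (appropriate, inappropriate).
weights = [
    [0.5, -0.2, 0.1, -0.4, 0.3],
    [-0.3, 0.6, 0.2, 0.8, -0.1],
]
bias = [0.0, 0.0]

probs = softmax(dense(scene, weights, bias))
print(probs)
```

In a trained network the weights would be learned from labeled scenes, so the printed probabilities here carry no meaning beyond showing the data flow.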
  1.  A Deep Learning-Based Approach for Inappropriate Content Detection and Classification of YouTube Videos
[2] Yousaf, Kanwal, and Tabassam Nawaz. 2022. “A Deep Learning-Based Approach for Inappropriate Content Detection and Classification of YouTube Videos.” IEEE Access 10 (2022): 16283-16298.

  • This paper describes YT as a platform that attracts malicious uploaders who share inappropriate content targeting children. To address this issue, the article proposes a deep learning-based architecture for detecting and classifying inappropriate video content. It uses an ImageNet pre-trained CNN model, specifically EfficientNet-B7, to extract video descriptors. These descriptors are then processed by a bidirectional long short-term memory (BiLSTM) network to learn effective video representations and perform video classification. An attention mechanism is also incorporated into the network.
  • Their proposed model uses a dataset of cartoon clips collected from YouTube videos, and the results are compared with traditional machine learning classifiers.
  • This paper’s methodology has three steps. First, in video processing, YouTube videos are split into small clips and labeled through manual annotation. Then, the CNN model pre-trained on ImageNet is used to extract deep features from the processed video frames. Finally, these features are fed into the BiLSTM network, which learns video representations. The output is then used for multiclass video classification.
  • The data set of this paper contained a total of 111,561 video clips. Out of that number, 57,908 clips were labeled as safe, 27,003 as sexual nudity, and 26,650 as fantasy violence. This distribution was made to ensure a balanced dataset for training and evaluation. 
  • The researchers compared different classifiers, such as Support Vector Machines (SVM), k-nearest Neighbors (KNN), and Random Forest that use EfficientNet. They found out that these machine learning classifiers were outperformed by their deep learning model.
  • This article mentions that, in future work, the model could be improved further by combining temporal and spatial features and by increasing the number of classes.
  • My project focuses on the same problem domain as the article’s. My pitch attempts to use machine learning and deep learning as well. This article could potentially provide me with guidance for dataset collection.
  • This article cited the previous article, “Bringing the Kid Back into YouTube Kids: Detecting Inappropriate Content on Video Streaming Platforms.”
  • This article is directly related to my area of research. I could use some of its methodologies as a framework because they rely on a CNN model.
  1. Evaluating YouTube videos for young children
[3] Neumann, Michelle M, and Christothea Herodotou. 2020. “Evaluating YouTube Videos for Young Children.” Education and Information Technologies 25 (2020): 4459-4475.

  • This article discusses the evaluation of YT videos for small children. Its main concern is the quality and appropriateness of video content for children. The paper proposes a framework for evaluating the quality of YT videos for young children and says that this information can be used by educators and YouTube creators. I found this information relevant to my pitch because it can provide me guidance for video classification.
  • The article discusses YT's official labeling for the classification of children’s content. The authors argue that these classifications do not necessarily consider how children learn from screen media.
  • This study examines children’s interaction with screen media, develops a rubric for YT videos, and applies this rubric to YT videos.
  • The rubric they made is based on criteria related to age appropriateness, content quality, design features, and learning objectives.
  • They evaluated five videos with different content using human judgment.
  • This paper can provide a framework for a clear understanding of what is appropriate for children in YT videos.
  • This paper discusses the use of Cohen’s Kappa to measure inter-rater reliability. Since my project will probably deal with a large dataset, I can use a similar statistical method to assess agreement between automated judgment and human judgment.
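
As a rough illustration of the inter-rater idea in the last point, Cohen’s Kappa compares the observed agreement between two raters with the agreement expected by chance. The sketch below is a minimal pure-Python version; the label values and the "human" and "model" lists are made up for illustration and are not data from the article.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items where the two raters match.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: computed from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Hypothetical labels: a human annotator vs. an automated classifier.
human = ["safe", "safe", "unsafe", "safe", "unsafe", "unsafe"]
model = ["safe", "unsafe", "unsafe", "safe", "unsafe", "safe"]
kappa = cohens_kappa(human, model)
```

A value near 1 means strong agreement beyond chance, while a value near 0 means the agreement is about what chance alone would give.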
  1. A Deep-Learning Framework For Accurate And Robust Detection Of Adult Content 
[4] Kusrini, Kusrini, Arief Setyanto, I Made Artha Agastya, Hartatik Hartatik, Krishna Chandramouli, and Ebroul Izquierdo. 2022. “A Deep-learning Framework for Accurate and Robust Detection of Adult Content.” Journal of Engineering Science and Technology. Engg Journals Publication.

  • This paper highlights the importance of filtering sensitive media, such as pornography and explicit content in general, on the internet. It talks about the negative consequences of exposing children to this content. 
  • The paper discusses older approaches to this issue, such as IP-based blocking, textual analysis, and statistical color models for skin detection. It then introduces the idea of transitioning from handcrafted features to deep neural networks.
  • This paper proposes a deep-learning framework that uses the spatial and temporal characteristics of video sequences for adult content detection. The framework is based on a CNN architecture called Inception-v3 for spatial feature extraction. Temporal characteristics are modeled using long short-term memory (LSTM). The authors compared different deep-learning network architectures such as VGGNet, ResNet, and Inception. Inception-v3 was selected as the most efficient for feature extraction.
  • The dataset is the NPDI dataset, which contains explicit (pornographic) content as well as non-explicit content. This dataset is used for training and classification.
  • After extracting features from images or video frames using color distribution, shape information, skin likelihood, etc., clustering techniques (k-means) and classifiers (Support Vector Machines, SVM, etc.) are used to separate explicit from non-explicit content. 
  • They used HueSIFT and space-time interest points with SVM and Random forest for better classification using statistical machine learning algorithms. 
  • The paper also mentions that deep learning models such as AlexNet and GoogLeNet have higher accuracy compared to traditional machine learning methods.
  • The results show that the deep learning-based approach outperforms previous methods and achieves high accuracy, mostly using LSTM networks.
  • This paper has great visuals that may help orient me when creating the visuals and graphs for my paper. I probably won’t be using the dataset from this paper, but it is a great source for better understanding how LSTM works and how it could be combined with a CNN architecture.
  1. Disturbed YouTube for Kids: Characterizing and Detecting Inappropriate Videos Targeting Young Children
[5] Papadamou, Kostantinos, Antonis Papasavva, Savvas Zannettou, Jeremy Blackburn, Nicolas Kourtellis, Ilias Leontiadis, Gianluca Stringhini, and Michael Sirivianos. 2020. “Disturbed YouTube for Kids: Characterizing and Detecting Inappropriate Videos Targeting Young Children.” In Proceedings of the International AAAI Conference on Web and Social Media, 14:522-533.

  • This paper mentions that many YT channels are for young children, but there is a significant amount of inappropriate content that is also targeted at young children.
  • The YT recommendation system sometimes suggests content that is not suitable for children.
  • This research collected a dataset of videos, including inappropriate videos, videos for children, and random videos, and classified them as suitable, disturbing, restricted, or irrelevant.
  • A deep learning classifier is developed to detect disturbing videos automatically.
  • The dataset is collected using the YT Data API and multiple approaches to obtain seed videos. Manual annotation is performed to label videos.
  • The analysis uses seed videos that are appropriate for young children as a starting point. From each seed, a video is chosen at random from those recommended by the platform, and the process repeats from the chosen video. A trained binary classifier predicts whether each recommended video is appropriate, and the result is tracked at every step. The authors analyze these random walks and produce statistics.
  • The authors trained a deep-learning classifier to classify the videos automatically. To train the classifier, the authors used a labeled dataset of videos.
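
The random-walk analysis described above can be sketched in a few lines. This is only an illustration of the idea, not the authors' code: the recommendation graph, the labels, the walk length, and the function names are all hypothetical, and a dictionary lookup stands in for the trained binary classifier.

```python
import random

def random_walk(recommendations, is_appropriate, seed_video, hops, rng):
    """Follow random recommendations from a seed and record each verdict."""
    current = seed_video
    verdicts = []
    for _ in range(hops):
        candidates = recommendations.get(current, [])
        if not candidates:
            break  # dead end: no recommendations for this video
        current = rng.choice(candidates)
        verdicts.append(is_appropriate(current))
    return verdicts

# Hypothetical toy recommendation graph and labels (not data from the paper).
toy_graph = {
    "seed": ["a", "b"],
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}
toy_labels = {"a": True, "b": True, "c": False}

rng = random.Random(0)
walks = [random_walk(toy_graph, toy_labels.get, "seed", 5, rng) for _ in range(100)]
# Share of walks that reach at least one inappropriate video.
share_hit = sum(1 for walk in walks if False in walk) / len(walks)
```

Statistics such as `share_hit` summarize how often a walk that starts from appropriate content drifts to inappropriate content within a given number of hops.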
  1. Very Deep Convolutional Networks For Large-Scale Image Recognition
[6] Simonyan, Karen, and Andrew Zisserman. 2014. “Very Deep Convolutional Networks for Large-Scale Image Recognition.” arXiv preprint arXiv:1409.1556.

  • This paper talks about the impact of convolutional neural network (ConvNet) depth in the context of large-scale image recognition. It explores the use of deep networks with small (3×3) convolutional filters and shows that increasing network depth improves performance.
  • They found that using small (3×3) convolutional filters and pushing the network depth to 16-19 layers can lead to a huge improvement. 
  • This paper focuses on ConvNets architecture and explores increasing the depth. 
  • Throughout the paper, the authors justify their decisions about input size, convolutional layers, etc. The training process involves optimizing the multinomial logistic regression objective using mini-batch gradient descent with momentum. Regularization techniques such as weight decay and dropout are used.
  • At test time, a trained ConvNet takes an input image for classification. The input is rescaled so that its smallest side equals a predefined value Q, the test scale.
  • The network is applied densely over the rescaled test image. The fully connected layers are converted to convolutional layers, and the resulting fully convolutional network is applied to the whole uncropped image.
  • This paper, like the other papers, will give me a better understanding of using CNN models on a big data set of videos. I want to review this paper to study how the authors increased the depth of the layers.
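
One reason small filters make deeper networks practical is that a stack of three 3×3 convolutions covers the same 7×7 receptive field as a single 7×7 convolution while using fewer weights. The short calculation below illustrates this; the channel width of 64 is just an example value, biases are ignored, and the helper names are my own.

```python
def conv_weights(kernel_size, channels, layers=1):
    """Weights in a stack of conv layers with `channels` in and out (biases ignored)."""
    return layers * kernel_size * kernel_size * channels * channels

def receptive_field(kernel_size, layers):
    """Receptive field of `layers` stacked stride-1 convolutions."""
    return 1 + layers * (kernel_size - 1)

channels = 64  # illustrative channel width
stacked = conv_weights(3, channels, layers=3)   # three 3x3 layers: 27 * C^2 weights
single = conv_weights(7, channels)              # one 7x7 layer: 49 * C^2 weights
```

The stacked version also applies a non-linearity after each of its three layers, rather than just once.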

2. Pitch #2: Machine learning for music recognition: 

This project aims to use machine learning for music genre recognition. In a world where music is everywhere, music recognition can be a bit of a challenge. I will use machine learning techniques to teach the system to recognize the unique patterns and characteristics that define a music genre. I may try to replicate well-known models for music recognition and compare two of them to see which one is more efficient.

Research question: To optimize music genre recognition using machine learning, can we compare and evaluate the performance of traditional machine learning classifiers, such as logistic regression and support vector machines (SVMs), with deep learning architectures like convolutional neural networks (CNNs) using a diverse dataset? How do these algorithms perform in terms of accuracy?

Data sets:
  1. A Machine Learning Approach to Musical Style Recognition
[1] Dannenberg, Roger B, Belinda Thom, and David Watson. 1997. “A Machine Learning Approach to Musical Style Recognition.” Carnegie Mellon University.

  • This article is about the applications of machine learning for music recognition. The main problem it addresses is the challenge of making computers understand and perceive music beyond low-level features like pitch and tempo.
  • Research has largely avoided high-level inference, so building music-style classifiers is a challenge.
  • This article proposes the development of a machine learning classifier for music style recognition. 
  • The data collection in this research is by recording trumpet performances in different styles and then labeling them according to the style. 
  • The machine learning techniques used are naive Bayesian classifier, linear classifier, and neural networks to build the style classifier. 
  • They trained the classifiers on a portion of the data and tested them on the rest.
  • There are a lot of challenges because music can be multifaceted. Also, selecting features by trying to capture the essence is not easy.
  • There are a lot of overlapping styles in music.
  • The training of the data can be time-consuming and requires a lot of human effort.
  • This article could be relevant to my work to provide context and framework. 
  1. Music Genre Classification using Machine Learning Algorithms: A comparison
[1] Chillara, Snigdha, AS Kavitha, Shwetha A Neginhal, Shreya Haldia, and KS Vidyullatha. 2019. “Music Genre Classification Using Machine Learning Algorithms: A Comparison.” International Research Journal of Engineering and Technology 6, no. 5 (2019): 851-858.

  • They built multiple classification models and trained them over the Free Music Archive (FMA) dataset without hand-crafted features. 
  • To classify the genre of a song, previous work had used both neural networks (NN) and traditional machine learning. A neural network has to be trained end to end using spectrograms of the audio signals. Machine learning algorithms, like logistic regression and random forest, use hand-crafted features from the time and frequency domains. Manually extracted features like Mel-Frequency Cepstral Coefficients (MFCC), Chroma Features, Spectral Centroid, etc., are used to classify the music into its genres using ML algorithms like Logistic Regression, Random Forest, Gradient Boosting (XGB), and Support Vector Machines (SVM). The VGG-16 CNN model gave the highest accuracy.
  • With deep learning algorithms, we can achieve the task of music genre classification without hand-crafted features. 
  • Convolutional neural networks (CNNs) are a great choice for classifying images. The 3-channel (R-G-B) matrix of an image is given to a CNN, which then trains itself on those images. 
  • In this project, the sound wave is represented as a spectrogram, computed using the Short Time Fourier Transform (STFT), which can then be treated as an image. The CNN processes the spectrograms by capturing patterns. Many other elements are also extracted: statistical moments, zero-crossing rate, root mean square energy, tempo, etc.
  • CNN accuracy was about 88%, but models such as LR and ANN may have higher accuracy.
  • The dataset used is the Free Music Archive (FMA) dataset.
  • This paper is relevant to my project because it provides a very detailed methodology that I may use as guidance.
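
To make the spectrogram step more concrete, here is a minimal pure-Python sketch of a Short Time Fourier Transform: the signal is cut into overlapping frames, each frame is windowed, and a discrete Fourier transform gives the magnitude at each frequency. The frame size, hop, sample rate, and test tone are assumptions chosen for illustration; a real project would more likely use an audio library with a fast FFT.

```python
import cmath
import math

def stft_magnitudes(signal, frame_size, hop):
    """Magnitude spectrogram: one list of frequency-bin magnitudes per frame."""
    # Hann window reduces spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size) for n in range(frame_size)]
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_size)]
        # Naive DFT, keeping only the non-negative frequency bins.
        bins = []
        for k in range(frame_size // 2 + 1):
            coeff = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                        for n in range(frame_size))
            bins.append(abs(coeff))
        frames.append(bins)
    return frames

# Illustrative input: a pure tone at 1000 Hz sampled at 8000 Hz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(1024)]
spectrogram = stft_magnitudes(tone, frame_size=64, hop=32)
# Bin k corresponds to k * sample_rate / frame_size Hz.
peak_bin = max(range(len(spectrogram[0])), key=lambda k: spectrogram[0][k])
```

Stacking the per-frame magnitude lists side by side gives the 2D array that a CNN can treat as an image.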
  1. Music instrument recognition using deep convolutional neural networks
[1] Solanki, Arun, and Sachin Pandey. 2022. “Music Instrument Recognition Using Deep Convolutional Neural Networks.” International Journal of Information Technology 14, no. 3 (2022): 1659-1668.

  • Identifying individual musical instruments within a mix of many instruments is very complicated. The research uses a deep CNN to try to achieve this task. The labeled instrument dataset is fed into the network, which can then estimate multiple instruments from audio signals of varying lengths.
  • Mel spectrogram representation is used to convert audio data into matrix format.
  • The neural network in this research has 8 layers. A softmax function is also used at the output to turn the network's scores into class probabilities for identification.
  • They use CNN for its convolutional layers, pooling layer, etc.
  • This article could be relevant to my project because the techniques for the extraction of relevant data from the music data set could help me in editing my data. Also, the information about deep learning, including CNN, could be relevant as a guide. I could explore similar research using a different model.
  • The conversion of audio data into spectrograms is also essential information for my project.
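
As a small illustration of the softmax function mentioned above, the sketch below converts raw network scores into probabilities that sum to 1. The instrument names and score values are made up and do not come from the article.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw network outputs for three instrument classes.
instruments = ["piano", "violin", "flute"]
probs = softmax([2.0, 1.0, 0.1])
predicted = instruments[probs.index(max(probs))]
```

The class with the largest score always receives the largest probability, so softmax does not change the prediction; it only makes the outputs easier to interpret.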
  1. Music genre classification and music recommendation by using deep learning
[1] Elbir, Ahmet, and Nizamettin Aydin. 2020. “Music Genre Classification and Music Recommendation by Using Deep Learning.” Electronics Letters 56, no. 12 (2020): 627-629.

  • This paper talks about the importance of music in people’s lives and the need to classify music by genre. 
  • This paper reviews previous work on music classification, including time-frequency analysis, Mel Frequency Cepstral Coefficients, wavelet transformations, and support vector machines. The authors then introduce a convolutional neural network for extracting features from raw music spectrograms and mel spectrograms. They compared the performance of CNN-based methods with traditional processing methods.
  • In this paper, the author proposes MusicRecNet, a new CNN-based model for music genre classification.
  • They claimed that this model outperformed their previous classifier. 
  • The dataset used in this research is the GTZAN dataset, which contains 1000 music samples from ten genres to evaluate the model.
  • Each music sample is divided into six 5-second parts, from which mel spectrograms are generated. MusicRecNet, with three layers and additional features such as dropout, is trained on these spectrograms.
  • They used various classification algorithms such as MLP, logistic regression, random forest, LDA, KNN, and SVM applied to the vectors. 
  • The accuracy of MusicRecNet is 81.8%, but when used with SVM, it is 97.6%. 
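
The segmentation step can be sketched as follows: a 30-second clip is cut into six consecutive 5-second parts before any spectrogram is computed. The sample rate of 22050 Hz and the silent placeholder clip are assumptions for illustration, and the function name is my own.

```python
def split_into_segments(samples, sample_rate, segment_seconds):
    """Split a clip into consecutive, non-overlapping fixed-length segments."""
    segment_len = int(sample_rate * segment_seconds)
    # Drop any trailing partial segment so every piece has the same length.
    usable = len(samples) - len(samples) % segment_len
    return [samples[i:i + segment_len] for i in range(0, usable, segment_len)]

# Illustrative 30-second clip of silence; the 22050 Hz rate is an assumption.
sample_rate = 22050
clip = [0.0] * (30 * sample_rate)
segments = split_into_segments(clip, sample_rate, segment_seconds=5)
```

Splitting this way multiplies the number of training examples by six, since each part keeps the genre label of the clip it came from.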
  1. Music Genre Classification: A Comparative Study between Deep-Learning and Traditional Machine Learning Approaches
[1] Lau, Dhevan S, and Ritesh Ajoodha. 2022. “Music Genre Classification: A Comparative Study Between Deep Learning and Traditional Machine Learning Approaches.” In Proceedings of the Sixth International Congress on Information and Communication Technology: ICICT 2021, London, Volume 4, 239-247. Springer.

  • This paper compares the deep learning convolutional neural network approach with five traditional off-the-shelf classifiers using spectrograms and content-based features. This experiment uses the GTZAN dataset, and the reported result is 66% accuracy.
  • The paper introduces the importance of music and the role of genres in categorizing music. Music genre classification is identified as an Automatic Music Classification problem and part of Automatic Music Retrieval. 
  • The study uses automatic music genre classification using spectrogram images and content-based features extracted from audio signals. It uses deep learning CNN and traditional classifiers such as logistic regression, k-nearest neighbors, support vector machines, random forest, and multilayer perceptrons. 
  • The dataset has 1000 songs, each 30 seconds long. It includes raw audio files, extracted mel frequency cepstral coefficients spectrograms, and content-based features.
  • For training, the data set was duplicated and divided into 10,000 3-second song pieces.
  • The spectrogram size was 217×315 pixels, and 57 features were selected, such as chroma short-time Fourier transform, root mean square error, spectral centroid, harmony, tempo, zero crossing rate, etc.
  • A CNN was then used for deep learning, consisting of an input layer followed by five convolutional blocks. Each block had specific layers. The 2D matrix was flattened into a 1D array.
  • The traditional machine learning models were implemented using the scikit-learn library.
  1. Multimodal Deep Learning for Music Genre Classification
[1] Oramas, Sergio, Francesco Barbieri, Oriol Nieto Caballero, and Xavier Serra. 2018. “Multimodal Deep Learning for Music Genre Classification.” Transactions of the International Society for Music Information Retrieval 1 (1): 4-21. Ubiquity Press.

  • This article discusses music genre classification using a multimodal deep-learning approach. They aim to develop a system that can automatically assign genre labels to music based on different types of data, including audio tracks, text reviews, and cover art. 
  • The authors proposed a two-step approach: 1) train a neural network for each modality (audio, text, etc.) on the genre classification task, and 2) extract intermediate representations from each network and combine them in a multimodal approach.
  • Audio representations are learned from audio spectrograms using a CNN, text representations from music-related texts using a feedforward network, and visual representations using a residual network.
  • The model uses weight matrices and hyperbolic tangent functions to embed audio and visual representations into the shared space.
  • The dataset used is the Million Song Dataset, which consists of metadata. The data is split into training, validation, and test sets.
  • According to the authors, combining the three modalities outperforms individual modalities. 
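
The shared-space embedding mentioned above can be pictured as multiplying each modality's feature vector by its own weight matrix and passing the result through a hyperbolic tangent. The sketch below only illustrates that operation: the vector sizes, weight values, and names are hypothetical, and in the real model the weights would be learned during training.

```python
import math

def embed(vector, weights):
    """Project a feature vector into the shared space: tanh(W x)."""
    return [math.tanh(sum(w * x for w, x in zip(row, vector))) for row in weights]

# Hypothetical tiny feature vectors and weight matrices (in practice these are learned).
audio_features = [0.2, -0.5, 1.0]
visual_features = [0.7, 0.1]
w_audio = [[0.5, 0.0, 0.1], [-0.3, 0.2, 0.4]]   # maps 3 dims -> 2 dims
w_visual = [[0.6, -0.2], [0.1, 0.9]]            # maps 2 dims -> 2 dims

audio_embedding = embed(audio_features, w_audio)
visual_embedding = embed(visual_features, w_visual)
```

Because both embeddings end up with the same number of dimensions and the same value range, they can be compared or combined directly.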

CS388 Fall2023 Three Pitches

with No Comments

Saki Pitches

  1. Data visualization (information visualization)
    I would like to create data visualizations about the rate of females in STEM areas in different countries/eras. One of the visualizations I can think of is a map visualization since I would like to focus on areas. I specifically would like to explore why Korean females tend to go into STEM areas in college more than Japanese females, focusing on the education system, role models, and some other social factors.
    Data Sets Examples
  2. Chatbot
    I will create a chatbot which prospective students and current students can ask about Earlham College. It will be great research about Human Computer Interaction. A difficult point might be how to get the information about Earlham into the database on which the chatbot will be based. I am also interested in designing its webpage at the same time.
  3. Evaluation and Redesign of Instagram
    As a computer science major, I am particularly interested in web development. The problem that I found interesting to solve was how people can live with social media without much stress. I always think that social media such as Instagram has now become a platform for people to “show off” part of their lives in some ways. I can do research on what effects different kinds of user experience have on human mental health. I could evaluate the app based on user experience aspects. I can work on researching user experience principles and their effects.

Andex Nguyen – Covid-19 Reception Rate in South-East Asian Countries

with No Comments

About Me:

I am Andex Nguyen, a Senior who’s majoring in Data Science.


Since 2020 when the pandemic struck us for the first time, Covid-19 has always been a globally noteworthy issue. In this project, the questions that I am trying to answer are how different countries in one region, specifically South-East Asian countries, responded to the vaccine, and what may be the reasons for the differences. Often, people remaining unvaccinated stems from political issues, the economic and medical ability to acquire vaccines, and the belief systems of those countries. My country, Vietnam, did very well in preventing Covid-19 by executing strong quarantine policies as our first step, and we proceeded to secure an adequate amount of vaccines for our people. I want to see if the countries around the same area as my country also did a similarly good job. There would also be a difference in cases between those who are fully vaccinated and those who are still waiting to complete their vaccination process.

Data Architecture Diagram:

GitLab link:


with No Comments

Uprisings are an act of resistance/rebellion. Uprisings could be the general population rising against the government (such as the protests in Zimbabwe over poor economic conditions) or people with morally questionable intentions acting against the general populace, such as Boko Haram in Nigeria. It is important to predict when uprisings are going to occur for the sake of peace and safety of everyone. This leads to the question of how we can possibly know before a social upheaval begins. Currently there is no official way to predict social unrest or a protest; people only know when a protest has already started. For this research, I will focus on using machine learning to study how economic trends cause social uprisings. I will use the trained machine learning model to predict when social upheavals are likely to happen. This model will be continually improved by doing exploratory data analysis to see which financial metrics are more reflective of the economy at any given time and using those metrics as input into the model. Continuously improving the accuracy of the model means we get closer to accurately predicting social unrest, consequently helping ensure the peace and safety of the public in the event of a protest.


with No Comments

Hi, my name is Overpower Gore but everyone calls me OPG. I am a senior Data Science student at Earlham College with a passion for information technology. I have held multiple internships impacting positive organizational outcomes through software development, data analysis, and data visualization.

CS 388 Annotated Bibliographies

with No Comments


I am interested in completing research related to the use of machine learning to generate decision trees which control non-player character behavior in video games. Decision trees are relatively interpretable and have a high-level correspondence to behavior trees, which are used in the most common approach to AI for video games (Świechowski, 2022). I also enjoyed working with them when I took Artificial Intelligence and Machine Learning. Of the two environments I would propose using for testing, one is the Micro-Game Karting template in the Unity Asset Store, which is available for free and was used for testing by Mas’udi (2021).

Annotated Bibliography

Chan, M. T., Chan, C. W., & Gelowitz, C. (2015). Development of a Car Racing Simulator Game Using Artificial Intelligence Techniques. International Journal of Computer Games Technology, 2015.

  • Has focus on emulating human behavior
  • Use of waypoint system combined with conditional monitoring system
  • Uses trigger detection because it detects objects within the trigger area, where ray-casting can only detect collisions in one direction
  • Unity’s waypoint system and built-in gravity and vector calculations minimize development effort

This paper argues that Unity, as a development platform for a racing game, has advantages over the use of traditional AI search techniques for implementation of path-finding. These points could help to justify a decision to utilize Unity as an environment for testing.

Fairclough, C., Fagan, M., Mac Namee, B., & Cunningham, P. (2001). Research Directions for AI in Computer Games. Department of Computer Science, Trinity College, Dublin.

  • Covers the role of AI in action games, adventure games, role-playing games, and strategy games
  • Describes the status of AI within the game industry at the time of its publication
  • Lists a wide range of benefits to exploring the field of game AI
  • Discusses the TCD Game AI Project

This article gives a number of reasons for the simplicity of AI used by game developers in comparison to that used in academic research and industrial applications, and mentions a number of well-understood techniques in wide use at the time of its publication, including FSMs and A*. It also offers a number of reasons why research into AI for computer games is a worthwhile endeavor.

French, K., Wu, S., Pan, T., Zhou, Z., & Jenkins, O. C. (2019, May). Learning Behavior Trees from Demonstration. In 2019 International Conference on Robotics and Automation (ICRA) (pp. 7791-7797). IEEE.

  • Focuses on how a behavior tree can be learned from demonstration
  • Understands behavior trees as modular, transparent, reactive, and readily executable
  • The control nodes of behavior trees are Sequence, Fallback, Parallel, and Decorator
  • Algorithm 1 naively converts a decision tree to a behavior tree, while BT-Express takes advantage of the properties of a decision tree in doing so
  • Robot uses behavior tree to execute task without human input
  • Task executed is dusting with a micro fiber duster, requiring six distinct steps

This article gives advantages of behavior trees and describes two methods by which a decision tree can be converted to a behavior tree. This could be helpful in understanding the relationship between the decision trees which I intend to work with and the behavior trees that are prevalent in video game AI.
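
To keep the control-node terminology straight, here is a minimal sketch of how Sequence and Fallback nodes behave when ticked. It is not the conversion algorithm from the paper, and it leaves out the Parallel and Decorator nodes and the "running" status for brevity; the dusting-style task and all names are hypothetical.

```python
SUCCESS, FAILURE = "success", "failure"

class Action:
    """Leaf node that runs a callable returning SUCCESS or FAILURE."""
    def __init__(self, func):
        self.func = func
    def tick(self):
        return self.func()

class Sequence:
    """Ticks children in order; fails as soon as one child fails."""
    def __init__(self, *children):
        self.children = children
    def tick(self):
        for child in self.children:
            if child.tick() == FAILURE:
                return FAILURE
        return SUCCESS

class Fallback:
    """Ticks children in order; succeeds as soon as one child succeeds."""
    def __init__(self, *children):
        self.children = children
    def tick(self):
        for child in self.children:
            if child.tick() == SUCCESS:
                return SUCCESS
        return FAILURE

# Hypothetical task: dust if the duster is already held, else pick it up first.
log = []
def record(name, result):
    def run():
        log.append(name)
        return result
    return run

tree = Fallback(
    Sequence(Action(record("holding_duster?", FAILURE)), Action(record("dust", SUCCESS))),
    Sequence(Action(record("pick_up_duster", SUCCESS)), Action(record("dust", SUCCESS))),
)
result = tree.tick()
```

Here the first branch fails at its condition check, so the Fallback node moves on to the second branch, which picks up the duster and then dusts.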

Guo, X., Singh, S., Lee, H., Lewis, R. L., & Wang, X. (2014). Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning. Advances in Neural Information Processing Systems, 27.

  • Environment used is the Arcade Learning Environment (ALE), a class of benchmark reinforcement learning (RL) problems
  • Results for Deep Q-Network used as a basis
  • Repeated incremental planning via the Monte Carlo Tree Search method UCT
  • Combines UCT-based RL with deep learning by using a UCT agent to provide training data
  • Tests three methods, the most successful of which trains a classifier-based convolutional neural network (CNN) through interleaving of training and data collection to obtain data for an input distribution bearing a closer resemblance to that which would be encountered by the trained CNN

This paper presents results from game-playing agents whose performance was tested on the ALE, an environment which includes a variety of games for the Atari 2600 and thus would allow for testing the performance of decision trees on a range of different games, as well as permitting straightforward evaluation of the agent’s performance through scores achieved on the seven games used to evaluate Deep Q-Network. It also offers an example of the use of a method that would be infeasible for real-time play to generate training data for a faster agent, which, similar to the use of machine learning to generate a decision tree, uses a time-intensive process to create a classifier which can make decisions in real time.
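
For reference, UCT chooses which action to explore at each tree node using the UCB1 rule, which adds an exploration bonus to the average reward of each action. The sketch below shows only that selection step; the node statistics are invented, and the exploration constant of sqrt(2) is a common default rather than a value taken from the paper.

```python
import math

def ucb1_score(mean_reward, child_visits, parent_visits, exploration=math.sqrt(2)):
    """UCB1 value used by UCT to rank a child action."""
    if child_visits == 0:
        return math.inf  # untried actions are explored first
    bonus = exploration * math.sqrt(math.log(parent_visits) / child_visits)
    return mean_reward + bonus

def select_action(stats, parent_visits):
    """Pick the action with the highest UCB1 score.

    `stats` maps action -> (mean_reward, visit_count).
    """
    return max(stats, key=lambda a: ucb1_score(stats[a][0], stats[a][1], parent_visits))

# Hypothetical statistics at one tree node.
node_stats = {"left": (0.60, 50), "right": (0.55, 5), "fire": (0.10, 45)}
chosen = select_action(node_stats, parent_visits=100)
```

In this example the rarely tried action wins despite a slightly lower average reward, because its exploration bonus is much larger.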

Mas’udi, N. A., Jonemaro, E. M. A., Akbar, M. A., & Afirianto, T. (2021). Development of Non-Player Character for 3D Kart Racing Game Using Decision Tree. Fountain of Informatics Journal, 6(2), 51-60.

  • Waypoint system and raycasting used for pathfinding
  • Decision trees used with waypoint system and raycasting to determine NPC actions
  • Performance tested through FPS (frames per second), lap time over ten laps, and a driving test based on collisions with walls over ten laps
  • Agent’s performance compared to that of the ML NPC provided with Karting Microgame

This article offers an example of the use of decision trees to control NPC behavior. The environment it uses, Micro-Game Karting, which is now known as Karting Microgame, also offers an accessible environment for testing the performance of decision trees.
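
As a rough picture of what a decision tree for a kart NPC might look like, the sketch below maps ray-cast distances to a driving action through a few nested tests. The inputs, threshold, and action names are hypothetical and are not taken from the article; a learned tree would have its splits chosen from data.

```python
def choose_action(front_distance, left_distance, right_distance, safe_distance=5.0):
    """Hand-written decision tree mapping ray-cast distances to a kart action."""
    if front_distance >= safe_distance:
        return "accelerate"          # road ahead is clear
    if left_distance > right_distance:
        return "steer_left"          # more room on the left
    if right_distance > left_distance:
        return "steer_right"         # more room on the right
    return "brake"                   # blocked equally on both sides
```

Each path from the first test to a returned action reads as a plain rule, which is the interpretability advantage decision trees offer for NPC behavior.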

Van Lent, M., Fisher, W., & Mancuso, M. (2004, July). An Explainable Artificial Intelligence System for Small-Unit Tactical Behavior. In Proceedings of the national conference on artificial intelligence (pp. 900-907). Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999.

  • Full Spectrum Command, a commercial platform training aid developed for the U.S. Army, includes an Explainable AI system
  • In the after-action review, a soldier within the game can answer questions about the current behavior of its platoon, and the AI system identifies key events
  • Gameplay resembles the Real Time Strategy genre most closely, though with a number of differences
  • NPC AI controls individual units, while Explainable AI works during the after-action phase
  • NPC AI is divided into Control AI, which handles immediate responses and low-level decision-making, and Command AI, which handles higher-level tactical behavior

This paper offers an example of one approach to transparency of video game AI, that of creating a distinct system to explain its action in retrospect, as well as a different context in which explainability is relevant. It could be compared with the use of behavior trees and decision trees to allow for transparency in video games.


I wish to complete research related to the use of databases in video games to improve the efficiency of computations by increasing data locality and taking advantage of the optimizations of database management systems. While the article and dissertation by O’Grady (2019, 2021), which are referenced below, both appear to focus on the feasibility of that approach, it seems likely that additional exploration is warranted regarding demonstrable advantages. For this exploration, I would propose focusing on the execution of path-finding through a database management system, one of the main areas examined by O’Grady (2021). Specifically, I wish to focus on sub-optimal path-finding, since optimality is generally not essential in the context of video games (Botea, 2013), despite the widespread use of A* (Kapi, 2020).

Annotated Bibliography

Abd Algfoor, Z., Sunar, M. S., & Kolivand, H. (2015). A Comprehensive Study on Pathfinding Techniques for Robotics and Video Games. International Journal of Computer Games Technology, 2015.

  • Covers techniques for known 2D/3D environments and unknown 2D environments, represented through either skeletonization or cell decomposition
  • Grids can be divided into regular grids and irregular grids, both of which are widely used in video games
  • As an alternative to grids, hierarchical techniques are used to reduce memory space needed.

This article offers an overview of the various pathfinding techniques used for video games and robotics. It is organized by environment representation and could be helpful in comparing and evaluating path-finding algorithms in order to determine which to use with a specific representation. It might also assist in deciding on an environment representation.

Bendre, M., Sun, B., Zhang, D., Zhou, X., Chang, K. C., & Parameswaran, A. (2015, August). Dataspread: Unifying databases and spreadsheets. In Proceedings of the VLDB Endowment International Conference on Very Large Data Bases (Vol. 8, No. 12, p. 2000). NIH Public Access.

  • Data exploration tool called DataSpread
  • Relational database system used is PostgreSQL
  • Offers Microsoft Excel-based front end while managing all the data in a back-end database in parallel
  • Spreadsheet design focuses on presentation of data, while databases have powerful data management capabilities

This paper describes a system created by unifying relational databases and spreadsheets, with the goal of retaining the advantages of spreadsheets while taking advantage of the power, expressivity, and efficiency of relational databases. The objective and approach of this research bear significant resemblances to those of O’Grady’s 2021 dissertation, in that both seek to increase the efficiency of a system by handling costly computations in a relational database. However, this article differs from O’Grady’s work in that it is based on a spreadsheet interface and an underlying database, rather than focusing on the implementation of specific components of a system within a database management system. 

Botea, A., Bouzy, B., Buro, M., Bauckhage, C., & Nau, D. (2013). Pathfinding in games. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.

  • Explains A*, Manhattan distance, and Octile distance
  • Discusses alternative heuristics such as the ALT heuristic, the dead-end heuristic, and the gateway heuristic
  • Covers approaches involving hierarchical abstraction
  • Discusses symmetry elimination, triangulation-based path-finding, and real-time search
  • Covers compressed path databases and multi-agent path-finding
  • Discusses a range of potential areas for further research, including the generation of sub-optimal paths

This paper explains the logic and advantages of various approaches to path-finding in video games. It also recognizes that in video game contexts, suboptimal paths are acceptable.
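As a rough illustration of A* and the two distance heuristics listed above, here is a minimal Python sketch. It is not taken from Botea et al.; the grid format (a list of strings with '#' for blocked cells) and all function names are assumptions made for this example.

```python
import heapq
import math


def manhattan(a, b):
    """Heuristic for 4-connected grids: sum of the axis distances."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def octile(a, b):
    """Heuristic for 8-connected grids where a diagonal step costs sqrt(2)."""
    dx, dy = abs(a[0] - b[0]), abs(a[1] - b[1])
    return max(dx, dy) + (math.sqrt(2) - 1) * min(dx, dy)


def astar(grid, start, goal, diagonal=False):
    """Return the list of (row, col) cells from start to goal, or None.

    Assumed input format: grid is a list of equal-length strings and
    '#' marks a blocked cell. Diagonal moves may cut corners here.
    """
    steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    if diagonal:
        steps += [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    h = octile if diagonal else manhattan

    open_heap = [(h(start, goal), 0.0, start)]  # entries are (f, g, cell)
    best_cost = {start: 0.0}
    parent = {start: None}
    closed = set()

    while open_heap:
        _, cost, cell = heapq.heappop(open_heap)
        if cell in closed:
            continue  # stale heap entry for an already expanded cell
        closed.add(cell)
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        for dr, dc in steps:
            r, c = cell[0] + dr, cell[1] + dc
            if not (0 <= r < len(grid) and 0 <= c < len(grid[0])):
                continue
            if grid[r][c] == "#":
                continue
            new_cost = cost + (math.sqrt(2) if dr and dc else 1.0)
            if new_cost < best_cost.get((r, c), math.inf):
                best_cost[(r, c)] = new_cost
                parent[(r, c)] = cell
                f = new_cost + h((r, c), goal)
                heapq.heappush(open_heap, (f, new_cost, (r, c)))
    return None
```

Calling `astar(["....", ".##.", "...."], (0, 0), (2, 3))` searches a 3x4 grid with a two-cell wall in the middle row. Passing `diagonal=True` switches both the move set and the heuristic to the octile variant.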

O’Grady, D. (2019). Database-Supported Video Game Engines: Data-Driven Map Generation. BTW 2019.

  • Constructs maps composed of tiles through use of rules applied at specific frequencies
  • Map generation implemented in SQL and run on PostgreSQL 10
  • OpenRA used for on-site demonstration

This paper provides a demonstration of how a portion of a game’s internals can be moved to a database system. Its goal is to show that the benefits of moving expensive computations to a database system can be obtained while still allowing straightforward modification of the process by game developers.

O’Grady, D. (2021). Bringing Database Management Systems and Video Game Engines Together (Doctoral dissertation, Eberhard Karls Universität Tübingen).

  • Implementing data-heavy operations on the side of the database management system (DBMS) increases data locality
  • DBMS benefit from decades worth of development and optimization
  • Explores how databases can be involved in the operation of video game engines beyond their role in data storage
  • Evaluates practicality of three video game engine components when implemented in SQL
  • The third of these components is path-finding
  • The path-finding algorithm investigated is A*.
  • Storing spatial information about the game world in a DBMS allows for easy implementation of additional constraints to avoid collisions upon planning, preliminary reduction of the search space, preparatory analysis of the search space, and performing the path search close to the data
  • The pgRouting extension, implemented in C++, is used as a baseline
  • Two other custom implementations are also explored, both of which expect vertex-weighted graphs
  • The map is assumed to already be in a relational representation, as is the ability of actors to pass over certain types of tiles
  • Finding unilaterally connected components allows path-finding to be sped up in some cases
  • An implementation of A* in pure SQL is included
  • An implementation of Temporal A*, designed to avoid collisions, in SQL is also included
  • An Iterative Path Finding implementation in SQL, which gradually finds paths for multiple agents, is also given
  • Path-finding has to be completed in a timely manner, but players are willing to accept higher latencies for it compared to other actions
  • Shorter paths were found faster using the native implementation of OpenRA
  • As paths increased in length, the time needed to find them in the native implementation gradually increased, while the time needed to find them using pgRouting remained about the same
  • 57% of path searches were found to be within the range of lengths that resulted in agreeable time boundaries
  • The DBMS was found to offer means to find paths sufficiently fast

This dissertation is the main work which I intend to build on in my thesis. Specifically, I wish to further explore the advantages of implementing path-finding in a database management system, as covered in Chapter 4.
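To make the idea of searching "close to the data" more concrete, here is a small sketch of a path search expressed in SQL over a relational tile map. It is only an assumption-laden illustration: it uses SQLite through Python's sqlite3 module rather than PostgreSQL, it performs a plain breadth-first expansion with a recursive query rather than A*, and the table layout (`tiles` with `x`, `y`, `passable`) and the function name are invented for this example, not taken from O’Grady’s implementation.

```python
import sqlite3

# Breadth-first expansion from the start tile. UNION drops duplicate
# (x, y, dist) rows, and the max_steps bound guarantees termination.
REACH_QUERY = """
WITH RECURSIVE reach(x, y, dist) AS (
    SELECT x, y, 0 FROM tiles
    WHERE x = :sx AND y = :sy AND passable = 1
    UNION
    SELECT t.x, t.y, r.dist + 1
    FROM reach AS r
    JOIN tiles AS t ON abs(t.x - r.x) + abs(t.y - r.y) = 1
    WHERE t.passable = 1 AND r.dist < :max_steps
)
SELECT MIN(dist) FROM reach WHERE x = :gx AND y = :gy
"""


def hop_distance(rows, start, goal, max_steps=50):
    """Return the fewest 4-connected steps from start to goal, or None.

    rows is a list of strings ('#' = impassable); start and goal are
    (x, y) pairs. The map is loaded into an in-memory table first.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE tiles ("
            "x INTEGER, y INTEGER, passable INTEGER, PRIMARY KEY (x, y))"
        )
        conn.executemany(
            "INSERT INTO tiles VALUES (?, ?, ?)",
            [
                (x, y, int(ch != "#"))
                for y, row in enumerate(rows)
                for x, ch in enumerate(row)
            ],
        )
        params = {
            "sx": start[0], "sy": start[1],
            "gx": goal[0], "gy": goal[1],
            "max_steps": max_steps,
        }
        (dist,) = conn.execute(REACH_QUERY, params).fetchone()
        return dist  # None when the goal is never reached
    finally:
        conn.close()
```

Because passability is just a column, extra constraints (for example, tiles a particular actor type cannot cross) could be added as further conditions in the `WHERE` clause, which is the kind of flexibility the bullets above describe.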

Kapi, A. Y., Sunar, M. S., & Zamri, M. N. (2020). A review on informed search algorithms for video games pathfinding. International Journal, 9(3).

  • Challenges are limited memory, time constraints, and path quality
  • Most prominent algorithm is A*
  • Popular graph representations are grid maps, waypoint graphs, and navigation meshes, each of which has advantages and disadvantages depending on the type of game
  • NavMesh is often used in optimization to reduce memory consumption
  • Heuristic functions are generally modified to reduce time usage
  • Hybrid path-finding algorithms have been used for optimization
  • Data structure used for implementation of an algorithm can also be changed for optimization

This paper provides an overview of various ways of optimizing informed search algorithms for path-finding in video games, presenting various factors which influence memory and time usage. It would be useful for considering potential consequences of different experiment designs, given its focus on approaches to improving the efficiency of path-finding which do not rely on the optimizations of a database management system.

CS 388 Initial Pitches

with No Comments


I am interested in completing research related to the use of machine learning to generate decision trees which control non-player character behavior in video games. Decision trees are relatively interpretable and have a high-level correspondence to behavior trees, which are used in the most common approach to AI for video games (Świechowski, 2022). I also enjoyed working with them when I took Artificial Intelligence and Machine Learning. The environment I would propose using for testing is the Micro-Game Karting template in the Unity Asset Store, which is available for free and was used for testing by Mas’udi (2021).


Chan, M. T., Chan, C. W., & Gelowitz, C. (2015). Development of a Car Racing Simulator Game Using Artificial Intelligence Techniques. International Journal of Computer Games Technology, 2015.

Guo, X., Singh, S., Lee, H., Lewis, R. L., & Wang, X. (2014). Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning. Advances in Neural Information Processing Systems, 27.

Mas’udi, N. A., Jonemaro, E. M. A., Akbar, M. A., & Afirianto, T. (2021). Development of Non-Player Character for 3D Kart Racing Game Using Decision Tree. Fountain of Informatics Journal, 6(2), 51-60.

Świechowski, M., & Ślęzak, D. (2022). Monte Carlo Tree Search as an Offline Training Data Generator for Decision-Tree Based Game Agents. BDR-D-22-00241. Available at SSRN: or


I am considering completing research related to the use of databases in video games for purposes apart from data storage. While the article and dissertation by O’Grady (2019, 2021), which are referenced below, both appear to focus on the feasibility of that approach, it seems likely that additional exploration is warranted regarding demonstrable advantages. For this exploration, I would propose focusing on the execution of path-finding through a database management system, one of the main areas examined by O’Grady (2021).


Muhammad, Y. (2011). Evaluation and Implementation of Distributed NoSQL Database for MMO Gaming Environment (Dissertation, Uppsala University).

O’Grady, D. (2021). Bringing Database Management Systems and Video Game Engines Together (Doctoral dissertation, Eberhard Karls Universität Tübingen).

O’Grady, D. (2019). Database-Supported Video Game Engines: Data-Driven Map Generation. BTW 2019.

Jovanovic, R. (2013). Database Driven Multi-agent Behaviour Module (Thesis, York University).


I am also considering completing research related to the use of machine learning for recognition of Kuzushiji, an old style of Japanese cursive writing. Kuzushiji mainly appear in works from the Edo period of Japanese history and are difficult to identify correctly due to their lack of standardization (Ueki, 2020). Another difficulty comes from the Chirashigaki writing style, in which text is not written in straight columns (Lamb, 2020). The Center for Open Data in the Humanities has released a dataset of them (Ueki, 2020), which is available at


Clanuwat, T., Bober-Irizar, M., Kitamoto, A., Lamb, A., Yamamoto, K., & Ha, D. (2018). Deep Learning for Classical Japanese Literature. arXiv preprint arXiv:1812.01718.

Clanuwat, T., Lamb, A., & Kitamoto, A. (2019). KuroNet: Pre-Modern Japanese Kuzushiji Character Recognition with Deep Learning. arXiv preprint arXiv:1910.09433.

Lamb, A., Clanuwat, T., & Kitamoto, A. (2020). KuroNet: Regularized Residual U-Nets for End-to-End Kuzushiji Character Recognition. SN Computer Science, 1(3), 1-15.

Ueki, K., & Kojima, T. (2020). Feasibility Study of Deep Learning Based Japanese Cursive Character Recognition. IIEEJ Transactions on Image Electronics and Visual Computing, 8(1), 10-16.

3 Pitches

with No Comments

Machine Learning and Dueling:

A fighter’s skill comes from their training, their learned experience through failure, and thousands of reps. This is also exactly how reinforcement learning agents gain their skills. I propose a multi-agent learning experiment where we attempt to teach two agents how to fight with swords and shields.

There are a few common issues that hinder how AIs learn in a 3D space. First, multi-agent learning (learning with two or more agents) is notoriously challenging, because an agent must learn not only how to move in general but also how to move around another agent. MADDPG, an algorithm published in 2017, proposes a solution to this problem, but it’s still buggy and could present issues. Second, without proper motion capture data, the agents will learn incorrectly, most commonly seen in issues like body parts clipping through the ground. Third, while I have motion capture data for individual fighting movements, such as slashing a sword or blocking with a shield, it is much more difficult to find motion capture data of two humans sword fighting. Finally, swords and shields will be weighted, throwing off the model’s sense of balance, which will take lots of training to get used to.

Using Unity 3D machine learning agents made possible through PyTorch, CUDA, and the Unity-facing toolkit, ML-Agents, I propose creating a deep reinforcement learning agent. Our control strategy would have two parts: a pretraining stage, in which a deep learning model is used to create a neural network that solves a problem, and a transfer learning stage, which uses the prebuilt neural network from our pretraining stage to solve a problem.

Steps of pretraining:

  1. Feed the network a collection of motion capture data of sword fighting sourced from Adobe Mixamo to understand basics, such as how to hold a sword and shield, how to walk with one, etc.
  2. This motion capture data is passed through an Imitation Policy Learning network, which tries to learn through imitation of the motion-captured movement.
  3. The goal from this point is to translate our Imitation Policy to a Competitive Learning Policy, which learns through the goal of winning. To do this, we must perform two steps:
    1. Task encoding, adding understanding to goals of movements
    2. Motor decoding, decoding motor function in a way the Competitive Learning Policy can understand

Steps of Transfer Learning:

  1. Turn the Motor Decoding and Task Encoding data into learning policies for both agents (Agent 1/Agent 2)
  2. Run these policies through the Competitive Learning Policy, which mimics the way humans learn skills, through trial and error against each other.

Social Simulation of Scarcity:

Scarcity: how do we divide resources fairly when there isn’t enough for everyone? The entire subject of economics boils down to this question, which has caused many conflicts throughout history. Sometimes, we have to make tough decisions because of scarcity.

I propose an experiment in a simulated world, with two “tribes” of agents. The agents must eat once a day, which is done by collecting food blocks. Each agent has the ability to move, eat, collect food for their village, ask others to share, and kill each other. Each agent’s main priority is to eat for as many days as possible. If they go a full day without eating, they starve to death.

Then, we test these tribes under different levels of scarcity:

  1. Plenty: There is enough food for both tribes.
  2. Moderate: There is enough food for 75% of agents, but one tribe will not have enough.
  3. Scarce: There is only enough food for 1 tribe.
  4. Unfair disruption: There is only enough food for 1 tribe, and 1 tribe controls all of it.
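The four levels above could be captured as simple configuration data. The sketch below is one possible encoding, assuming two tribes of equal size and one food block per agent per day; the class name, field names, and the choice of tribe 0 as the controlling tribe are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Scenario:
    """One scarcity level (hypothetical structure for this sketch)."""
    name: str
    food_fraction: float  # share of all agents that can be fed each day
    controlling_tribe: Optional[int] = None  # tribe that holds all the food


SCENARIOS = {
    "plenty": Scenario("Plenty", 1.0),
    "moderate": Scenario("Moderate", 0.75),
    "scarce": Scenario("Scarce", 0.5),
    "unfair": Scenario("Unfair disruption", 0.5, controlling_tribe=0),
}


def daily_food(scenario, tribe_size, tribes=2):
    """Number of food blocks to spawn per day, rounded down."""
    return int(scenario.food_fraction * tribe_size * tribes)
```

With two tribes of 10 agents, `daily_food(SCENARIOS["moderate"], 10)` gives 15 blocks for 20 agents, so at least one tribe must come up short each day.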

If the agents have plenty, will they remain peaceful? If there’s a moderate distribution, will the tribes share with each other and lose the same number of members? Will they fight the other tribe for resources? If the resources are scarce, will a tribe sacrifice members of its own tribe, or steal from the other tribe? Is fighting inevitable with an unfair disruption?

In order to do this experiment, we must allow for a level of teamwork. Each tribe has its own Competitive Learning Policy, which builds a neural network that judges its success by how many tribe members are left after a given period of days. These networks may give tasks such as move, attack, or collect food to any tribe member.

Usually, two agents learning against each other means slower learning, but because the neural networks are essentially playing a simple strategy game against each other, their learning might be accelerated by the competition.

Creating Unique Architecture with Machine Learning:

Necessity, the mother of invention. From the water-carrying aqueducts of Rome to the houses stacked across the mountains of Tibet, people have made unique and intricate structures out of necessity. Typically, neural networks are trained with data sets, which they then attempt to imitate according to a metric of success. However, what they produce isn’t truly unique; it’s an imitation of what a researcher asks them to create. I propose an experiment to create entirely unique, AI-generated 3D architecture by providing the necessity to build.

First, we must make an environment simple enough for a neural network to learn to create in, but sufficiently complex to cultivate interesting innovation. Unity 3D is ideal for creating this environment: it already has a well-developed physics engine, makes adding environmental conditions simple, allows for easy map creation, has many free assets to use, and supports machine learning.

Second, we must create necessity. Two agents will be created, a Creature and a Creator agent. The Creature is a simple bot whose only goal is to keep its three needs satisfied. The Creature wants to stay warm, dry, and well fed. Warmth is decreased when “wind” comes in contact with our agent, dryness is decreased when rain comes in contact with our agent, and hunger is decreased by collecting food blocks.

The Creator agent has a simple goal: keep the Creature as warm, fed, and dry as possible. The Creator is a neural network with the ability to build, under the conditions of the physics engine. This neural network has permission to use a limited amount of two different resources: wood and stone. Stone is much stronger than wood, but also much heavier. Wood is weaker, so it breaks under weight, but it is also lighter and more accessible. Wood and stone come in blocks, and wood also comes in plank form, which is worse for walls but better for roofs. The Creator has the ability to “nail” materials together, which means the materials may be connected together through touching vectors, unless the weight of materials attached to it is too heavy to hold.

Beyond coming with a well-developed and well-tested physics engine, the beauty of the Unity engine is its ease of customization. Because it’s a game engine, adding new conditions is made easy. Once our Creator network has learned to build basic structures, we can up the complexity of the structures by adding more needs. Possibilities include:

  1. Pooling water: If rain can collect in pools, structures may have to be constructed with slanted roofs
  2. Stronger winds: If stronger winds destroy less stable structures, structures may have to be built more sturdily.
  3. Creature needs sunlight: The need to have sunlight come into the house may lead to structures having windows, or interesting alternatives

The goal is to create unique and interesting structures, and a simulation made too simple will lead to uninteresting buildings. Unity 3D’s easy ability to add and remove these conditions will be vital to this experiment, as these new conditions may create innovation, or break and confuse our neural network.

Hailee Dang

with No Comments

About me

My name is Hang “Hailee” Dang. I am a senior majoring in Data Science with interest in Finance, Data and Business Analytics.

My project

  • Purpose: This paper aims to find a new metric for predicting a technology company’s stock performance besides the 5 fundamental metrics that have been proven and are frequently used in corporate valuation.
  • Design/Methodology: The paper explains the reasons for choosing the 5 fundamental metrics using existing papers and studies. Then, using data from an online financial database system with R, it evaluates the correlation between these metrics and actual performance over a 10-year period. The results will be ranked from least to most correlated with actual stock performance. A new highly correlated metric, such as media mentions or number of articles about the company, should be added to the model to increase the correctness of the prediction.
  • Research limitations/implications: This theoretical model only applies to the time range 2011-2021 and to large, profitable corporations in the technology industry. Any applications outside this data frame would not be reflected correctly.
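The ranking step described in the methodology can be sketched in a few lines. The project itself uses R; the Python below is only an illustration, assuming Pearson correlation is the measure and that each metric is a list of yearly values aligned with a list of yearly stock returns. The function names and the metric names in the usage note are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def rank_metrics(metrics, performance):
    """Return (name, r) pairs sorted from least to most correlated.

    metrics maps a metric name to its series; strength of correlation
    is judged by the absolute value of r.
    """
    scores = {name: pearson(vals, performance) for name, vals in metrics.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]))
```

For example, `rank_metrics({"metric_a": [...], "metric_b": [...]}, yearly_returns)` would list the weakest predictor first, so a candidate such as media mentions can be compared directly against the existing fundamentals.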

Architecture diagram

Gitlab project

Software demonstration video

Introduction to my project

with No Comments

Hello! My name is Tamara and I am a senior Computer Science Major at Earlham College and this is my capstone project. I am interested in applying machine learning and other computational techniques for better understanding of social issues. Here is the abstract of my capstone project proposal.

Unmanned Aerial Vehicles (UAVs) have only recently been applied
in the detection of clandestine graves. With the growing technical
abilities of UAVs, more opportunities are available in methods
for recording data. This opened many doors for detecting hidden
graves because now many different sensing devices can be used
in remote areas that were inaccessible to humans. This study will
compare the use of hyperspectral and RGB images taken by drones
to detect hidden graves. It is aiming to compare how accurately
the graves can be detected post-image processing using the same
technique. The primary motivation is to help future researchers
more easily decide which data collection technique they should use.
The processing technique used on both data sets is a model with
an edge detection algorithm.
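The abstract does not say which edge detection algorithm the model uses. As one common possibility, here is a minimal pure-Python sketch of the Sobel operator applied to a single grayscale band; the input format (a list of rows of numbers), the function names, and the threshold parameter are assumptions for illustration. The same function could be run on each RGB channel or each hyperspectral band.

```python
import math


def sobel_magnitude(image):
    """Return the Sobel gradient magnitude for every pixel.

    image is a list of equal-length rows of numbers. Border pixels,
    which lack a full 3x3 neighbourhood, are left at 0.0.
    """
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            # Horizontal gradient: right column minus left column.
            gx = (
                image[r - 1][c + 1] + 2 * image[r][c + 1] + image[r + 1][c + 1]
                - image[r - 1][c - 1] - 2 * image[r][c - 1] - image[r + 1][c - 1]
            )
            # Vertical gradient: bottom row minus top row.
            gy = (
                image[r + 1][c - 1] + 2 * image[r + 1][c] + image[r + 1][c + 1]
                - image[r - 1][c - 1] - 2 * image[r - 1][c] - image[r - 1][c + 1]
            )
            out[r][c] = math.hypot(gx, gy)
    return out


def edge_mask(image, threshold):
    """Mark pixels whose gradient magnitude reaches the threshold."""
    return [[m >= threshold for m in row] for row in sobel_magnitude(image)]
```

Running the identical function and threshold on both data sets keeps the processing step constant, so differences in detection accuracy can be attributed to the imaging method.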


Project intro

with No Comments

My name is Khoa, and I am a senior double majoring in Quantitative Economics and Data Science. My tentative project focuses on the application of machine learning to study the physical-chemical properties of different molecular structures, which has proven useful in many fields like drug discovery, materials science, biology, etc.

Project Intro

with No Comments

Hi everyone, I am Devin. I’m a Data Science major and a senior.

For my project, I am creating a win probability model for the National Hockey League. Hockey has been one of my favorite sports since I was a little kid and applying statistical concepts to one of my favorite sports excites me. Ultimately, I would like to compare the model that I build to other win probabilities for the NHL. The model accuracy would be assessed by how well it can predict which team wins the game.
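As a small sketch of how that assessment could work, the functions below score a list of predicted home-win probabilities against actual results. Accuracy matches the description above; the Brier score is an extra measure added here as an assumption, since it also rewards well-calibrated probabilities. The function names and the 0.5 decision threshold are illustrative choices.

```python
def accuracy(probs, outcomes, threshold=0.5):
    """Share of games where the predicted winner matched the result.

    probs are predicted home-win probabilities; outcomes are 1 when
    the home team won and 0 otherwise.
    """
    correct = sum((p >= threshold) == bool(o) for p, o in zip(probs, outcomes))
    return correct / len(probs)


def brier_score(probs, outcomes):
    """Mean squared error of the probabilities (lower is better)."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)
```

Computing both numbers for my model and for other published NHL win probabilities on the same set of games would give a like-for-like comparison.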

Project Introduction

with No Comments

I am Tra-Vaughn James, a Computer Science major and a junior:

Today, bioinformatics is an evolving field in which computing resources have become more powerful and readily available, and workflows have increased in complexity. New workflow management tools (WMTs) attempt to fully harness this computational power, creating intuitive implementations that utilize machine learning techniques. This streamlines the design of complex workflows. However, overarching problems remain that newer workflow management tools do not fully address: they are too specific to particular use cases, and they present a steep learning curve for users unfamiliar with computing environments. Many implementations require one to spend copious amounts of time understanding the tool and adjusting already existing frameworks to one’s needs, creating frustration and inefficiency. This problem is experienced by novice and experienced bioinformaticians alike. Using OpenWDL, a Workflow Description Language, as the basis, I seek to develop a workflow management tool that is open in use, coupled with a GUI. As OpenWDL is a widely known WMT, its familiarity will aid my implementation’s usability. Additionally, the GUI will present a more welcoming environment than the command line interface that many WMTs employ. To assess the effectiveness of my implementation, I will compare its usability and openness to other bioinformatics pipeline managers such as Snakemake and Nextflow.

Research on Low-Cost Portable Air-Quality Sensor System

with No Comments


A low-cost portable sensor system allows users to monitor the different components of the air quality around them. There is a need for a sensor system like this because of the millions of deaths and diseases that occur every year due to air pollution worldwide. My planned contribution is for the prototype sensor system I design and build to be as DIY (do it yourself) and low-cost as possible while still being usable in a theoretical online network for large-scale pollution mapping in real time. I will program the sensors together and investigate the calibration of the sensors because they can fall out of calibration after extended periods of time. I will evaluate the results of building and using the sensor system by testing the robustness of the system indoors and outdoors, analyzing the repeatability of the experiment, assessing how the system could be further improved to make it more financially accessible to users, testing system portability, and examining the practicality of the theoretical network that could be made with multiple sensor systems.

Software Diagram

Link To GitLab Project

Link To Software Demonstration Video

Three Pitches

with No Comments

Pitch 1: Website that creates tangible, engaging visualizations of probabilities.

Humans are poor at conceptualizing probability. I notice that one common reason anti-vaxxers give for refusing the vaccine is that they’d rather take their chances with catching COVID than get the shot and come down with complications, even though such complications are far rarer than dying from COVID. The objective of this project would be to create a website with high user engagement that would inform users of the odds of certain events happening using tangible, engaging visualizations. The user would be able to select from a list of events to compare. Visualizations that represent a dangerous event that is likely to occur will look visually more dangerous than those that are less likely to occur.

I have found some potentially useful datasets to use



  • What kinds of graphics would help reduce the worry that something might happen? Colors, structure, font, etc.
  • Do moving images impact the user more than static images? Do they keep users on the site?
    • Note: I am having a hard time finding existing sources, and I think the keywords I am using are wrong.

Some sources I found that may be useful/interesting: 

Pitch 2: Formality indicator for Japanese text

Google Translate and DeepL don’t really give users the option to indicate what context they need to use Japanese in. One key thing that distinguishes different styles of speech or written communication is the choice of vocabulary. On Jisho, an online Japanese dictionary, some words are marked with the context they are used in, e.g. colloquial, slang, or sonkeigo (honorific or respectful language). My proposed project would use web scraping from online dictionaries to gather data about which words are used in which context and would tell the user what level of formality the text is in and who it would be appropriate to say or send it to.

Ana sent me a list of tools that may be useful. SudachiPy seems promising.
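Here is a rough sketch of how the formality check could work once the text has been split into dictionary-form words by a tokenizer. Everything in it is an assumption for illustration: the small lexicon is hand-written (the real one would come from scraped dictionary tags), the tag assignments and scores are my own guesses rather than values copied from any dictionary, and the input is assumed to be already tokenized.

```python
from collections import Counter

# Hypothetical hand-written lexicon: dictionary-form word -> register tag.
REGISTER_TAGS = {
    "召し上がる": "sonkeigo",
    "いらっしゃる": "sonkeigo",
    "めっちゃ": "colloquial",
    "やばい": "slang",
}

# Illustrative weights: positive is more formal, negative is more casual.
FORMALITY_SCORE = {"sonkeigo": 2, "colloquial": -1, "slang": -2}


def formality(tokens):
    """Return (label, tag_counts) for a list of dictionary-form tokens."""
    tags = Counter(REGISTER_TAGS[t] for t in tokens if t in REGISTER_TAGS)
    score = sum(FORMALITY_SCORE[tag] * count for tag, count in tags.items())
    if score > 0:
        label = "formal"
    elif score < 0:
        label = "casual"
    else:
        label = "neutral"
    return label, dict(tags)
```

Returning the tag counts alongside the label would let the site explain why a text was judged formal or casual, not just report the verdict.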

Other sources

Pitch 3: Visualization of cause of death

I feel like death statistics are inherently somewhat dehumanizing, and I want to use recent death statistics to create a more human representation. The user would be able to enter certain demographics, such as location, age, sex, etc. This would be pretty similar to pitch one, but I would want to represent each person with something that people would empathize with. Perhaps instead of using the data outright, I would gather patterns from the data to create a fake population. I’m not sure how I might do this


  • What kinds of images do people empathize with? Would it be better to go with something more amorphous, such as blobs with cute faces or stick figures?


  • Keeping things tactful while tackling a sensitive issue. 

I have found several sources for cause of death data on Kaggle, but not all are completely recent, which is ideally what I am looking for. I did find recent data specifically for Brazil. The CDC has some data as well; I just need to figure out how to get at it.

2 Pitches with Bibliography

with No Comments

Pitch #1

The use of computing resources allows the processing of biological data and computational analysis. However, converting this data into useful information requires the use of a large number of tools, parameters, and dynamically changing reference data. As a result, workflow managers such as Snakemake and OpenWDL were created to make these workflows scalable, repeatable, and shareable. However, many of these workflow managers are ambiguous about how to create workflows, often lacking the specificity many workflows require. I plan on creating a bioinformatics workflow tool that can be tailored to particular workflows.

Bioshake: A Haskell EDSL for Bioinformatics workflows

Justin Bedő. 2015. Experiences with workflows for automating data-intensive bioinformatics – biology direct. (August 2015). Retrieved January 9, 2022 from 

  • Bioshake raises many properties to the type level, allowing the correctness of a workflow to be statically checked during compilation and catching errors before any lengthy execution process. Bioshake is built on top of Shake, an industrial-strength build tool, thus inheriting many of its reporting features such as “robust dependency tracking, and resumption abilities”
  • Paper explains that Bioshake is an EDSL for specifying workflows that compiles down to an execution engine (Shake).

Reproducible, scalable, and shareable analysis pipelines with bioinformatics workflow managers

Laura Wratten, Andreas Wilm, and Jonathan Göke. 2021. Reproducible, scalable, and shareable analysis pipelines with bioinformatics workflow managers. (September 2021). Retrieved January 9, 2022 from 

Paper highlights the key features of workflow managers and compares commonly used approaches for bioinformatics workflows.

A versatile workflow to integrate RNA-seq genomic and transcriptomic data into mechanistic models of signaling pathways

Martín Garrido-Rodriguez et al. 2021. A versatile workflow to integrate RNA-seq genomic and transcriptomic data into mechanistic models of signaling pathways. (February 2021). Retrieved January 9, 2022 from

MIGNON is used for the analysis of RNA-Seq experiments. Moreover, it provides a framework for the integration of transcriptomic and genomic data based on a mechanistic model of signaling pathway activities that allows for a biological interpretation of the results, including profiling of cell activity. The entire pipeline was developed using the Workflow Description Language (OpenWDL). All the steps of the pipeline were wrapped into WDL tasks that were designed to be executed on an independent unit of containerized software by using Docker containers, which prevent deployment issues. The paper is an excellent source for seeing how WDL performs as a workflow management language and the various problems that can arise from it.

Planning Bioinformatics workflows using an expert system.

Xiaoling Chen and Jeffrey T. Chang. 2017. Planning Bioinformatics workflows using an expert system. (January 2017). Retrieved January 9, 2022 from 

  • Paper discusses a method to automate the development of pipelines, creating the Bioinformatics Expert System (BETSY). BETSY is a backwards-chaining rule-based expert system comprised of a data model that can capture the essence of biological data, and an inference engine that reasons on the knowledge base to produce workflows.  
  •  Evaluations within the paper found that BETSY could generate workflows that reproduce and go beyond previously published bioinformatic results.

ACLIMATISE: Automated Generation of Tool Definitions for bioinformatics workflows.

Michael Milton and Natalie Thorne. 2020. ACLIMATISE: Automated Generation of Tool Definitions for bioinformatics workflows. (December 2020). Retrieved January 9, 2022 from 

Paper presents aCLImatise, a utility for automatically generating tool definitions compatible with bioinformatics workflow languages by parsing command-line help output. This utility can be used within our workflow to create tool definitions. Workflow definitions must be customized according to the use case; however, tool definitions simply describe a piece of software and are therefore not coupled to a single workflow or context, so aCLImatise will not be a hindrance to workflow creation.

SciPipe: A workflow library for agile development of complex and dynamic bioinformatics pipelines.

Samuel Lampa, Martin Dahlö, Jonathan Alvarsson, and Ola Spjuth. 2019. SciPipe: A workflow library for agile development of complex and dynamic bioinformatics pipelines. (April 2019). Retrieved January 9, 2022 from 

  • SciPipe utilizes dynamic scheduling, which allows new tasks to be parametrized with values obtained during the workflow run, and the flow-based programming (FBP) principles of separate network definition and named ports allow the creation of a library of reusable components.
  • SciPipe workflows are written as Go programs, and thus require the Go tool chain to be installed for compiling and running (you have to have some basic knowledge of Go). SciPipe assists with particular workflow constructs common in machine learning, such as extensive branching, parameter sweeps, and dynamic scheduling and parametrization of downstream tasks. Implementations of SciPipe include a machine learning pipeline in drug discovery, a genomics cancer analysis pipeline, and an RNA-seq/transcriptomics pipeline.

Using rapid prototyping to choose a bioinformatics workflow management system

Michael J. Jackson, Edward Wallace, and Kostas Kavoussanakis. 2020. Using rapid prototyping to choose a bioinformatics Workflow Management System. (January 2020). Retrieved January 9, 2022, from 

  • The paper describes RiboViz, a package that is specific to ribosome data and to understanding protein synthesis. It is, however, implemented in Python.
  • The paper tests a range of workflow management systems and compares and contrasts them.
  • Workflow management systems require that each data analysis step be wrapped in a structured way. RiboViz uses these wrappers to decide which steps to run and how to run them, and takes charge of running the steps, including error reporting.

Pitch #2

New technologies have been evolving to aid life within the home. Video doorbells, cameras, and smart devices make many tasks much simpler than they used to be. However, the risk that someone with malicious intent will hack and harm your home network has also increased, and a security failure could expose all of your personal information. As a result, many organizations provide VPN services as a means of protecting people from malicious hackers and malware. However, these VPNs come with drawbacks such as higher cost and limitations dictated by the provider, and paid services place you in the hands of the operator and its various cloud/network providers with no certainty that these providers will not snoop around in your data.

A VPN server that a user can host on their local machine addresses these problems, with the added benefit that the user can securely access and maintain their home network. The server will run in a virtual machine, leaving the user in complete control of it and its functions. This should increase the efficiency of the VPN, as the user no longer has to go through the provider's network. My goal is to automate and open-source this process, creating a VPN server that an average user can easily launch and use to maintain access to their home network, while still being able to edit and change it for more robust security. I seek to compare this to similar paid services to identify which is more secure for the user.

What is a VPN?

Paul Ferguson and Geoff Huston. 1998. What is a VPN. (April 1998). Retrieved January 7, 2022 from 

The paper defines what a VPN is and describes different types of VPNs, such as network-layer VPNs, how they are constructed, and the underlying protocols and techniques used to create one. It breaks down the various VPN types according to the layers of the TCP/IP protocol stack and describes VPN concepts such as controlled route leaking and tunnelling. Overall, this paper is a good source for understanding the basics of what a VPN is, as well as the types of VPN and the procedures for setting one up.

Implementation and analysis ipsec-vpn on cisco asa firewall using gns3 network simulator

Dwi Ely Kurniawan, Hamdani Arif, N. Nelmiawati, Ahmad Hamim Tohari, and Maidel Fani. 2019. Implementation and analysis ipsec-vpn on cisco asa firewall using gns3 network simulator. (March 2019). Retrieved January 8, 2022 from 

This paper provides an example of constructing a VPN and testing it in a virtual setting, which is similar to the approach I am thinking of using. It is built using the GNS3 network simulator software and a virtual Cisco ASA firewall. The results show that VPN connectivity is strongly influenced by the hardware used and depends on the Internet bandwidth provided by the Internet Service Provider (ISP). In addition, the security testing results show that an IPSec-based VPN can provide security against Man-in-the-Middle (MitM) attacks. However, the VPN is still vulnerable to network attacks such as Denial of Service (DoS), which can leave the VPN server unable to serve VPN clients and cause it to crash.

Enhancing security and privacy in local area network with TORVPN using Raspberry Pi as access point

Mohamad AfiqHakimi Rosli. 2019. Enhancing security and privacy in local area network with TORVPN using Raspberry Pi as access point. (October 2019). Retrieved January 8, 2022 from 

Provides another method of using VPN servers to protect one’s local network.

Involves the Tor routing technique, which provides an extra layer of anonymity and encryption.

Although this approach requires a Raspberry Pi for its implementation, it would eliminate the need to install and configure software while also making such services accessible to others.

A Remote Access Security Model based on Vulnerability Management

Samuel Ndichu, Sylvester McOyowo, and Henry Wekesa. 2020. A remote access security model based on … – MECS press. (October 2020). Retrieved January 11, 2022 from 

  • Paper addresses significant vulnerabilities from malware, botnets, and Distributed Denial of Service (DDoS).
  • Proposes a novel approach to remote access security: passively learning features from packet capture files using machine learning, then classifying traffic with a classifier model.
  • Adopts network tiers to facilitate vulnerability management (VM) in remote access domains.
  • Performs regular traffic simulation using the Network Security Simulator (NeSSi2) to set a bandwidth baseline, then uses this as a benchmark to investigate malware spreading capabilities and DDoS attacks by continuous flooding in remote access.
  • Although the paper offers an alternative to VPNs, it is still important to examine. The main aim of my pitch is to present a more secure VPN technology for private users, so if another approach can achieve something similar without the drawbacks, it is worth analyzing.

Client-Side Vulnerabilities in Commercial VPNs

Bui Thanh, Rao Siddharth, Antikainen Markku, and Aura Tuomas. 2019. Client-side vulnerabilities in commercial VPNs. (November 2019). Retrieved January 11, 2022 from 

  • The paper studies the security of commercial VPN services.
  • Analyzes common VPN protocols and implementations on Windows, macOS, and Ubuntu.
  • The study found multiple configuration flaws that allow attackers to strip off traffic encryption or to bypass authentication of the VPN gateway.
  • If commercial VPNs have such flaws, this paper presents important ideas and fixes that I should apply to my own VPN to make it as secure as possible.

Beyond the VPN: Practical Client Identity in an Internet with Widespread IP Address Sharing 

Yu Liu and Craig A. Shue. 2021. Beyond the VPN: Practical client identity in an internet with widespread IP address sharing. (January 2021). Retrieved January 10, 2022 from 

  • The paper examines the motivations and limitations associated with VPNs and finds that VPNs are often used to simplify access control and filtering for enterprise services.
  • Provides an alternative approach to VPN use. The authors' implementation preserves simple access control and eliminates the need for VPN servers, redundant cryptography, and VPN packet header overhead. The approach is incrementally deployable and provides a second factor for authenticating users and systems while minimizing performance overhead.

Research on network security of VPN technology

Zhiwei Xu and Jie Ni. 2021. Research on network security of VPN Technology. (May 2021). Retrieved January 11, 2022 from 

  • The paper describes the main function of a VPN as building a tunnel through the public network using encryption, which allows data to be transmitted safely and prevents others from seeing it.
  • The paper analyzes an IPSec VPN, which provides remote access through the IPSec protocol.
  • The paper claims that the advantage of an IPSec VPN is that it is a network-to-network method that can establish multilevel networking in a fixed networking mode, is suitable for inter-institutional networking, and gives users transparent access without needing to log in.

CS 388 Pitches

Idea #1

Use machine learning to identify whether a webpage is malicious. Websites are often identified as malicious through blacklists, but maintaining a blacklist for every website is cumbersome, especially with new sites constantly appearing. I would use ML to identify malicious sites based on keyword density and improve upon existing methods. Other features that could be used to identify a malicious website include URL length, website age, and country of origin. Identifying the most important features to use for ML will be key to the project. A nuance I could add would be to identify the type of attack associated with the URL and rank its severity. Short URLs are one way that attackers attempt to circumvent detection. Being able to expand short URLs in order to extract features could make current tools more effective.
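
As a rough illustration of the feature-extraction step described above, the sketch below turns a URL into a small dictionary of lexical features that a classifier could consume. The feature names, the `extract_url_features` function, and the list of shortener domains are all assumptions made for this example. Features such as website age and country of origin would need external lookups and are left out.

```python
from urllib.parse import urlparse

# Hypothetical list of URL-shortener domains, used only for illustration.
SHORTENER_DOMAINS = {"bit.ly", "t.co", "tinyurl.com"}


def extract_url_features(url):
    """Compute simple lexical features from a URL string."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_at_symbol": "@" in url,
        "uses_https": parsed.scheme == "https",
        # Shortened URLs would need to be expanded before other features apply.
        "is_shortened": host in SHORTENER_DOMAINS,
    }


if __name__ == "__main__":
    print(extract_url_features("https://bit.ly/3abc"))
    print(extract_url_features("http://192.168.0.1/login"))
```

Which of these features actually matter would have to be established by training and evaluating a model, which is the core question of the pitch.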

Idea #2

Calculate the expected goals of Premier League soccer teams. Expected goals is commonly used as a predictor to help analysts identify skillful players and predict the winning team. Many datasets are available, and existing techniques could be analyzed for efficacy and improved upon. A possible nuance I could add is comparing a player's expected goals to their wages, or a team's expected goals to its total wage bill, to find efficient teams.
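
The wage comparison can be sketched with simple arithmetic. Expected goals for a team is conventionally the sum of the scoring probabilities of its shots. Dividing that total by the wage bill gives one possible efficiency measure. The function names and the "per million" unit are assumptions for illustration, and the per-shot probabilities would come from a separately trained shot model.

```python
def expected_goals(shot_probabilities):
    """Sum per-shot scoring probabilities into an expected-goals total."""
    return sum(shot_probabilities)


def xg_per_million(shot_probabilities, wage_bill_millions):
    """Expected goals per million of wage spend (a hypothetical efficiency metric)."""
    if wage_bill_millions <= 0:
        raise ValueError("wage bill must be positive")
    return expected_goals(shot_probabilities) / wage_bill_millions


if __name__ == "__main__":
    # Made-up shot probabilities and wage bill, purely for demonstration.
    shots = [0.5, 0.25, 0.25]
    print(expected_goals(shots))
    print(xg_per_million(shots, 4))
```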

Idea #3

Use machine learning to identify network attacks, specifically DoS attacks. Most current methods use huge and cumbersome MIB databases. I would explore more efficient, less time- and resource-consuming methods for classifying the data and identifying anomalies within network traffic. Data can be classified by where it comes from to help determine whether it may be malicious. Publicly available research on this topic is limited, as most of it is specific to a domain or relies on private data.

CS388 Pitches

Idea 1 : Maze Generation

I would like to examine maze generation algorithms for the purpose of generating more challenging domains for search algorithms to solve. Creating domains that challenge existing search algorithms can assist in the development of more robust search algorithms that can avoid certain pitfalls of existing algorithms. Additionally, mazes are widely understood and have efficient state changes, which can allow for more algorithm based examinations in the future. I would like to develop a system for rating the “hardness” of a given maze, as well as creating a maze generation algorithm that can generate mazes that have a higher or lower “hardness” rating. 

Idea 2 : Cave Generation Using Cellular Automata

I would like to experiment with using cellular automata to generate cave structures. This has applications in procedural level generation for video games, artistic potential, and depending on the techniques used, it could also be useful in real world geological simulations. There has already been some work done in the area which gives ample room for extension and exploration. Most of the materials seem to be focused on 2D maze generation, so it may be fruitful to focus on generating 3D structures, or extending the existing techniques from 2D CA to 3D. 
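
A minimal 2D sketch of the cellular automata approach is shown below. It fills a grid with random walls and then repeatedly applies a smoothing rule in which a cell becomes a wall when at least five of its eight neighbours are walls. The fill probability, the threshold, and the number of steps are assumptions chosen for illustration. This is one common variant, not a specific method from the literature surveyed for this pitch.

```python
import random


def generate_cave(width, height, fill_probability=0.45, steps=4, seed=0):
    """Return a grid of booleans where True marks a wall and False open floor."""
    rng = random.Random(seed)
    grid = [[rng.random() < fill_probability for _ in range(width)]
            for _ in range(height)]
    for _ in range(steps):
        grid = smooth(grid)
    return grid


def count_wall_neighbors(grid, row, col):
    """Count walls among the 8 neighbours of a cell."""
    height, width = len(grid), len(grid[0])
    count = 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            r, c = row + dr, col + dc
            # Cells outside the grid count as walls so the cave stays enclosed.
            if r < 0 or r >= height or c < 0 or c >= width or grid[r][c]:
                count += 1
    return count


def smooth(grid):
    """Apply one automaton step: wall if 5 or more neighbours are walls."""
    return [[count_wall_neighbors(grid, r, c) >= 5
             for c in range(len(grid[0]))]
            for r in range(len(grid))]


def render(grid):
    """Draw walls as '#' and open floor as '.'."""
    return "\n".join("".join("#" if cell else "." for cell in row)
                     for row in grid)


if __name__ == "__main__":
    print(render(generate_cave(40, 20, seed=1)))
```

Extending this to 3D would mainly mean counting 26 neighbours instead of 8 and choosing a new threshold, which is part of what the project would explore.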

Idea 3 : Origami Crease Pattern Generation Tool

I would like to build an application that generates an origami crease pattern from a 3D model, with a stretch goal being to implement 3D scanning from a camera. Folding techniques used in origami are seeing more and more uses in engineering contexts. The new Webb telescope for instance used origami for its heat shield and mirrors. Being able to generate crease patterns based on a source model may allow for more widespread use of similar techniques. This project will involve image analysis and potentially machine learning. 

Annotated Bibliography

Maze generation and solving:

  • Hai Zhou, Maze Router: Lee Algorithm
    • These slides from Northwestern University give a great overview of maze routing algorithms. They discuss the Lee algorithm, Hadlock’s algorithm, Soukup’s algorithm, and more, along with their strengths, weaknesses, and runtime comparisons. This will be a useful resource for learning a little about various approaches and narrowing down a few I might be interested in implementing.
      • Provides time and space complexity of different approaches presented
      • Some insights on ways to reduce the running time
      • Compares algorithms based on whether each of them is always able to find a path, whether the path found is the shortest, and whether the algorithm works on both grids and lines
      • Introduces some ideas about multi-layered routing
  • Xiao Cui, Hao Shi, A*-based Pathfinding in Modern Computer Games, IJCSNS International Journal of Computer Science and Network Security, VOL.11 No.1, January 2011
    • This publication mainly focuses on A* based pathfinding in video games. According to the paper, this algorithm provides a provably optimal solution to pathfinding and has succeeded Dijkstra’s algorithm, BFS, DFS, and others. It explains how the algorithm works and shows the various implementation and optimization techniques. I believe the analysis of the latter will be most useful for the project I have in mind since I would like to look into the running speed/ memory consumption of maze-solving tools.
      • Provides optimization ideas for A* algorithm
      • Provides negative aspects of A*, including the huge amount of memory it requires to run
      • Describes some common applications of the algorithm and how it’s used to solve tricky problems
  • Geethu Elizebeth Mathew, Direction Based Heuristic for Pathfinding in Video Games, Procedia Computer Science   47  (2015)  262 – 271
    • This paper reviews widely used pathfinding algorithms, including A* and Dijkstra’s algorithm, and proposes a new method that the author claims outperforms the others. The method is based on direction heuristics, which ensure that irrelevant parts of the map remain unsearched. I find the approach extremely interesting and would like to explore its possible weaknesses in more detail.
      • This approach doesn’t care too much about whether a path is the shortest mathematically, as long as it’s virtually short enough
      • The algorithm will only work in a grid-based environment as most of the game worlds are divided into grids for simplicity
      • The method focuses on reducing the resources used in the process as much as possible
      • Compares the behavior of the algorithm to other algorithms
  • Semuil Tjiharjadi, Marvin Chandra Wijaya, Erwin Setiawan, Optimization Maze Robot Using A* and Flood Fill Algorithm, Maranatha Christian University, Bandung, Indonesia, International Journal of Mechanical Engineering and Robotics Research Vol. 6, No. 5, September 2017
    • This paper expands on the use of the A* and Flood Fill algorithms in robotics. It discusses the hardware design of a robot created to solve a 3-dimensional maze, as well as testing results for both A* and the Flood Fill algorithm. This paper caught my attention because I am really interested in robotics as well, so I would love to recreate some of these tests and compare them to the approach discussed in the previous paper (the direction heuristics-based method).
      • Based on the testing results, the size of the maze used in this research (5 x 5) was not sufficient to reveal the differences between A* and the Flood Fill algorithm
      • Using a larger area is recommended for more accurate results and a clearer distinction between the methods. However, it may lead to other runtime-related complications, since the A* algorithm requires a lot of memory.
  • Walter D. Pullen, Maze Classification, last updated March 22, 2021
    • This publication gives a great overview of a large list of maze creation and solving algorithms, along with some analysis and comparisons. Just like my source #1, I would like to use it to look deeper into some approaches and narrow down my options for maze generation, as well as solving.
      • Provides a large list of maze classifications based on dimension, hyperdimension, topology, routing, focus, and others
      • Provides possible maze generation and solving algorithms for each of the different types of mazes mentioned above
  • Jamis Buck, HTML 5 Presentation with Demos of Maze generation Algorithms
    • These slides explain, in fairly simple terms, how mazes work as spanning trees and how they are generated. They give an overview of different types of mazes, drawing on some graph theory, and discuss generating spanning trees without bias by comparing two different algorithms. I found this publication especially useful for deepening my understanding of mazes before I work on more complex problems in this area.
      • Discussing mazes as spanning trees and comparing 3 different algorithms to create a uniform spanning tree
      • Interactive interface to help get a deeper understanding of each of these algorithms
      • Distinguishes between biased and unbiased mazes and provides algorithms for generating both
        • It seems that generating a biased maze is much faster
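
Several of the sources above center on A*. The following is a minimal sketch of A* on a 4-connected grid, not an implementation taken from any of the cited papers. It assumes that 0 marks an open cell and 1 a blocked cell, that every move costs 1, and that the heuristic is the Manhattan distance. The `astar` function name is hypothetical.

```python
import heapq


def astar(grid, start, goal):
    """Return the shortest path as a list of (row, col) cells, or None."""
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell):
        # Manhattan distance never overestimates for unit-cost 4-way moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(heuristic(start), 0, start)]
    came_from = {start: None}
    best_cost = {start: 0}
    while open_heap:
        _, cost, current = heapq.heappop(open_heap)
        if current == goal:
            # Walk the parent links back to the start.
            path = []
            while current is not None:
                path.append(current)
                current = came_from[current]
            return path[::-1]
        if cost > best_cost[current]:
            continue  # stale heap entry; a cheaper route was already found
        r, c = current
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                new_cost = cost + 1
                if new_cost < best_cost.get((nr, nc), float("inf")):
                    best_cost[(nr, nc)] = new_cost
                    came_from[(nr, nc)] = current
                    priority = new_cost + heuristic((nr, nc))
                    heapq.heappush(open_heap, (priority, new_cost, (nr, nc)))
    return None


if __name__ == "__main__":
    maze = [
        [0, 0, 0],
        [1, 1, 0],
        [0, 0, 0],
    ]
    print(astar(maze, (0, 0), (2, 0)))
```

The `best_cost` and `came_from` dictionaries grow with the number of cells explored, which is the memory cost that the sources above point out for larger mazes.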

Web application security:

  • Patrick Engebretson, The Basics of Hacking and Penetration Testing: Ethical Hacking and Penetration Testing, Second Edition, 2013, 2011, Elsevier Inc.
    • This book focuses on ethical hacking and penetration testing. So far I’ve looked into chapters 1, 6, and 7 since I believe these are the most relevant to my area of interest. Chapter 1 introduces Kali Linux, a digital forensics and penetration testing tool I have used before, however, it also provides additional approaches and ideas I have not considered. Chapters 6 and 7 focus on the basics of web hacking including injection attacks that I want to focus on. What I like the most about this resource is that it gives examples, ways to practice, as well as possible applications.
      • Introduces various tools with tutorials on how to use them 
      • Has the structure of a textbook and is easy to follow
      • This is a great source for definitions and ethical hacking practice
  • Philipp Vogt, Florian Nentwich, Nenad Jovanovic, Engin Kirda, Christopher Kruegel, Giovanni Vigna, Cross-Site Scripting Prevention with Dynamic Data Tainting and Static Analysis, Secure Systems Lab Technical University Vienna
    • This article introduces a cross-site scripting prevention method using dynamic data tainting and static analysis. Part 3 of the article explains how the method works and offers some implementation approaches. I believe it will be a valuable addition to my web application security tool since one of the main goals of this project will be analyzing and comparing various attacks and their prevention mechanisms.
      • Includes discussion of both server-side and client-side protection
      • It provides Static, as well as Dynamic data tainting analysis and compares their behaviors in terms of advantages and disadvantages, which will be useful to consider in my implementation
  • Hassanshahi B., Jia Y., Yap R.H.C., Saxena P., Liang Z. (2015) Web-to-Application Injection Attacks on Android: Characterization and Detection. In: Pernul G., Y A Ryan P., Weippl E. (eds) Computer Security — ESORICS 2015. ESORICS 2015. Lecture Notes in Computer Science, vol 9327. Springer, Cham. 
    • This article focuses specifically on web-to-app injection attacks on Android devices and presents a W2AI (web-to-app injection) scanner that detects vulnerabilities. It provides a lot of new information about web application hacking on Android systems and how they differ from other systems, gives examples of injection attacks, and categorizes various W2AI vulnerabilities. From part 3 on, we learn how the mechanism works, from identifying a problem to solving it and evaluating the results. Although I was not initially planning to narrow my focus to web application security on Android, this is an extremely interesting yet underexplored area, so I believe it might be a good focus for this project.
  • This article classifies SQL injection attacks and provides some countermeasures. Part 2 introduces some injection mechanisms and application examples that I believe would be useful for ethical hacking tests, while part 5 shows prevention mechanisms. Unlike other sources I have looked into, this article discusses not only preventing data theft after detecting the attack, but also defensive coding practices, including input type checking, encoding of inputs, positive pattern matching, and others, which I believe will be extremely useful in creating a secure web application.
  • This publication focuses specifically on preventing SQL injection attacks. Of all my sources, this one seems to be the closest to what I had in mind for this project. The author starts off by writing a simple CGI application that allows the user to inject SQL into a “where” clause that expected an account ID. Without any validation, the user is able to retrieve information about all possible accounts. The author then introduces a solution using Instruction-Set Randomization, describes its strengths and weaknesses, and evaluates its performance. I believe this article gave me a more in-depth understanding of the possible risks and challenges associated with my project, and it can be used as a guide for structuring my approach.
  • Justin Clarke, SQL Injection Attacks and Defense, 2012, Elsevier, Inc.
    • This book focuses on SQL injection attacks and defense. It gives a broader understanding of what a SQL injection is and provides examples of incorrect handling that may expose a vulnerability in a web application. I looked into Chapters 1, 3, and 4, since they seemed most relevant to my area of interest. They provide examples of dangerous coding behaviors, various ways to analyze code, and common exploit techniques. This book provides broad and detailed information about many aspects of SQL injection, and I think it will be a useful resource whenever I feel lost or confused about specific ideas in this area.
  • Jesse Burns, Cross-Site Request Forgery, ©2005, 2007, Information Security Partners, LLC. Version 1.2
    • This article introduces Cross-Site Request Forgery. While similar to Cross-Site Scripting, it is a separate security risk, and this publication describes some differences between the two. The final part of the article introduces 5 different protection approaches, along with their advantages and disadvantages. While CSRF is not the main focus of my project, it is closely related to XSS, so I would like to explore it if time permits. This article gives me a good understanding of some basic CSRF attacks and prevention mechanisms.
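
The SQL injection sources above describe a query whose "where" clause expects an account ID. The sketch below reproduces that situation with Python's built-in `sqlite3` module and contrasts a vulnerable lookup with a safer one. The table layout, the function names, and the sample data are assumptions for illustration. The safer version combines input type checking, which one of the sources mentions, with parameter binding, which is a standard defence added here and is not the Instruction-Set Randomization approach from the cited publication.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a small, made-up accounts table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, owner TEXT)")
    conn.executemany("INSERT INTO accounts (id, owner) VALUES (?, ?)",
                     [(1, "alice"), (2, "bob")])
    return conn


def lookup_unsafe(conn, account_id):
    # Vulnerable: user input is concatenated directly into the SQL text.
    query = "SELECT owner FROM accounts WHERE id = " + account_id
    return [row[0] for row in conn.execute(query)]


def lookup_safe(conn, account_id):
    # Input type checking: reject anything that is not a plain number.
    if not account_id.isdigit():
        raise ValueError("account id must be numeric")
    # Parameter binding keeps the value out of the SQL text entirely.
    query = "SELECT owner FROM accounts WHERE id = ?"
    return [row[0] for row in conn.execute(query, (int(account_id),))]


if __name__ == "__main__":
    conn = build_demo_db()
    print(lookup_unsafe(conn, "1 OR 1=1"))  # leaks every account owner
    print(lookup_safe(conn, "1"))
```

Passing `"1 OR 1=1"` to `lookup_safe` raises `ValueError` instead of reaching the database, while the unsafe version returns every row.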

CS-388 Pitches

  • I want to create a tool that will protect a web application from malicious attacks. As a start, I would like to build a simple website where multiple users are able to sign up, authenticate, and store information. Then, I will attempt ethical hacking using injection attacks, like cross-site scripting (XSS) and SQL injection (SQLi). If time permits, I will also explore URL manipulation, session-based attacks, cross-site request forgery, and cookie hijacking. As a result, I will discover possible threats and vulnerabilities and improve data encryption, authorization, and access control.
  • I want to write software that will generate an n x n or an n x m maze, and find a way out of it. The maze will be represented as a binary matrix where 0 and 1 indicate whether a certain block can be used or not. At the same time, I would also like to analyze differences in memory consumption and runtime between cases where we are able to move in all four directions or only certain directions. 
  • I plan to work on a Japanese natural language processing tool. Japanese, unlike many other languages, does not use spacing between meaningful parts of a sentence. With this application, I hope to help improve the language learning process for students. Since I have already done some work in this area, I have a working piece of software that successfully separates words, although it still makes some mistakes and has inconsistencies. I want to improve the existing code and add more features, such as a user-friendly interface and the display of word translations (using an external dictionary), parts of speech, and different readings. At the same time, I would like to provide an estimate of the reader’s language comprehension level. I have collected some datasets by web-scraping an online dictionary, which I plan to use for this project.
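
For the maze pitch above, a breadth-first search over the binary matrix is one simple way to find a way out and to compare movement rules. The sketch below is only an illustration. It assumes that 0 marks a usable block and 1 a blocked one, which the pitch leaves open, and the move sets and the `solve_maze` function are hypothetical names. The number of discovered cells serves as a rough stand-in for memory use.

```python
from collections import deque

# Two hypothetical movement rules to compare.
FOUR_WAY = ((1, 0), (-1, 0), (0, 1), (0, -1))
RIGHT_DOWN = ((1, 0), (0, 1))


def solve_maze(maze, start, goal, moves=FOUR_WAY):
    """Return (steps_to_goal, cells_discovered); steps is None if unreachable."""
    rows, cols = len(maze), len(maze[0])
    distance = {start: 0}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        if (r, c) == goal:
            return distance[(r, c)], len(distance)
        for dr, dc in moves:
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and maze[nr][nc] == 0 and (nr, nc) not in distance):
                distance[(nr, nc)] = distance[(r, c)] + 1
                queue.append((nr, nc))
    return None, len(distance)


if __name__ == "__main__":
    maze = [
        [0, 0, 0],
        [1, 1, 0],
        [0, 0, 0],
    ]
    print(solve_maze(maze, (0, 0), (2, 0)))
    print(solve_maze(maze, (0, 0), (2, 0), RIGHT_DOWN))
```

In this small maze the exit at the bottom-left is reachable with four-way movement but not when only right and down moves are allowed, which is the kind of difference the pitch proposes to measure.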

3 Pitches

Pitch #1

At some point, we have all received a suspicious message stating that we are being watched, or an annoying pop-up insisting that our devices are riddled with viruses. I seek to find out how often, and by what means, people are truly being attacked on their smart devices. Because many smart devices do not offer robust cybersecurity, they are more vulnerable to attack than other devices such as computers. This software will provide insight into the presence of hackers and malware on smart devices by gathering data on the types of attacks to be wary of.

Pitch #2

New technologies have been evolving to aid life within the home. Video doorbells, cameras, and smart devices make many tasks much simpler than they used to be. However, the risk that someone with malicious intent will hack and harm your home network has also increased, and a security failure could expose all of your personal information. As a result, many organizations provide VPN services as a means of protecting people from malicious hackers and malware. However, these VPNs come with drawbacks such as higher cost and limitations dictated by the provider, and paid services place you in the hands of the operator and its various cloud/network providers with no certainty that these providers will not snoop around in your data.

A VPN server that a user can host on their local machine addresses these problems, with the added benefit that the user can securely access and maintain their home network. The server will run in a virtual machine, leaving the user in complete control of it and its functions. This should increase the efficiency of the VPN, as the user no longer has to go through the provider's network. My goal is to automate and open-source this process, creating a VPN server that an average user can easily launch and use to maintain access to their home network, while still being able to edit and change it for more robust security. I seek to compare this to similar paid services to identify which is more secure for the user.

Pitch #3

Many people have been contacted by a scam caller, and while most have the common sense to recognize the scam, thousands of Americans fall victim to such scams and end up paying a huge price for their mistake. While many assume these victims are mainly elderly, research has shown that people likely to fall for scams span a broad range of age groups, with the elderly being scammed for more money and the young being scammed more frequently. To address this issue, I seek to create a real-time speech and text recognition answering bot that can answer phone calls from unknown numbers and, through certain verbal cues, deduce whether or not the person on the other end is a scammer. With this bot, I will be able to gather data on the most common types of scams and improve upon existing scam-blocking software.

CS 488: Senior Capstone – Final Submission


Voting is an important Ensemble Learning technique. However, there has not been much discussion about leveraging the base classifiers’ consensus on unlabeled data to better inform the final prediction. My proposed method identifies the data points in the unlabeled space where the ensemble reaches consensus and where conflict arises. A meta weighted KNN model is then trained on this half-labeled set, in which the consensus points carry the ensemble’s agreed label and the conflict points are marked as “Unknown,” which is treated as a new, additional class. The predictions of the meta model are expected to better inform the decision of the ensemble in the case of conflict. This research project aims to implement my proposed method and evaluate it on a range of benchmark datasets.
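
The labelling step described above can be sketched as follows. This is an illustrative reading of the abstract, not the project's actual code. It assumes each base classifier has already produced one predicted label per unlabeled point, and the function names are hypothetical. Training the weighted KNN meta model on the result is not shown.

```python
from collections import Counter

UNKNOWN = "Unknown"


def label_unlabeled_points(predictions_per_classifier):
    """Build the half-labeled set from base classifier predictions.

    predictions_per_classifier holds one list of predicted labels per base
    classifier, all of the same length (one entry per unlabeled point).
    """
    labels = []
    for votes in zip(*predictions_per_classifier):
        if len(set(votes)) == 1:
            labels.append(votes[0])  # consensus: keep the agreed label
        else:
            labels.append(UNKNOWN)   # conflict: mark as the extra class
    return labels


def majority_vote(votes):
    """Plain voting baseline, shown for contrast with the consensus split."""
    return Counter(votes).most_common(1)[0][0]


if __name__ == "__main__":
    predictions = [
        ["a", "b", "a"],  # classifier 1
        ["a", "b", "b"],  # classifier 2
        ["a", "b", "a"],  # classifier 3
    ]
    print(label_unlabeled_points(predictions))
```

In this example the first two points are consensus points and the third is a conflict point, where plain majority voting would simply answer "a".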




Senior Capstone

Chest X-Ray Abnormalities Detection with a focus on Infiltration


Reading chest X-rays is an essential step in diagnosing lung and heart diseases. Since many people now believe that chest X-ray radiographs can detect COVID-19, the disease of the decade, the problem of chest X-ray abnormality detection has gained increasing attention from researchers. Numerous machine learning algorithms have been developed to address this problem in order to raise reading accuracy, improve efficiency, and save time for both doctors and patients. In this work, I propose a model that determines whether a chest X-ray image shows Infiltration and detects the abnormalities in that image using YOLOv3. The model will be trained and tested on the VinBigData dataset. Overall, I will apply an existing tool, YOLOv3, to the new problem of detecting Infiltration in chest X-ray radiographs.


Software Architecture Diagram

CS488: Final Capstone Deliverables

Project abstract:

Heuristic evaluation is one of the most popular methods for usability evaluation in Human-Computer Interaction (HCI). However, most heuristic models only provide generic heuristic evaluation for the entire application as a whole, even though different parts of an application might serve users in different contexts, leading HCI practitioners to miss out on context-dependent usability issues. As a prime example, mobile search interfaces in e-learning applications have not received a lot of attention when it comes to usability evaluation. This paper addresses this problem by proposing a more domain-specific and context-dependent heuristic evaluation model for search interfaces in mobile e-learning applications. I will analyze studies on mobile evaluation heuristics, in combination with research in mobile search computing and e-learning heuristics, to generate a heuristic model for mobile e-learning search interfaces.

Software architecture diagram

Research paper

Software demonstration video

Senior Capstone: Cryptocurrency Price Prediction using Sentiment Analysis and Machine Learning

Predicting cryptocurrency price movements is a well-known problem of interest. In this modern age, social media reflects public sentiment about current events. Twitter especially has attracted a lot of attention from researchers who study public sentiment. Recent studies in natural language processing have developed automatic techniques for analyzing sentiment in social media content. This research is directed towards predicting the volatile price movements of cryptocurrency by analyzing sentiment on social media and finding the correlation between the two. Machine learning algorithms, including support vector machines and linear regression, will be used to predict prices. The most effective combination of machine learning algorithms and datasets will be determined.
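
The correlation step mentioned above can be sketched with a Pearson correlation between a daily sentiment score and the daily price return. This is a minimal illustration, not the project's pipeline. The function names and the sample numbers are made up, and the sentiment scores are assumed to come from a separate sentiment analysis step.

```python
from math import sqrt


def daily_returns(prices):
    """Relative price change from each day to the next."""
    return [(after - before) / before
            for before, after in zip(prices, prices[1:])]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 long")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up closing prices and sentiment scores, purely for demonstration.
    prices = [100.0, 110.0, 99.0, 105.0]
    sentiment = [0.6, -0.4, 0.2]  # one score per day-to-day change
    print(pearson(sentiment, daily_returns(prices)))
```

A coefficient near 1 or -1 would suggest a strong linear relationship, although correlation alone does not show that sentiment drives price.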

Software Architecture Diagram

Link to video tutorial:

Link to senior paper:

Jarred – Three Pitches

Pitch 1 – Develop an API for Sensor

I was hoping to work with sensors to develop an interface/app that would be more useful for
individuals’ projects. Say there was an inexpensive sensor that measured water quality but was
pretty rudimentary and did not already have a good API. I could make an app that would work
with that sensor for students or others wanting to do small experiments with water quality near
their house or school. I would work on programming and interfacing with the hardware sensor.
I’m not entirely sure of the dataset I would use or need for this. I could use one that already
exists to work on different graphing techniques in the API.

Pitch 2 – Research into Feasibility and Practicality of Low-Cost Portable Air Quality Sensor and Network with Smartphone Connection

Similar to the sensor pitch I did before, this would be working with air sensors that could be put
into a compact container that would be attached to someone. The sensors would be built into an
Arduino, Charlie said he would show me some sensors and how to use the Arduino sometime
this week when he is back on campus. The air sensors would constantly read for different
dangerous compounds in the air and would notify the user through a smartphone app and track
different amounts in different places. The sensor would not track location, but the app on the
phone would because the sensor would probably be connected to the phone through Bluetooth or
some other possible wireless connection and therefore have to be near the phone to work. This
would be very useful for emergency responders and for maintenance people who work with
boilers and furnaces.

Pitch 3 – Land-Based Oil Spill Modeling

I was thinking about doing something within the realm of climate change simulations. Maybe
something along the lines of the damages of an oil pipeline bursting and then being carried
inland by a flood that was caused by excess rainfall due to climate change. Or perhaps a wildfire
simulation or smoke pollution simulation from said wildfires. This would just help to get an idea
of how we could mitigate these catastrophes while the risk is still present. It would require a lot
of machine learning and GIS information.

Lam – CS388 Three Pitches

with No Comments

Idea 1: Detect and predict mental health status using NLP

The objective of this research is to determine whether NLP (Natural Language Processing) is useful for mental health treatment. Data sources vary from traditional sources (EHRs/EMRs, clinical notes and records) to social sources (tweets and posts from social media platforms). The main data source will be social media posts because they enable immediate detection of users’ possible mental illness. The posts are preprocessed by extracting the key words to form an artificial paragraph that contains only numbers and characters. Then, an NLP model, built on genuine and artificial data, will test the real (raw/original) data and compare it with the artificial paragraph above. The output will be the patients’ mental health status after the testing.

Libraries such as scikit-learn and Keras are frequently used in the literature. The dataset I would like to experiment with is the suicide and depression dataset from Reddit posts starting at \r.

Idea 2: Using Machine Learning Algorithms to predict air pollution

Machine learning algorithms can predict air pollution based on the amount of air pollutants in the atmosphere, measured by the level of particulate matter (we will look specifically at PM2.5, the air pollutant most dangerous to human health). Given data indicating the PM2.5 level in different areas, the data first has to be preprocessed. Then, it will be run through several machine learning models, notably random forest and linear regression because of their simplicity and their use in regression and classification problems, to produce the best model for forecasting the PM2.5 level in the air.

Python and its library scikit-learn are the most common tools to train models. For this idea, I would like to use a dataset measuring the air quality of 5 Chinese cities, which is available on Kaggle.

Idea 3: Machine Learning in music genre classification

Given an input (dataset) of thousands of songs from different genres such as pop, rock, hip hop/rap, country, etc., the audio will be processed by reducing/filtering noise in the sound wave. Then, the neural network model will take those audio inputs and images of their spectrograms to differentiate several objects as its outputs. Eventually, the KNN method, chosen for its simplicity and robustness with large noisy data, will test the music from the model outputs and classify the songs into their respective genres.

Python is central to the research because of the wide range of libraries that support it, such as Keras, scikit-learn, TensorFlow, and NumPy. For this idea, I would like to use the Spotify dataset thanks to its rich collection of songs.

with 1 Comment

Pitch 1

Social media sentiments to predict mental health of people during the stages of pandemic in the United States

  1. I think it is important to know how people are feeling, whether they are hopeful, devastated, excited, or feeling fine during the pandemic.
  2. With new variants showing up, people’s sentiments may keep changing. So, I want to see if virus variants, mask mandates loosening up, or vaccine rollout had an impact on people’s sentiments. For example, people were probably being hopeful and excited after getting vaccinated but the delta variant may have increased the negative sentiments.
  3. I plan to use Twitter data with a sentiment analysis machine learning model and then visualize the data with interactive charts
  4. I will be using Twitter’s API code and collect the tweets with #covid19, #pandemic, and other covid related hashtags (#).
  5. Timeline of the covid-19 event (From the timeline below I will only pick certain events):
  6. Resources:

Pitch 2

Social media sentiments to predict the covid situation and vaccine rollout in different countries

  1. Why is this important?
    • In the US, people may be more hopeful and excited since the vaccination rate is going up and mask mandates are getting removed.
    • However, there are parts of the world where the majority of the population is not vaccinated, and a lot of people are dying of covid.
    • So, it would be interesting to predict the parts of the world with higher vaccination rates and better situations and parts of the world with lower vaccination rates and worse situations through Twitter sentiments.
  2. I plan to use Twitter data with a sentiment analysis machine learning model and then visualize the data with interactive charts
  3. I will be using Twitter’s API code and collect the tweets with #covid19, #pandemic, and other covid related hashtags (#) and categorize these according to country names.
  4. I might cherry-pick 5-6 countries only.
  5. Resources:
Pitch 3

Predicting the tourist volume and touristic behavior with search engine data

  1. To be able to predict the volume of tourists arriving at a certain destination is important in order to plan and make available adjustments according to the demand.
  2. Especially during this global pandemic, it is very helpful to know the volume of the tourist arriving at the destination so that there can be proper and enough arrangements for testing, quarantining, and knowing whether the area is going to become a high risk for covid.
  3. I will be using google trends data, and use keywords related to tourism to predict the volume of tourists.
  4. I will be doing this for only 5 countries.
  5. To see if my predictions are correct I will test them with available data from the past.
  6. Data for tourism volume -> l-tourist-arrivals/


CS 388: Three pitches

with No Comments

Cryptocurrency price prediction

This project would predict cryptocurrency prices using deep learning. As the popularity of cryptocurrency grows and the money flowing into it increases, cryptocurrencies are becoming more volatile and their patterns are changing. One of the challenges is that, unlike the stock market, cryptocurrencies depend on factors such as their technological progress and internal competition. I plan to get data from news agencies about specific tokens, along with data on all price changes in crypto since 2012, to predict future prices. According to my research, this would require the use of LSTM neural networks. Some of the places this data can be found are the Cryptocompare API (XEM and IOT historical prices at hourly frequency), the Pytrends API (Google News search frequency of the phrase “cryptocurrency”), and scraped subscription growth for the subreddits “CryptoCurrency,” “Nem,” and “Iota.” One approach to predicting the price is identifying a cointegrated pair. The series also needs to be made stationary: differencing is a popular method for stationarizing a time series because it can remove trends and seasonality, and this project takes the difference of consecutive observations.
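
To make the differencing step concrete, here is a minimal sketch in plain Python. The price values are made up for illustration, and the `difference` helper is a hypothetical name, not part of any library mentioned in the pitch.

```python
def difference(series, lag=1):
    """Return the lag-differenced series: y[t] = x[t] - x[t - lag]."""
    if lag < 1:
        raise ValueError("lag must be a positive integer")
    return [series[i] - series[i - lag] for i in range(lag, len(series))]


# Hypothetical hourly closing prices with a steady upward trend.
prices = [100.0, 102.0, 104.5, 106.0, 109.0]
changes = difference(prices)
print(changes)  # [2.0, 2.5, 1.5, 3.0]
```

The differenced series no longer carries the upward trend of the raw prices, which is why it is usually a better input for a forecasting model. A larger `lag` (for example 24 on hourly data) could be used to remove a daily seasonal pattern in the same way.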

Gender and age detection using deep learning

This would predict the age and gender of a person from a picture or a live webcam view. The predicted gender may be one of ‘Male’ and ‘Female’. It is very difficult to accurately guess an exact age from a single image because of factors like makeup, lighting, obstructions, and facial expressions. I will be using the Adience dataset, which is available in the public domain. This dataset serves as a benchmark for face photos and covers various real-world imaging conditions like noise, lighting, pose, and appearance. As multiple studies have already been done on this topic, the project can focus on the factors that affect the efficiency of the program.

Forest fire detection using k-clustering

This model would detect forest fires using the Keras deep learning library. As seen recently in places such as the Amazon rainforest and a large part of Australia, wildfires are increasing in this era. These disasters damage the ecosystem by destroying habitat and releasing carbon dioxide. This project can be built using k-means clustering. The model would identify forest fire hotspots along with the intensity of the fire at each spot, which it would use to decide whether or not it is a wildfire. Another way of building this is to use a neural network such as MobileNet-V2 or U-Net, which is more efficient, and I will be researching this further. A data set of over 1300 images would be used to detect wildfires.  The data for this project can be found at

CS 388: The first three pitches

with No Comments

A new voting ensemble scheme

  • Voting is a popular technique in Machine Learning to aggregate the predictions of multiple models and produce a more robust prediction. Some of the most widely used voting schemes are majority voting, rank voting, etc. I hope to propose a new voting system that particularly focuses on solving the issue of overfitting by using the non-conflict data points to inform the prediction of data points where conflict does arise.
  • Evaluation: 
    • Compare its overall performance with that of the popular voting schemes 
    • Examine ties → see if it’s better than flipping a coin 
    • Apply statistical hypothesis testing to these analyses
  • Possible datasets to work with:
    • Any dataset whose target is categorical (i.e. it’s a classification problem). Preferably, the features are numerical and continuous.  
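
As a baseline for the comparison, simple majority voting and the split between conflict and non-conflict data points could be sketched as follows. The function names and the toy predictions are assumptions for illustration; the new scheme itself is not shown because the pitch does not specify it.

```python
from collections import Counter


def majority_vote(votes):
    """Return (winning_label, is_tie) for one data point's votes."""
    ranked = Counter(votes).most_common()
    top_label, top_count = ranked[0]
    is_tie = len(ranked) > 1 and ranked[1][1] == top_count
    return top_label, is_tie


def split_by_conflict(per_model_predictions):
    """Return indices where all models agree and where they disagree."""
    agreed, conflicted = [], []
    for index, votes in enumerate(zip(*per_model_predictions)):
        (agreed if len(set(votes)) == 1 else conflicted).append(index)
    return agreed, conflicted


# Hypothetical predictions from three models on three data points.
predictions = [
    ["cat", "dog", "dog"],
    ["cat", "cat", "dog"],
    ["dog", "cat", "dog"],
]
ensemble = [majority_vote(votes)[0] for votes in zip(*predictions)]
agreed, conflicted = split_by_conflict(predictions)
print(ensemble)            # ['cat', 'cat', 'dog']
print(agreed, conflicted)  # [2] [0, 1]
```

The `is_tie` flag marks the cases that the evaluation would compare against flipping a coin.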

Comparing the performance of the different Hyperparameter Tuning methods

Hyperparameter Tuning is an important step in building a strong Machine Learning model. However, the hyperparameter space grows exponentially and the interaction among the hyperparameters is often nonlinear, which limits the number of feasible methods for finding a more optimal set of hyperparameters. I plan to examine some of the most common methods used to tackle this problem and compare their performance:

  • Grid Search
  • Random
  • Sobol (hybrid of the two aforementioned methods)
  • Bayesian Optimization

To many people’s surprise, the Random brute-force technique sometimes outperforms the Grid Search method. My project aims to verify this claim by applying the techniques above to a range of benchmark datasets and prediction algorithms. 
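
A toy comparison of the first two methods under the same evaluation budget might look like the sketch below. The `score` function stands in for a real validation score and is entirely hypothetical, as are the parameter names and ranges.

```python
import itertools
import random


def score(learning_rate, depth):
    """Toy validation score (higher is better), peaking at lr=0.03, depth=6."""
    return -((learning_rate - 0.03) ** 2) * 1000 - ((depth - 6) ** 2) * 0.01


def grid_search(lr_values, depth_values):
    """Evaluate every combination on a fixed grid and return the best one."""
    candidates = itertools.product(lr_values, depth_values)
    return max(candidates, key=lambda c: score(*c))


def random_search(lr_range, depth_range, budget, seed=0):
    """Evaluate `budget` randomly sampled combinations and return the best."""
    rng = random.Random(seed)
    candidates = [
        (rng.uniform(*lr_range), rng.randint(*depth_range))
        for _ in range(budget)
    ]
    return max(candidates, key=lambda c: score(*c))


# Both searches get 9 evaluations.
grid_best = grid_search([0.001, 0.01, 0.1], [2, 5, 8])
random_best = random_search((0.001, 0.1), (2, 8), budget=9)
print("grid:  ", grid_best, score(*grid_best))
print("random:", random_best, score(*random_best))
```

The grid can only ever test three distinct learning rates, while random search tests nine, which is one intuition for why random search sometimes wins when only a few hyperparameters matter.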

The referee classifier 

This is another Voting ensemble scheme. We pick out the classifier that performs the best under conflict and give it the role of the referee to solve “dispute” among the classifiers. The same principle can be used for breaking ties but we can also try removing the classifier that performs the worst under conflict. 

We can try out a diverse set of classification algorithms like Decision Tree, Support Vector Machine, KNN, Naive Bayes, Logistic Regression, etc. and run them on the benchmark datasets from UCI. This proposed voting scheme can then be compared against the more common Simple Majority Voting Ensemble approach in terms of accuracy and other performance metrics.
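
One possible reading of the referee idea is sketched below: the referee is the model with the most correct predictions on the validation points where the models disagree, and it decides every disputed point. Whether the referee should decide all conflicts or only ties is left open in the pitch, so this is an assumption, as are the function names and the toy data.

```python
def pick_referee(per_model_predictions, true_labels):
    """Return the index of the model most accurate on conflicted points."""
    conflicted = [
        i for i, votes in enumerate(zip(*per_model_predictions))
        if len(set(votes)) > 1
    ]

    def conflict_accuracy(model_index):
        preds = per_model_predictions[model_index]
        return sum(preds[i] == true_labels[i] for i in conflicted)

    return max(range(len(per_model_predictions)), key=conflict_accuracy)


def referee_vote(votes, referee_index):
    """Unanimous votes pass through; otherwise the referee decides."""
    if len(set(votes)) == 1:
        return votes[0]
    return votes[referee_index]


# Hypothetical validation predictions from three classifiers.
validation_predictions = [
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [1, 1, 1, 1],
]
validation_truth = [1, 1, 1, 1]
referee = pick_referee(validation_predictions, validation_truth)
print(referee)                           # 2
print(referee_vote((1, 1, 0), referee))  # 0
```

The last call shows how this differs from simple majority voting: the referee overrides a two-to-one majority.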

Content-based Hashtag Recommendation Methods for Twitter

with No Comments


For Twitter, a hashtag recommendation system is an important tool to organize similar content together for topic categorization. Much research has been carried out on developing new techniques for hashtag recommendation, but very little has been done on evaluating the performance of different existing models using the same dataset and the same evaluation metrics. This paper evaluates the performance of different content-based methods (tweet similarity using hashtag frequency, a Naïve Bayes model, and KNN-based cosine similarity) for hashtag recommendation using different evaluation metrics, including Hit Ratio, a metric recently created for evaluating hashtag recommendation systems. The results show that Naïve Bayes outperforms the other methods with an average accuracy score of 0.83.
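
As an illustration of the kind of metric involved, the sketch below computes a hit ratio under one common reading: a tweet counts as a hit when at least one of its top-k recommended hashtags appears among the hashtags the tweet actually used. The paper's exact definition may differ, and the hashtags shown are invented.

```python
def hit_ratio(recommended, actual, k=5):
    """Fraction of tweets with at least one correct hashtag in the top k."""
    if not actual:
        return 0.0
    hits = 0
    for recs, truth in zip(recommended, actual):
        if set(recs[:k]) & set(truth):
            hits += 1
    return hits / len(actual)


# Hypothetical recommendations and ground-truth hashtags for three tweets.
recommended = [["#ai", "#ml"], ["#nba", "#sports"], ["#food"]]
actual = [["#ml"], ["#music"], ["#food", "#vegan"]]
print(hit_ratio(recommended, actual))       # 2 of 3 tweets are hits
print(hit_ratio(recommended, actual, k=1))  # 1 of 3 tweets are hits
```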

Software Architecture Diagram

Fig 1: Software Architecture Diagram

Pre-processing Framework

Evaluation Model

Evaluation Results

Design of the Web Application

Senior Capstone: Cancer metastasis detection using CNNs and transfer learning

with No Comments


Artificial Intelligence (AI) has been used extensively in the field of medicine. More recently, advanced machine learning algorithms have become a big part of oncology as they assist with detection and diagnosis of cancer. Convolutional Neural Networks (CNN) are common in image analysis and they offer great power for detection, diagnosis and staging of cancerous regions in radiology images. Convolutional Neural Networks get more accurate results, and more importantly, need less training data with transfer learning, which is the practice of using pre-trained models and fine-tuning them for specific problems. This paper proposes utilizing transfer learning along with CNNs for staging cancer diagnoses. Randomly initialized CNNs will be compared with CNNs that used transfer learning to determine the extent of improvement that transfer learning can offer with cancer staging and metastasis detection. Additionally, the model utilizing transfer learning will be trained with a smaller subset of the dataset to determine if using transfer learning reduced the need for a large dataset to get improved results.

Software architecture diagram

Links to project components

Link to complete paper:

Link to software on gitlab:

Link to video on youtube:

Link to copy of poster:

Senior Capstone

with No Comments



Software overview


Writer identification based on handwriting plays an important role in forensic analysis of the documents. Convolutional Neural Networks have been successfully applied to this problem throughout the last decade. Most of the research that has been done in this area has concentrated on extracting local features from handwriting samples and then combining them into global descriptors for writer retrieval. Extracting local features from small patches of handwriting samples is a reasonable choice considering the lack of big training datasets. However, the methods for aggregating local features are not perfect and do not take into account the spatial relationship between small patches of handwriting. This research aims to train a CNN with triplet loss function to extract global feature vectors from images of handwritten text directly, eliminating the intermediate step involving local features. Extracting global features from handwriting samples is not a novel idea, but this approach has never been combined with triplet architecture. A data augmentation method is employed because training a CNN to learn the global descriptors requires a large amount of training data. The model is trained and tested on CVL handwriting dataset, using leave-one-out cross-validation method to test the soft top-N, hard top-N performance.

Software Architecture


I’m using CVL writer database to train the model. All handwriting samples go through the data augmentation and pre-processing step to standardize the input for CNN. The samples in the training set get augmented, whereas only one page is produced per sample for the test set. The triplets of samples are chosen from each batch to train the CNN. The output of the CNN is a 256D vector. In order to evaluate the model, we build a writer database for samples in the test set.

Data Augmentation

Each handwriting sample goes through the same set of steps:
1. Original handwriting sample.

2. Sample is segmented into words.

3. The words from a single sample are randomly permuted into a line of handwriting. The words are centered vertically.

4. Step 3 is repeated L times to get L lines of handwriting. These lines are concatenated vertically to produce a page.

5. A page is then broken up into non-overlapping square patches. The remainder of the page is discarded. The resulting patches are resized to 224×224 pixels.

6. Steps (4) and (5) are repeated N times.

7. Finally we apply binarization. The patches are thresholded using adaptive Gaussian Thresholding with 37×37 kernel.
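
The line, page, and patch steps can be sketched without any imaging library by treating words as tokens and patches as pixel boxes. This is only an outline under stated assumptions: the helper names are hypothetical, and the patch size before resizing (448 here) is not given in the text.

```python
import random


def make_page(words, num_lines, rng):
    """Build a page as num_lines random permutations of the sample's words."""
    return [rng.sample(words, len(words)) for _ in range(num_lines)]


def patch_boxes(page_width, page_height, patch_size):
    """Return (left, top, right, bottom) boxes of non-overlapping patches.

    Any strip at the right or bottom edge that cannot fit a full patch
    is discarded.
    """
    return [
        (x, y, x + patch_size, y + patch_size)
        for y in range(0, page_height - patch_size + 1, patch_size)
        for x in range(0, page_width - patch_size + 1, patch_size)
    ]


rng = random.Random(0)
page = make_page(["the", "quick", "brown", "fox"], num_lines=3, rng=rng)
boxes = patch_boxes(page_width=1000, page_height=500, patch_size=448)
print(len(page), "lines per page")
print(boxes)  # [(0, 0, 448, 448), (448, 0, 896, 448)]
```

In a real pipeline each box would be cropped from the page image, resized to 224×224, and then binarized; calling `make_page` again with the same words yields a different page, which is what makes the augmentation repeatable N times.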

CNN framework

The CNN model consists of 3 convolutional blocks followed by a single fully connected layer. Each convolutional block includes 2D convolution, batch normalization, max pooling and dropout layers. The final 256D output vector is normalized. I implemented this CNN framework in Keras with a TensorFlow backend.

The model was trained for 15 epochs with batch gradient descent and the Adam optimizer, with an initial learning rate of 3e-4. 10 epochs of training with semi-hard negative triplet mining were followed by 5 epochs of hard negative triplet mining.
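
For readers unfamiliar with triplet mining, the sketch below shows the usual definitions in plain Python: the triplet loss for one (anchor, positive, negative) triple, and how a negative is classed as hard or semi-hard. The margin of 0.2 and the use of non-squared Euclidean distance are assumptions, not values taken from this project, and the 2D vectors stand in for the 256D embeddings.

```python
import math


def distance(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is farther than the positive by the margin."""
    return max(0.0, distance(anchor, positive) - distance(anchor, negative) + margin)


def negative_type(anchor, positive, negative, margin=0.2):
    """Classify a negative sample relative to the anchor-positive distance."""
    d_pos = distance(anchor, positive)
    d_neg = distance(anchor, negative)
    if d_neg < d_pos:
        return "hard"        # negative is closer than the positive
    if d_neg < d_pos + margin:
        return "semi-hard"   # farther than the positive, but inside the margin
    return "easy"            # already contributes zero loss


anchor, positive = (0.0, 0.0), (0.3, 0.0)
for negative in [(0.2, 0.0), (0.4, 0.0), (1.0, 0.0)]:
    print(negative_type(anchor, positive, negative),
          round(triplet_loss(anchor, positive, negative), 3))
```

Starting with semi-hard negatives and switching to hard negatives later is a common way to keep early training stable, since hard negatives alone can push the embeddings to collapse.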

Using Online Sentiment to Predict Stock Price Movement

with No Comments

CS488: Senior Capstone Project

-Muskan Uprety


Stock Prediction, Sentiment Analysis, Stock Price Direction, Social Media Sentiment

1. Abstract

Due to the current COVID-19 pandemic, all the current models for investment strategies that were used to predict prices could become obsolete as the market is in new territory that has not been observed before. It is essential to have some predictive and analytical ability even in times of a global pandemic, as smart investments are crucial for securing people’s savings. Due to the recent nature of this crisis, there is limited research on tapping the predictive power of market sentiment at a time when many people are deprived of extracurricular activities and have thus turned their focus to capital markets. This research finds evidence of market sentiment influencing stock prices. Adding market sentiment to the classification improved the predictive power of the model compared to using just price and trend information. This shows that sentiment analysis can be used to build investment strategies, as it has influence over price movements in the stock market. This research also finds that looking at the sentiment of posts up to one hour into the past yields the best predictive ability for price movements.

Figure 1. Software Architecture Diagram

Link to Paper: Paper

Link to Source Code: GitLab

Link to Demonstration Video: YouTube

Link to Poster: Poster

Capstone Proposal

with No Comments

Stock Price Prediction using Online Sentiment

Muskan Uprety Department of Computer Science Earlham College Richmond, Indiana, 47374


Stock prediction, sentiment analysis, price direction prediction



Due to the current COVID-19 pandemic, all the current models for investment strategies that were used to predict prices could become obsolete as the market is in new territory that has not been observed before. It is essential to have some predictive and analytical ability even in times of a global pandemic, as smart investments are crucial for securing people’s savings. Due to the recent nature of this crisis, there is limited research on tapping the predictive power of various sectors in these unique times. This proposal aims to use texts from online media sources and analyze whether these texts hold any predictive power. The proposed research would gather the population’s sentiment from online textual data during this pandemic and use the data gathered to train a price prediction model for various stocks. The goal of this research is to check whether the prediction model can help build investment strategies that outperform the market.


The unpredictability of stock prices has been a challenge in the finance industry for as long as stocks have been traded. The goal of beating the market, pursued by stockbrokers and industry experts, has not materialized; however, the availability of technological resources has certainly opened a lot of doors to experimenting with different approaches to try and fully understand the drivers of stock prices. One of the approaches that has been used extensively to try and predict price movements is sentiment analysis. The use of this tool is predicated on the idea that stakeholders’ opinion is a major driving force of, and an indicator for, future stock prices. Indicators can be of two types: those derived from textual data (news articles, tweets, etc.), and those derived from numerical data (stock prices) [5].

Our research would focus on the textual data derived from sources like Twitter and online news media outlets to analyze whether the sentiments in those texts can be used to make price movement predictions in the short or long term. It is very important to have some analytical capabilities, especially during and after the COVID-19 pandemic, as the outbreak has caused the entire world to be in a shutdown, closing businesses indefinitely and burning through people’s savings. Some panic among consumers and firms has distorted usual consumption patterns and created market anomalies. Global financial markets have also been responsive to the changes, and global stock indices have plunged [10]. The effect of this disease has created a lot of shifts in the financial industry, along with an increase in volatility and uncertainty. Because the pandemic is so recent and ongoing, there is a lack of research on what factors are responsible for the high level of variance and fluctuations in the stock market. While there are papers that analyze sentiment in text to predict price movements, the applicability of those models in the context of a global pandemic is untested. It is vital to have some predictive ability in the stock market, as succumbing solely to speculation and fear could result in devastating loss of wealth for individual consumers and investors. The goal of this research is to:

• be able to predict human sentiment from social media posts,
• use sentiment to predict changes in price in the stock market,
• recommend investors to buy, sell, or hold stocks based on our model,
• beat a traditional buy and hold investment strategy.

This paper will first discuss some of the related work that was analyzed in making decisions for algorithms and data to use. After brief discussions about the models to be used in the research, the paper discusses the entire framework of the paper in detail and also discusses the costs associated with conducting this research if any. Finally, the paper proposes a timeline in which the research will be conducted and ends with the references used for this proposal.


The entire process of using sentiment to predict stock price movement is divided into two parts. The first is to extract the sentiment from the text source using various algorithms, and the second is to use this information as an input variable to predict the movement and direction of different stocks’ prices. Various researchers have used different algorithms and models for each of these steps. The differences between models are based on the overall goal of the research, the type of text data used for analysis, the input parameters used for the classifier, and the type of classification problem considered.

3.1 Sentiment analysis

The process of conducting sentiment analysis requires some form of text data to be analyzed, and labels associated with this textual data that allow the model to distinguish whether a collection of text is positive or negative in nature. It is quite difficult to find a data set that has been labelled by a human to train the model to make future predictions. Various researchers have used different measures to label their text data in order to train a model to predict whether an article or tweet is positive or negative (or neutral, but this is not very relevant).

There are a few approaches used by researchers to train their sentiment analysis models. Connor et al. use a pre-determined bag-of-words approach to classify their texts [12]. They use a word list consisting of hundreds of words that are designated positive or negative. If any word in the positive word list exists in the text, the text is considered positive, and if any word in the negative word list exists in the text, it is considered negative. Under this approach, an article or tweet can be considered to have both positive and negative sentiment.
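
A minimal sketch of this word-list rule is shown below. The two word lists are tiny, made-up stand-ins for the lists of hundreds of words used in the cited work, and the function name is hypothetical.

```python
import re

# Hypothetical word lists; the cited work uses lists with hundreds of words.
POSITIVE_WORDS = {"gain", "strong", "beat", "rally"}
NEGATIVE_WORDS = {"loss", "weak", "miss", "plunge"}


def lexicon_sentiment(text):
    """Flag a text as positive and/or negative if it contains a listed word."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    return {
        "positive": bool(tokens & POSITIVE_WORDS),
        "negative": bool(tokens & NEGATIVE_WORDS),
    }


result = lexicon_sentiment("Strong quarter, but shares plunge on guidance")
print(result)  # {'positive': True, 'negative': True}
```

The example text is flagged as both positive and negative, which is exactly the overlap the paragraph above describes.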

Gelbukh also uses a similar bag-of-words approach to analyze sentiment [4]. Instead of relying on pre-determined lists of positive and negative words, they look at the linguistic makeup of sentences, such as the position of opinion or aspect words in the sentence and the part of speech of opinions and aspects, to deduce the sentiment of the text.

In the absence of a bag of words for comparison, Antweiler and Frank manually labelled a small subset of messages as positive (buy), negative (sell), or neutral (hold) [1]. Using this as a reference, they trained their model to classify each message they had into one of these three categories. The researchers also noted that the messages and articles posted on various messaging boards came notably from day traders. So although the market price of stocks may reflect the sentiment of all market participants, the messaging boards most certainly do not.

While most researchers looked for either a pre-determined bag of words or a manually labelled data set, Makrehchi et al. automated the labelling of text documents [9]. If a company outperforms expectations or its stock price rises relative to the S&P 500, they assume that public opinion is positive; if the price falls relative to the S&P 500 or the company underperforms expectations, they assume it is negative. Each tweet is turned into a vector of mood words, where the column associated with a mood word becomes 1 if the word is mentioned in the tweet and 0 if it isn’t. This way, they train their sentiment analysis model with automated labels for tweets.
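
The mood-word encoding and the automated label can be sketched as follows. The mood-word list is invented for illustration, and the label rule is simplified to a single comparison against the benchmark return; the cited paper's actual vocabulary and thresholds are not given here.

```python
import re

# Hypothetical mood vocabulary; one vector column per word, in this order.
MOOD_WORDS = ["happy", "fear", "hope", "angry", "calm"]


def mood_vector(tweet, mood_words=MOOD_WORDS):
    """Return a 0/1 vector marking which mood words occur in the tweet."""
    tokens = set(re.findall(r"[a-z']+", tweet.lower()))
    return [1 if word in tokens else 0 for word in mood_words]


def auto_label(stock_return, benchmark_return):
    """Label tweets about a company by its return relative to a benchmark."""
    return "positive" if stock_return > benchmark_return else "negative"


vector = mood_vector("Happy investors hope for a calm week")
label = auto_label(stock_return=0.04, benchmark_return=0.01)
print(vector, label)  # [1, 0, 1, 0, 1] positive
```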

3.2 The Prediction Model

All the papers discussed in this proposal use the insights gained from text data collected from various sources. However, the ultimate goal is to check if there is a possibility to predict movement in stock market as a result of such news and tweets online. And researchers use different approaches to make an attempt at finding relationships between online sentiment and stock prices, and they are discussed in this section.

Makrehchi et al. collected tweets from people just before and after a major event took place for a company [9]. They used this information to label the tweets as positive or negative: if there was a positive event, the tweets were assumed to be positive, and if there was a significant negative event, they were assumed to be negative. The researchers use this information to train their model to predict the sentiment of future tweets. Aggregating the net sentiment of these predicted tweets, they make a decision on whether to buy, hold, or sell certain stocks. Accounting for weekends and other holidays, they used this classification model to predict the S&P 500 and were able to outperform the index by 20 percent in a period of 4 months.

Table 1: Papers and Algorithms Used

Paper: Makrehchi et al. [9]; Nguyen and Shirai [11]; Atkins et al. [2]; Axel et al. [6]; Gidófalvi [5]
Algorithm Discussed: time stamp analysis; TSLDA; K-means clustering; Naive Bayes

Nguyen and Shirai proposed a modified version of the Latent Dirichlet Allocation (LDA) model which captures topics and their sentiments in texts simultaneously [11]. They use this model to assess the sentiments of people on message boards. They also label each transaction as up or down based on the previous day’s price and then use all these features to predict whether the price of the stock will rise or fall in the future. For this classification problem, the researchers use the Support Vector Machine (SVM) classifier, which has long been recognized as being able to efficiently handle high-dimensional data and has been shown to perform well on many tasks such as text classification [4].

Atkins et al. used data collected from financial news and used traditional LDA to extract topics from each news article [2]. Their reasoning for using LDA is that the model effectively reduces features and produces a set of comprehensible topics. A Naïve Bayes classifier follows the LDA model to classify future stock price movements. The uniqueness of this research is that it aims to predict the volatility of stock prices instead of trying to predict closing prices or simply the direction of movement. They limit their prediction to 60 minutes after an article is published, as the effect of news fades as time passes.

Axel et al. collected tweets from financial experts, whom they defined as people who consistently tweet finance-related material, and then used this set of data for their model [6]. After pre-processing the data and reducing its dimensions, they experimented with both supervised and unsupervised learning methods to model and classify tweets. They use the K-means clustering method to find user clusters with shared tweet content. The researchers then used SVM to classify the tweets to make decisions on buying or selling a stock.

Gidófalvi labelled each transaction as up or down based on the previous day’s pricing [5]. They also labelled articles from the same time frame with the same labels. That is, if a stock price went up, articles that were published immediately before or after were also labelled as up. They then trained a Naïve Bayes text classifier to predict which movement class an article belonged to. Using the predicted class of each article in the test dataset, they predicted the price movement of the corresponding stock. The researchers, through experimentation, found predictive power for the stock price movement in the interval starting 20 minutes before and ending 20 minutes after news articles become publicly available.
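
Labelling an article by the price movement around its publication time could be sketched as below. This is a simplified reading of the approach, not the cited method itself: it compares the first and last price inside a symmetric window, and the tick data and function name are hypothetical.

```python
from datetime import datetime, timedelta


def label_article(published_at, price_ticks, window_minutes=20):
    """Label an article 'up' or 'down' from prices around its publication.

    price_ticks is a list of (timestamp, price) pairs. Returns None when
    fewer than two ticks fall inside the window.
    """
    window = timedelta(minutes=window_minutes)
    in_window = sorted(
        (time, price) for time, price in price_ticks
        if published_at - window <= time <= published_at + window
    )
    if len(in_window) < 2:
        return None  # not enough price data to infer a direction
    first_price, last_price = in_window[0][1], in_window[-1][1]
    return "up" if last_price > first_price else "down"


# Hypothetical intraday prices for one stock.
ticks = [
    (datetime(2020, 5, 1, 9, 30), 100.0),
    (datetime(2020, 5, 1, 10, 0), 101.0),
    (datetime(2020, 5, 1, 10, 15), 102.5),
    (datetime(2020, 5, 1, 11, 0), 99.0),
]
print(label_article(datetime(2020, 5, 1, 10, 5), ticks))   # up
print(label_article(datetime(2020, 5, 1, 12, 0), ticks))   # None
```

The labelled articles would then serve as training data for the text classifier.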

In terms of classifying text from online media, there were two types of approaches generally used by researchers. [5] and [9] have used the approach of classifying a text as positive if it was published around a positive event, and as negative if it was published around the time of some negative event. Their method doesn’t actually look at the contents of the text but rather at the time the text was published. On the other hand, [2], [6], and [11], as one would expect, look inside the text and analyze its contents to classify whether the authors of these texts intended a positive or negative sentiment. Although algorithms like LDA or the modified version discussed in [11] are the more intuitive approaches, the fact that classifying texts based on time also yields good results makes me wonder whether reading the text and spending the computational resources are actually necessary. Meanwhile, researchers seem to consistently agree on using SVM, as it is widely used in classification problems. Analyzing the various papers, we believe that SVM is the most effective classifier for this stock prediction problem. In the case of sentiment analysis, we believe that more experiments should be done to conduct a cost-benefit analysis of actually reading the text for sentiment analysis versus the potential loss of accuracy from just analyzing the time stamp of an article’s publication.

4.1 Framework

Figure 1: Prediction model framework

The framework of this project is shown in Figure 1. The entire project can be divided into two components. The first part uses the text data collected from various news articles, message boards and tweets to create a sentiment analysis model that is able to extract the average sentiment for individual companies. After that, we use the information from this model to train a price prediction classifier, which will predict the direction of stock prices. Based on this prediction, the overall output will be a recommendation of buy, sell or hold, which we will use to make investment decisions. The recommendation will be made based on predicted price, current price and purchase price. We will analyze the returns of the investment strategies suggested by our model and compare whether our profitability would be better than simply investing in and holding an index like the S&P 500.
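To make the final step of the framework concrete, here is a minimal sketch of how a buy, sell or hold recommendation could be derived from the three prices named above. The proposal does not specify the decision rule, so the function `recommend`, the 1% `threshold`, and the 10% `take_profit` level are hypothetical choices for illustration only.

```python
def recommend(predicted_price, current_price, purchase_price=None,
              threshold=0.01, take_profit=0.10):
    """Return "buy", "sell" or "hold" (illustrative rule, not the proposal's).

    purchase_price is None when the stock is not currently held.
    threshold and take_profit are hypothetical tuning parameters.
    """
    expected_return = (predicted_price - current_price) / current_price
    if expected_return > threshold:
        return "buy"  # a rise larger than the threshold is predicted
    if purchase_price is None:
        return "hold"  # nothing is held, so there is nothing to sell
    if expected_return < -threshold:
        return "sell"  # a fall larger than the threshold is predicted
    # No strong signal: sell only if the position already shows enough gain.
    unrealized = (current_price - purchase_price) / purchase_price
    return "sell" if unrealized >= take_profit else "hold"


print(recommend(105.0, 100.0))
print(recommend(95.0, 100.0, purchase_price=90.0))
print(recommend(100.5, 100.0, purchase_price=99.0))
```

Comparing the returns of such a rule against buying and holding an index would then only require applying it day by day to historical prices.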

4.2 Data Collection

In order to gather the sentiment from the entire market, we will diversify our sources of textual data. As Antweiler and Frank discovered, message board comments were heavily skewed towards day traders [1]; we assume the informal setting of Twitter attracts input from more traditional and individual investors. We also want to include news articles to capture the sentiment of stakeholders that may have been omitted by the other platforms mentioned. For stock prices, we are using Yahoo Finance, which other researchers have consistently used to capture stock prices across time. All of this data will be stored in a Postgres database management system.

4.3 Capturing Sentiment

For the purpose of categorizing the text documents as positive or negative in sentiment, we are going to compare the Rocchio algorithm [13] and the Latent Dirichlet Allocation (LDA) model [3]. These are the two most discussed methods and are the algorithms used by researchers to conduct binary classification of textual sentiment.

After cleaning the data by eliminating non-essential text (filtering texts using keywords), we will transform the data into the form needed by the models. Both algorithms use a bag-of-words approach, where we select a collection of words that are determined to capture sentiment in text. For the Rocchio algorithm, each text is converted into a vector that tracks the frequency of each selected word in the text, and this representation is used to predict sentiment. The basic idea of LDA, on the other hand, is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words [3].
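The bag-of-words representation and the Rocchio approach can be sketched as a nearest-centroid classifier: each class is summarized by the average of its training vectors, and a new text receives the label of the most similar centroid. The vocabulary, the training sentences, and all function names below are made up for illustration; a real vocabulary would be selected from the collected data.

```python
import math
from collections import Counter


def vectorize(text, vocabulary):
    """Count how often each vocabulary word occurs in the text (bag of words)."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def centroid(vectors):
    """Average a list of equal-length vectors component by component."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def train(labelled_texts, vocabulary):
    """Build one centroid per sentiment label from (text, label) pairs."""
    by_label = {}
    for text, label in labelled_texts:
        by_label.setdefault(label, []).append(vectorize(text, vocabulary))
    return {label: centroid(vecs) for label, vecs in by_label.items()}


def predict(text, centroids, vocabulary):
    """Assign the label whose centroid is most similar to the text vector."""
    vec = vectorize(text, vocabulary)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Hypothetical sentiment vocabulary and training examples.
VOCABULARY = ["gain", "beat", "strong", "loss", "miss", "weak"]
TRAINING = [
    ("strong quarter with revenue gain", "positive"),
    ("earnings beat estimates on strong demand", "positive"),
    ("weak guidance after earnings miss", "negative"),
    ("quarterly loss on weak sales", "negative"),
]
centroids = train(TRAINING, VOCABULARY)
print(predict("analysts expect a strong gain", centroids, VOCABULARY))
print(predict("another weak quarter and a loss", centroids, VOCABULARY))
```

Raw term counts are used here for simplicity; weighting schemes such as TF-IDF are a common substitute in practice.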

4.4 Price Prediction

Results generated from the sentiment analysis model will be used along with a few other input variables from the text data to train a classifier that will allow us to make price predictions for individual stocks. For the price prediction model, decision tree [14], support vector machine (SVM) [7] and possibly Naive Bayes [8] models will be tested. We will compare these models and analyze which model and algorithm produces the best outcome.


This project is primarily a Python project with some integration of a database management system. We intend to use Python, the Postgres database management system, Django, and possibly a visualization tool like Tableau. All the computational resources and storage devices are available at Earlham. Since we do not require the purchase of any software, and we anticipate the data to be available online at no cost, there is no cost anticipated at this time.


• week 1-3: Finalize the data sources. Research the method/process to extract data from sources.

• week 4-5: Extract the data using the appropriate API and create a database to store the data. Request hardware resources from system admins. Have a clear idea of access and updating


• week 5-6: Start implementing the sentiment analysis algorithm to generate/predict the sentiment of text data.

• week 7-8: Test the sentiment analysis model. After confirmation of the model’s functionality, begin working on the price prediction classifier. Experiment with all the different models for the specific project.

• week 9-10: Finalize which classifier model is the most accurate and is yielding the highest predictability.

• week 11-12: Compile everything together. Use the final recommendation from the algorithm to make investment decisions. Compare the investment decisions coming from the algorithm against the traditional approach of investing in and holding a market index like the S&P 500.

• week 13-14: Showcase and present the findings and results.

7 ACKNOWLEDGEMENT

I would like to thank Xunfei Jiang for helping draft this proposal, along with the entire Computer Science Department of Earlham for providing feedback on the project idea.

REFERENCES
[1] Werner Antweiler and Murray Z. Frank. 2004. Is all that talk just noise? The information content of Internet stock message boards. Journal of Finance 59, 3 (2004), 1259–1294.
[2] Adam Atkins, Mahesan Niranjan, and Enrico Gerding. 2018. Financial news predicts stock market volatility better than close price. The Journal of Finance and Data Science 4, 2 (2018), 120–137.
[3] Joshua Charles Campbell, Abram Hindle, and Eleni Stroulia. 2015. Latent Dirichlet Allocation: Extracting Topics from Software Engineering Data. The Art and Science of Analyzing Software Data 3 (2015), 139–159. B978-0-12-411519-4.00006-9
[4] Alexander Gelbukh. 2015. Computational Linguistics and Intelligent Text Processing: 16th International Conference, CICLing 2015 Cairo, Egypt, April 14-20, 2015 Proceedings, Part II. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 9042, August (2015).
[5] Gyözö Gidófalvi. 2001. Using news articles to predict stock price movements. Department of Computer Science and Engineering University of California San Diego (2001), 9. arXiv:arXiv:0704.0773v2
[6] Axel Groß-Klußmann, Stephan König, and Markus Ebner. 2019. Buzzwords build momentum: Global financial Twitter sentiment and the aggregate stock market. 171–186 pages.
[7] Xiaolin Huang, Andreas Maier, Joachim Hornegger, and Johan A.K. Suykens. 2017. Indefinite kernels in least squares support vector machines and principal component analysis. Applied and Computational Harmonic Analysis 43, 1 (2017), 162–172.
[8] Edmond P. F. Lee, Jérôme Lozeille, Pavel Soldán, Sophia E. Daire, John M. Dyke, and Timothy G. Wright. 2001. An ab initio study of RbO, CsO and FrO (X2+; A2) and their cations (X3-; A3). Physical Chemistry Chemical Physics 3, 22 (2001), 4863–4869.
[9] Masoud Makrehchi, Sameena Shah, and Wenhui Liao. 2013. Stock prediction using event-based sentiment analysis. Proceedings – 2013 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2013 1 (2013), 337–342. 1109/WI-IAT.2013.48
[10] Warwick J. McKibbin and Roshen Fernando. 2020. The Global Macroeconomic Impacts of COVID-19: Seven Scenarios. SSRN Electronic Journal (2020). https: //
[11] Thien Hai Nguyen and Kiyoaki Shirai. 2015. Topic modeling based sentiment analysis on social media for stock market prediction. ACL-IJCNLP 2015 – 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, Proceedings of the Conference 1 (2015), 1354–1364. 1131
[12] Brendan O’Connor, Ramnath Balasubramanyan, Bryan R. Routledge, and Noah A. Smith. 2010. From tweets to polls: Linking text sentiment to public opinion time series. ICWSM 2010 – Proceedings of the 4th International AAAI Conference on Weblogs and Social Media May (2010), 122–129.
[13] J. Rocchio. 1971. Relevance feedback in information retrieval. The Smart Retrieval System-Experiments in Automatic Document Processing (1971), 313–323.
[14] P. H. Swain and H. Hauska. 1977. The decision tree classifier: Design and potential. IEEE Transactions on Geoscience Electronics 15, 3 (1977), 142–147.

Senior Capstone – Finished


Final Paper

A final version of the paper can be found here.

Final Repository

A final version of the repository can be found here.


Demographic weighting was added, as was a file that turns Rhode Island into 3 districts. Unfortunately, Pennsylvania and Kentucky data were never added due to time constraints, and this has been noted. The paper has been updated to contain results and future work recommendations.

Senior Capstone


Finding correlation between fake news and corresponding sentiment analysis


Detection of misinformation has become highly relevant and important in the past few years. A significant amount of work has been done in the field of fake news detection using natural language processing tools combined with many other filtering algorithms. However, these studies did not examine any possible connection between the tone of the news and its validity. In order to research this field and find any existing correlation, my project addresses the potential role that the sentiment associated with the news plays in identifying its validity. I perform sentiment analysis on tweets through natural language processing and use neural networks to train the model and test its accuracy.

Final Paper

Final Software Delivery

Final Software Diagram

Senior Capstone – Wildfire Simulation Program



With the increase in the number of forest fires worldwide, especially in the western United States, there is an urgent need to develop a reliable fire propagation model to aid firefighting as well as save lives and resources. Wildfire spread simulation is used to predict possible fire behavior, which is essential for fire management and training purposes. This paper proposes an agent-based model that simulates wildfire using a cellular automata approach. The proposed model incorporates a machine learning technique to automatically calculate the ignition probability without the need to manually adjust the input data for a specific location.

Software architecture diagram

The program includes three large components: the input service, the simulation model, and the output service.

The input service processes user input. The input includes the diffusion coefficient, ambient temperature, ignition temperature, burn temperature, matrix size, and a wood density data set. All of the inputs can be set to default values if users choose not to provide data. The wood density data set can also be auto-generated if a real-world data set is not provided.

The simulation model is the most important component of the program. It consists of two main parts: the fire temperature simulation service and the wood density simulation service. As the names suggest, the fire temperature simulation service is responsible for processing how fire temperature changes throughout the simulation process. The wood density simulation service is in charge of processing the changes in wood density at the locations described in the input when fire passes through.

The final component, the output service, creates a graph at each time step, and puts together the graphs into a gif file. By using the gif file, users can visualize how fire spreads given the initial inputs.
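The interaction between the temperature and wood density services can be illustrated with a single cellular-automaton time step. This is a rough sketch under stated assumptions, not the program's actual code: the function names, default values, and the `burn_rate` parameter are hypothetical, and a fixed ignition-temperature threshold stands in for the machine-learned ignition probability used in the paper.

```python
def step(temp, wood, diffusion=0.2, ambient=20.0, ignition=300.0,
         burn=1000.0, burn_rate=0.25):
    """Advance the temperature and wood-density grids by one time step."""
    n_rows, n_cols = len(temp), len(temp[0])

    def at(r, c):
        # Cells outside the matrix are treated as being at ambient temperature.
        inside = 0 <= r < n_rows and 0 <= c < n_cols
        return temp[r][c] if inside else ambient

    new_temp = [row[:] for row in temp]
    new_wood = [row[:] for row in wood]
    for r in range(n_rows):
        for c in range(n_cols):
            if temp[r][c] >= ignition and wood[r][c] > 0:
                # Burning cell: held at burn temperature while it consumes wood.
                new_temp[r][c] = burn
                new_wood[r][c] = max(0.0, wood[r][c] - burn_rate)
            else:
                # Otherwise heat diffuses in from the four neighbouring cells.
                neighbours = (at(r - 1, c) + at(r + 1, c)
                              + at(r, c - 1) + at(r, c + 1))
                new_temp[r][c] = temp[r][c] + diffusion * (neighbours - 4 * temp[r][c])
    return new_temp, new_wood


def make_grids(size, ambient=20.0, density=1.0):
    """Create size x size grids at ambient temperature with uniform wood density."""
    temp = [[ambient] * size for _ in range(size)]
    wood = [[density] * size for _ in range(size)]
    return temp, wood


temp, wood = make_grids(3)
temp[1][1] = 1000.0  # ignite the centre cell
for _ in range(5):
    temp, wood = step(temp, wood)
print(wood)
```

Each call to `step` yields the grids for one frame, so an output stage could plot every frame and assemble the plots into the animation described above.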

Design Overview
Simulation Model

Link to the final version of the paper

Link to the final version of the software demonstration video (hosted on YouTube)

Senior Capstone


Sarcasm Detection Using Neural Nets


Over the last decade, researchers have come to realize that sarcasm detection is more than just another natural language task such as sentiment analysis. Problems like human error and longer processing times arise because previous researchers manually created the features used to detect sarcasm. In an effort to limit these problems, researchers moved away from models built on hand-crafted features and turned to neural networks to predict sarcasm. To understand sarcasm, one needs some background information on the topic and common shared knowledge, and one also needs to exist in the space in which the sarcastic statement is made. With this in mind, introducing visual aspects of a conversation should help improve the accuracy of a sarcasm prediction model.

Software Demo Video
Software Architecture Diagram

Senior Capstone, Looking Back



An ever-present problem in the world of politics and governance in the United States is that of unfairly political congressional redistricting, often referred to as gerrymandering. One method for removing gerrymandering that has been proposed is that of using software to create nonpartisan, unbiased congressional district maps, and there have been some researchers who have done work along these very same lines. This project seeks to be a tool with which one can create congressional maps while adjusting the weights of various factors that it takes into account, and further evaluate these maps using the Monte Carlo method to simulate thousands of elections to see how ‘fair’ the maps are.

Software Architecture Diagram

As shown in the figure above, this software will create a congressional district map based on pre-existing datasets (census and voting history) as well as user-defined factor weighting. The map is then evaluated with a Monte Carlo method that simulates thousands of elections in order to assess the fairness of the new map. The census data, which includes race/ethnicity, income, age, gender, geographical location (urban, suburban, or rural), and educational attainment, is used both for the user-defined factor weighting and for determining the likelihood of voting for either party (Republican or Democrat). The voting history is based on precinct-by-precinct results in Congressional races and carries a heavy weight in the election simulation.
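The Monte Carlo evaluation can be sketched as follows, assuming each district has already been reduced to one party's expected two-party vote share. The function names, the normally distributed swing, the `swing_sd` value, and the seat-versus-vote gap used as a fairness measure are illustrative assumptions, not the project's actual model.

```python
import random


def simulate_elections(district_lean, n_elections=10_000, swing_sd=0.05, seed=0):
    """Simulate elections and return the average number of seats won by party A.

    district_lean maps each district to party A's expected two-party vote share.
    swing_sd is a hypothetical standard deviation for the random swing per race.
    """
    rng = random.Random(seed)
    total_seats = 0
    for _ in range(n_elections):
        for share in district_lean.values():
            # Draw a vote share around the district's lean; above 50% wins the seat.
            if rng.gauss(share, swing_sd) > 0.5:
                total_seats += 1
    return total_seats / n_elections


def seat_vote_gap(district_lean, mean_seats):
    """Average seat share minus average vote share (assumes equal-size districts)."""
    seat_share = mean_seats / len(district_lean)
    vote_share = sum(district_lean.values()) / len(district_lean)
    return seat_share - vote_share


# Hypothetical three-district map.
lean = {"District 1": 0.55, "District 2": 0.48, "District 3": 0.60}
mean_seats = simulate_elections(lean)
print(round(mean_seats, 2), round(seat_vote_gap(lean, mean_seats), 3))
```

A gap near zero suggests seats track votes, while a large gap in either direction flags a map that favours one party. Fixing the seed keeps runs reproducible when comparing candidate maps.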

Research Paper

The current version of my research paper can be found here.

Software Demonstration Video

A demonstration of my software can be found here.

Senior Capstone


An Integrated Model for Offline Handwritten Chinese Character Recognition Based on Convolutional Neural Networks


Optical Character Recognition (OCR) is an important technology in computer vision and pattern recognition that recognizes text embedded in images. Although OCR has achieved high accuracy for languages with alphabet-based writing systems, its performance on handwritten Chinese text remains poor due to the complexity of the Chinese writing system. In order to improve the accuracy rate, this paper proposes an integrated OCR model for Chinese handwriting that combines existing methods in the pre-processing phase, recognition phase, and post-processing phase.


Software Demo Video

Software Architecture Diagram

Phi Nguyen – Senior Capstone



Modern internet architecture faces the challenge of services centralized by big tech companies, which capitalize on users’ information. Most well-known chat services currently depend on a third-party server that stores the users’ conversations. We also have to face the challenges of regulation and government authorization. To solve this problem, we propose a peer-to-peer architecture for video chat that is private to the people involved in the conversation.

Link paper:

Link demo:

CS488 – Abstract


Optical Character Recognition (OCR) is an important technology in computer vision and pattern recognition that recognizes text embedded in images. Although OCR has achieved high accuracy for languages with alphabet-based writing systems, its performance on handwritten Chinese text remains poor due to the complexity of the Chinese writing system. In order to improve the accuracy rate, this paper proposes an integrated OCR model for Chinese handwriting that combines existing methods in the pre-processing phase, recognition phase, and post-processing phase.
