Jarred – Three Pitches
Pitch 1 – Develop an API for Sensor
I was hoping to work with sensors to develop an interface/app that would make inexpensive hardware more useful for individuals' projects. Say there is a water-quality sensor that is cheap but fairly rudimentary and lacks a good API; I could build an app around it for students or hobbyists who want to run small experiments on water quality near their house or school. The work would involve programming and interfacing with the hardware sensor. I am not yet sure which dataset I would use or need for this; I could start with an existing one to work on different graphing techniques in the API.
Pitch 2 – Research into Feasibility and Practicality of Low-Cost Portable Air Quality Sensor and Network with Smartphone Connection
Similar to the previous sensor pitch, this would use air sensors packed into a compact container attached to a person. The sensors would be built on an Arduino; Charlie said he would show me some sensors and how to use the Arduino this week when he is back on campus. The air sensors would continuously read for dangerous compounds in the air, notify the user through a smartphone app, and track the measured amounts in different places. The sensor itself would not track location; the phone app would, since the sensor would likely connect to the phone over Bluetooth or another wireless link and would therefore have to stay near the phone to work. This would be very useful for emergency responders and for maintenance workers who deal with boilers and furnaces.
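The alerting logic could start as simple threshold checks before anything fancier. The sketch below is a minimal Python stand-in; the compounds and threshold values are illustrative placeholders, not regulatory limits:

```python
# Hypothetical alert thresholds in ppm (illustrative only, not regulatory limits).
THRESHOLDS_PPM = {"CO": 35.0, "NO2": 0.1, "VOC": 0.5}

def check_air(sample: dict) -> list:
    """Return the compounds in `sample` that exceed their alert threshold.

    `sample` maps a compound name to its measured concentration in ppm;
    unknown compounds are ignored.
    """
    return [gas for gas, ppm in sample.items()
            if ppm > THRESHOLDS_PPM.get(gas, float("inf"))]
```

On the phone side, a non-empty result would trigger the notification and be logged together with the phone's current location.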
Pitch 3 – Land-Based Oil Spill Modeling
I was thinking about something in the realm of climate change simulations. Perhaps modeling the damage from an oil pipeline bursting and the spill then being carried inland by a flood caused by excess rainfall due to climate change; or perhaps a wildfire simulation, or a smoke-pollution simulation from such wildfires. The goal would be to get an idea of how we could mitigate these catastrophes while the risk is still present. It would require a lot of machine learning and GIS data.
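As a toy example of the kind of land-based transport model this could grow into, here is a single explicit diffusion step on a 2-D concentration grid. It is a stand-in for real GIS-driven modeling; the grid, boundary handling, and rate are made up:

```python
import numpy as np

def diffuse(conc: np.ndarray, rate: float = 0.1) -> np.ndarray:
    """One explicit diffusion step on a 2-D concentration grid.

    Each cell exchanges mass with its four neighbours; edge padding
    gives reflecting boundaries, so total mass is conserved.
    """
    padded = np.pad(conc, 1, mode="edge")
    lap = (padded[:-2, 1:-1] + padded[2:, 1:-1] +
           padded[1:-1, :-2] + padded[1:-1, 2:] - 4 * conc)
    return conc + rate * lap
```

A real model would add advection from flood flow and terrain from GIS layers, but even this step shows how a point release spreads over iterations.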
Lam – CS388 Three Pitches
Idea 1: Detect and predict mental health status using NLP
The objective of this research is to determine whether NLP (Natural Language Processing) is useful for mental health treatment. Data sources range from traditional ones (EHRs/EMRs, clinical notes and records) to social ones (tweets and posts from social media platforms). The main data source will be social media posts, because they enable immediate detection of a user's possible mental illness. Posts are preprocessed by extracting key words to form an artificial paragraph containing only those numbers and characters. An NLP model built on both genuine and artificial data will then be tested on the real (raw/original) data and compared against the artificial paragraphs. The output will be the predicted mental health status of each patient.
Libraries such as scikit-learn and Keras are frequently used in papers on this topic. The dataset I would like to experiment with is the suicide and depression dataset of Reddit posts starting at \r.
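A first experiment along these lines might look like the scikit-learn sketch below. The posts and labels are tiny made-up stand-ins for the real Reddit data, and TF-IDF with logistic regression is just one plausible baseline, not the final model:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in posts; a real run would load the Reddit suicide/depression data.
posts = [
    "I feel hopeless and alone every day",
    "nothing matters anymore I want to disappear",
    "had a great hike with friends today",
    "excited about my new job next week",
] * 3  # repeat so the tiny model sees a few examples per class
labels = ["at_risk", "at_risk", "ok", "ok"] * 3

# TF-IDF features feeding a linear classifier, wrapped in one pipeline.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(posts, labels)
```

With real data, cross-validated precision/recall per class would matter far more than raw accuracy, given how imbalanced at-risk posts are.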
Idea 2: Using Machine Learning Algorithms to predict air pollution
Machine learning algorithms can predict air pollution from the amount of pollutant in the atmosphere, measured by the level of particulate matter (we will look specifically at PM2.5, the air pollutant most dangerous to human health). Given data indicating PM2.5 levels in different areas, the data first has to be preprocessed. It will then be run through several machine learning models, notably random forest and linear regression, chosen for their simplicity and their applicability to regression and classification problems, to produce the best model for forecasting PM2.5 levels in the air.
Python and its library scikit-learn are the most common tools for training such models. For this idea, I would like to use a dataset measuring air quality in five Chinese cities, which is available on Kaggle.
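A baseline comparison of the two models could be sketched as below. The features and the PM2.5 relationship are synthetic stand-ins for the Kaggle data, chosen only so the script is self-contained:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the five-cities data: PM2.5 as a noisy, made-up
# function of three weather features (temperature, humidity, wind speed).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 3))
y = 80 * X[:, 1] - 40 * X[:, 2] + 30 + rng.normal(0, 2, 300)

X_train, X_test = X[:200], X[200:]
y_train, y_test = y[:200], y[200:]

lin = LinearRegression().fit(X_train, y_train)
rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

print("linear R^2:", lin.score(X_test, y_test))
print("forest R^2:", rf.score(X_test, y_test))
```

On the real dataset, the preprocessing step (missing readings, per-city normalization, time features) would likely matter more than the choice between these two models.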
Idea 3: Machine Learning in music genre classification
Given an input dataset of thousands of songs from genres such as pop, rock, hip hop/rap, and country, the audio will first be processed by reducing/filtering noise in the sound wave. A neural network model will then take those audio inputs and images of their spectrograms and learn to differentiate them. Finally, the KNN method, chosen for its simplicity and robustness to large noisy data, will take the model outputs and classify the songs into their respective genres.
Python is used in most of this research thanks to its wide range of libraries, such as Keras, scikit-learn, TensorFlow, and even NumPy, that support the work. For this idea, I would like to use a Spotify dataset thanks to its rich collection of songs.
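A toy version of the spectrum-plus-KNN part of the pipeline, using NumPy only; real audio features (e.g. mel spectrograms) and real songs would replace the coarse FFT bands and synthetic tones used here:

```python
import numpy as np

def spectrum_features(wave: np.ndarray, n_bins: int = 16) -> np.ndarray:
    """Coarse magnitude-spectrum feature vector for one audio clip."""
    mags = np.abs(np.fft.rfft(wave))
    # average the spectrum into n_bins coarse frequency bands
    return np.array([band.mean() for band in np.array_split(mags, n_bins)])

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority label among its k nearest training vectors."""
    dists = np.linalg.norm(train_X - x, axis=1)
    nearest = np.array(train_y)[np.argsort(dists)[:k]]
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]
```

In the full pitch, `train_X` would hold the neural network's learned embeddings rather than raw FFT bands; KNN then operates on those outputs exactly as above.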
Pitch 1
Social media sentiments to predict mental health of people during the stages of pandemic in the United States
- I think it is important to know how people are feeling, whether they are hopeful, devastated, excited, or feeling fine during the pandemic.
- With new variants showing up, people’s sentiments may keep changing. So, I want to see if virus variants, mask mandates loosening up, or vaccine rollout had an impact on people’s sentiments. For example, people were probably being hopeful and excited after getting vaccinated but the delta variant may have increased the negative sentiments.
- I plan to use Twitter data with a sentiment-analysis machine learning model and then visualize the data with interactive charts.
- I will use Twitter's API to collect tweets with #covid19, #pandemic, and other covid-related hashtags.
- Timeline of the covid-19 event (From the timeline below I will only pick certain events): https://www.ajmc.com/view/a-timeline-of-covid19-developments-in-2020
- Resources: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3572023.
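A minimal stand-in for the sentiment-scoring step might look like the lexicon-based sketch below. The word lists are hand-picked placeholders; a real project would use a trained model or a full lexicon such as VADER:

```python
# Tiny illustrative lexicon (placeholders, not a real sentiment lexicon).
POSITIVE = {"hopeful", "excited", "grateful", "relieved"}
NEGATIVE = {"scared", "devastated", "sick", "worried", "exhausted"}

def sentiment(tweet: str) -> int:
    """Score a tweet: +1 per positive word, -1 per negative word."""
    words = tweet.lower().split()
    return (sum(w in POSITIVE for w in words)
            - sum(w in NEGATIVE for w in words))
```

Scores like these, averaged per week and plotted against the event timeline above, would show whether vaccine rollout or the delta variant shifted overall sentiment.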
Pitch 2
Social media sentiments to predict the covid situation and vaccine rollout in different countries
- Why is this important?
  - In the US, people may be more hopeful and excited since the vaccination rate is going up and mask mandates are being removed.
  - However, there are parts of the world where the majority of the population is not vaccinated, and many people are dying of covid.
  - So, it would be interesting to use Twitter sentiments to identify the parts of the world with higher vaccination rates and better situations versus those with lower vaccination rates and worse situations.
- I plan to use Twitter data with a sentiment-analysis machine learning model and then visualize the data with interactive charts.
- I will use Twitter's API to collect tweets with #covid19, #pandemic, and other covid-related hashtags, and categorize them by country name.
- I might limit the analysis to 5-6 countries.
- Resources: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3572023
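Once tweets are scored and tagged with a country, the per-country aggregation step is straightforward. A minimal sketch, with made-up scores and country codes:

```python
from collections import defaultdict
from statistics import mean

def mean_sentiment_by_country(scored_tweets):
    """Average sentiment score per country from (country, score) pairs."""
    by_country = defaultdict(list)
    for country, score in scored_tweets:
        by_country[country].append(score)
    return {c: mean(scores) for c, scores in by_country.items()}
```

Comparing these per-country averages against reported vaccination rates would then test the prediction at the heart of this pitch.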
Pitch 3
Predicting tourist volume and tourist behavior with search engine data
- Being able to predict the volume of tourists arriving at a destination is important for planning and for adjusting capacity to demand.
- Especially during this global pandemic, knowing the expected volume of arriving tourists helps ensure proper arrangements for testing and quarantining, and shows whether the area is likely to become high-risk for covid.
- I will use Google Trends data with tourism-related keywords to predict the volume of tourists.
- I will do this for only 5 countries.
- To check whether my predictions are correct, I will test them against available historical data.
- Data for tourism volume -> https://ourworldindata.org/tourism and https://www.statista.com/statistics/261733/countries-in-asia-pacific-region-ranked-by-international-tourist-arrivals/
Resources: https://www.sciencedirect.com/science/article/abs/pii/S2211973617300570
CS 388: Three pitches
Cryptocurrency price prediction
This would predict cryptocurrency prices using deep learning. As cryptocurrency popularity and money flow increase, cryptocurrencies are becoming more volatile and their patterns are changing. One difficulty is that, unlike the stock market, cryptocurrencies depend on factors such as their technological progress and internal competition. I plan to get data from news agencies about specific tokens, along with price history from 2012 onward, to predict future prices. From my research, this would require LSTM neural networks. Possible data sources include the Cryptocompare API (XEM and IOT historical prices at hourly frequency), the Pytrends API (Google News search frequency for the phrase "cryptocurrency"), and scraping redditmetrics.com (subscription growth for the "CryptoCurrency," "Nem," and "Iota" subreddits). We can also identify a cointegrated pair, a popular approach for stationarizing a time series that removes trend and seasonality; for this project, taking the difference of consecutive observations serves that purpose.
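The differencing step mentioned above is easy to sketch. This is standard first-order differencing (not a full cointegration test), applied to a price series before feeding it to a model such as an LSTM:

```python
def difference(series):
    """First-order differencing: the change between consecutive observations.

    Removes a linear trend from a price series, a standard step toward
    making the series stationary before model fitting.
    """
    return [b - a for a, b in zip(series, series[1:])]
```

The model would then be trained on the differenced series, and predictions cumulatively summed back to recover price levels.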
Gender and age detection using deep learning
This would predict the age and gender of a person from a photo or a live webcam view. The predicted gender would be 'Male' or 'Female'. Accurately guessing an exact age from a single image is very difficult because of factors like makeup, lighting, obstructions, and facial expressions. I will be using the Adience dataset, which is available in the public domain; it serves as a benchmark for face photos and covers various real-world imaging conditions like noise, lighting, pose, and appearance. Since there are already multiple studies on this topic, I can work on the factors that affect the model's accuracy.
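Because exact ages are so hard to predict, Adience-style work usually treats age as coarse buckets. A small preprocessing sketch for mapping exact ages to buckets; the bucket boundaries below are listed from memory and should be verified against the dataset's documentation:

```python
# Age buckets in the style of the Adience benchmark
# (from memory; verify against the dataset's documentation).
AGE_GROUPS = [(0, 2), (4, 6), (8, 12), (15, 20),
              (25, 32), (38, 43), (48, 53), (60, 100)]

def nearest_group(age: int):
    """Map an exact age to the closest age bucket."""
    def dist(group):
        lo, hi = group
        return 0 if lo <= age <= hi else min(abs(age - lo), abs(age - hi))
    return min(AGE_GROUPS, key=dist)
```

Framing age as an 8-way classification like this is also what makes reported accuracies across studies comparable.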
Forest fire detection using k-means clustering
This model would detect forest fires. As seen recently around the world, in places such as the Amazon rainforest and large parts of Australia, wildfires are increasing in this era. These disasters damage ecosystems, destroying habitat and releasing carbon dioxide. The project can be built using k-means clustering: the model would identify forest fire hotspots, along with the intensity of the fire at each spot, and decide whether a region contains a wildfire or not. Another approach uses the MobileNet-V2 or U-Net neural networks (e.g. with the Keras deep learning library), which is more efficient, and I will research this further. A dataset of over 1,300 images for wildfire detection is available at https://drive.google.com/file/d/11KBgD_W2yOxhJnUMiyBkBzXDPXhVmvCt/view.
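A bare-bones version of the k-means step, clustering pixel colors so that fire-colored pixels fall into their own cluster, might look like this NumPy sketch; synthetic "orange" and "green" pixels stand in for real image frames:

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: returns (centroids, labels) for the given points."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)].astype(float)
    for _ in range(iters):
        # assign each point to its nearest centroid
        dists = np.linalg.norm(points[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of its assigned points
        for j in range(k):
            if (labels == j).any():
                centroids[j] = points[labels == j].mean(axis=0)
    return centroids, labels
```

On real frames, the cluster whose centroid is closest to fire colors would mark the hotspot pixels, and cluster size could proxy for intensity.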
CS 388: The first three pitches
A new voting ensemble scheme
- Voting is a popular technique in Machine Learning to aggregate the predictions of multiple models and produce a more robust prediction. Some of the most widely used voting schemes are majority voting, rank voting, etc. I hope to propose a new voting system that particularly focuses on solving the issue of overfitting by using the non-conflict data points to inform the prediction of data points where conflict does arise.
- Evaluation:
- Compare its overall performance with that of the popular voting schemes
- Examine ties → see if it’s better than flipping a coin
- Apply statistical hypothesis testing to these analyses
- Possible datasets to work with:
- Any dataset whose target is categorical (i.e. it’s a classification problem). Preferably, the features are numerical and continuous.
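As a baseline to compare the new scheme against, plain majority voting with explicit tie detection might be sketched as follows; flagging ties is what lets a smarter rule take over on exactly the conflict cases this pitch targets:

```python
from collections import Counter

def majority_vote(predictions):
    """Majority vote over one sample's predictions from several classifiers.

    Returns (winner, tied): `tied` is True when the top count is shared,
    so those cases can be routed to a better tie-breaking rule.
    """
    counts = Counter(predictions).most_common()
    top_count = counts[0][1]
    tied = len(counts) > 1 and counts[1][1] == top_count
    return counts[0][0], tied
```

Evaluation would then compare accuracy on the tied subset specifically, since that is where the proposed scheme should beat a coin flip.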
Comparing the performance of the different Hyperparameter Tuning methods
Hyperparameter tuning is an important step in building a strong machine learning model. However, the hyperparameter space grows exponentially and the interactions among hyperparameters are often nonlinear, which limits the feasible methods for finding a more optimal set of hyperparameters. I plan to examine some of the most common methods used to tackle this problem and compare their performance:
- Grid Search
- Random Search
- Sobol (hybrid of the two aforementioned methods)
- Bayesian Optimization
To many people's surprise, Random Search sometimes outperforms Grid Search. My project aims to verify this claim by applying the techniques above to a range of benchmark datasets and prediction algorithms.
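A toy illustration of the equal-budget comparison, using a made-up one-peak objective in place of a real validation score; real experiments would substitute cross-validated model performance:

```python
import random

def objective(lr, depth):
    """Toy validation score peaking at lr=0.07, depth=6 (made up)."""
    return -((lr - 0.07) ** 2) * 100 - ((depth - 6) ** 2) * 0.01

# Grid search: 4 x 4 = 16 evaluations on a fixed lattice.
grid = [(lr, d) for lr in (0.001, 0.01, 0.1, 1.0) for d in (2, 4, 8, 16)]
best_grid = max(grid, key=lambda p: objective(*p))

# Random search: the same budget of 16 draws from the full ranges.
rng = random.Random(0)
draws = [(rng.uniform(0.001, 1.0), rng.randint(2, 16)) for _ in range(16)]
best_rand = max(draws, key=lambda p: objective(*p))
```

The grid can never land exactly on the optimum unless it happens to lie on the lattice, which is one intuition for why random draws sometimes win at equal budget.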
The referee classifier
This is another voting ensemble scheme. We pick the classifier that performs best under conflict and give it the role of referee to settle "disputes" among the classifiers. The same principle can be used for breaking ties, but we could also try removing the classifier that performs worst under conflict.
We can try out a diverse set of classification algorithms like Decision Tree, Support Vector Machine, KNN, Naive Bayes, Logistic Regression, etc. and run them on the benchmark datasets from UCI. This proposed voting scheme can then be compared against the more common Simple Majority Voting Ensemble approach in terms of accuracy and other performance metrics.
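A minimal sketch of the referee rule itself, operating on already-computed predictions; the model names and conflict accuracies below are made up for illustration:

```python
from collections import Counter

def referee_vote(per_model_preds, conflict_accuracy):
    """Majority vote per sample; on ties, defer to the 'referee'.

    per_model_preds: {model_name: [prediction per sample]}
    conflict_accuracy: {model_name: accuracy measured on past conflict cases}
    The referee is the model with the highest conflict accuracy.
    """
    referee = max(conflict_accuracy, key=conflict_accuracy.get)
    n = len(next(iter(per_model_preds.values())))
    resolved = []
    for i in range(n):
        votes = [preds[i] for preds in per_model_preds.values()]
        counts = Counter(votes).most_common()
        if len(counts) > 1 and counts[0][1] == counts[1][1]:
            resolved.append(per_model_preds[referee][i])  # tie: referee decides
        else:
            resolved.append(counts[0][0])
    return resolved
```

Measuring `conflict_accuracy` on a held-out set of conflict samples, rather than overall accuracy, is the distinctive choice this scheme would be testing.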