Lam Hoang – CS/DS Student Portfolios

Idea 1: Detect and predict mental health status using NLP

The objective of this research is to determine whether NLP (Natural Language Processing) is useful for mental health treatment. Data sources range from traditional clinical sources (EHRs/EMRs, clinical notes and records) to social sources (tweets and posts from social media platforms). The main data source will be social media posts because they enable immediate detection of a user's possible mental illness. The posts are preprocessed by extracting their key words to form an artificial paragraph that contains only numbers and letters. Then, an NLP model, built on genuine and artificial data, will be tested on the real (raw/original) data, and its results will be compared with those for the artificial paragraph above. The output will be the predicted mental health status of each user after testing.

Libraries such as scikit-learn and Keras are frequently used in papers on this topic. The dataset I would like to experiment with is the suicide and depression dataset of Reddit posts starting at \r.
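
As a rough illustration of the key-word extraction step in this idea, here is a minimal sketch that uses only the Python standard library. The stop-word list, the `extract_keywords` helper, the `top_k` parameter, and the sample post are all assumptions made up for this example; none of them come from the Reddit dataset or from any paper.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real project would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "i", "to", "of", "is", "it", "in", "my", "so", "for", "at"}

def extract_keywords(post: str, top_k: int = 5) -> str:
    """Reduce a raw post to its top_k most frequent non-stop-word tokens."""
    # Lowercase, then keep only runs of letters and digits (drops punctuation and symbols).
    tokens = re.findall(r"[a-z0-9]+", post.lower())
    content = [t for t in tokens if t not in STOP_WORDS]
    # most_common orders by count; tokens with equal counts keep first-seen order.
    keywords = [word for word, _ in Counter(content).most_common(top_k)]
    return " ".join(keywords)

# Made-up example post, not taken from any dataset.
sample = "I cannot sleep and I cannot focus. Sleep is so hard lately, 3 nights in a row!!! #tired"
print(extract_keywords(sample))
```

The result is a short "artificial paragraph" of letters and digits only. In a fuller pipeline, a weighting scheme such as scikit-learn's `TfidfVectorizer` would probably replace the raw frequency count used here.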

Idea 2: Using Machine Learning Algorithms to predict air pollution

Machine learning algorithms can predict air pollution from the amount of air pollutants in the atmosphere, measured here by the level of particulate matter (we will look specifically at PM2.5, widely regarded as one of the most harmful air pollutants to human health). Given data indicating PM2.5 levels in different areas, the data first has to be preprocessed. It is then fed to several machine learning models, notably random forest and linear regression, chosen for their simplicity (random forest handles both regression and classification problems, while linear regression is a simple regression baseline), to find the best model for forecasting the PM2.5 level in the air.

Python and its library scikit-learn are the most common tools for training such models. For this idea, I would like to use a dataset of air quality measurements from 5 Chinese cities, which is available on Kaggle.
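
The preprocess, fit, and compare workflow for this idea could look roughly like the sketch below. It is written with the standard library only so that it stays self-contained. The readings are made up for illustration, wind speed is a hypothetical input feature, and the helper names (`drop_missing`, `fit_line`, `rmse`) are my own. Only the linear regression baseline is shown; the random forest is left out.

```python
import math

# Made-up readings for illustration only: (wind speed in m/s, PM2.5 in ug/m3).
# None marks a missing sensor value.
readings = [
    (1.0, 152.0), (2.0, 129.0), (3.0, None), (4.0, 91.0),
    (5.0, 68.0), (None, 60.0), (6.0, 52.0), (7.0, 29.0),
]

def drop_missing(rows):
    """Preprocessing step: keep only rows where both values are present."""
    return [(x, y) for x, y in rows if x is not None and y is not None]

def fit_line(rows):
    """Ordinary least squares for y = slope * x + intercept (single feature)."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def rmse(rows, predict):
    """Root mean squared error of a prediction function over rows."""
    return math.sqrt(sum((predict(x) - y) ** 2 for x, y in rows) / len(rows))

clean = drop_missing(readings)
train, held_out = clean[:4], clean[4:]  # simple ordered split, no shuffling

slope, intercept = fit_line(train)
train_mean = sum(y for _, y in train) / len(train)

linear_error = rmse(held_out, lambda x: slope * x + intercept)
baseline_error = rmse(held_out, lambda x: train_mean)  # always predict the training mean
print(f"linear RMSE: {linear_error:.2f}, mean-baseline RMSE: {baseline_error:.2f}")
```

Comparing every candidate against a trivial mean baseline on held-out rows is one simple way to decide which model is "best". With scikit-learn, `LinearRegression` and `RandomForestRegressor` could be dropped into the same comparison.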

Idea 3: Machine Learning in music genre classification

Given an input dataset of thousands of songs from different genres such as pop, rock, hip hop/rap, country, etc., the audio will first be processed by reducing/filtering noise in the sound wave. Then, a neural network model will take the audio inputs and images of their spectrograms and differentiate the songs in its outputs. Finally, the KNN (k-nearest neighbors) method, chosen for its simplicity and robustness with large, noisy data, will be applied to the model outputs to classify the songs into their respective genres.

Python is used throughout this research because of its wide range of libraries, such as Keras, scikit-learn, TensorFlow, and NumPy, that support the work. For this idea, I would like to use a Spotify dataset thanks to its rich collection of songs.
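
To make the final KNN step concrete, here is a minimal sketch in plain Python. It skips the noise filtering and the neural network entirely and assumes each song has already been reduced to a small numeric feature vector. The two features (loosely, tempo and energy scaled to 0..1), the values, the labels, and the `knn_predict` helper are all made up for illustration and are not drawn from the Spotify dataset.

```python
import math
from collections import Counter

# Made-up, already-scaled feature vectors with genre labels (illustration only).
training_songs = [
    ((0.90, 0.80), "rock"), ((0.85, 0.90), "rock"), ((0.80, 0.75), "rock"),
    ((0.30, 0.20), "country"), ((0.35, 0.30), "country"), ((0.25, 0.25), "country"),
]

def knn_predict(features, labelled, k=3):
    """Return the majority genre among the k nearest training songs."""
    # Sort training songs by Euclidean distance to the query vector.
    by_distance = sorted(labelled, key=lambda item: math.dist(features, item[0]))
    # Majority vote over the labels of the k closest songs.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

print(knn_predict((0.82, 0.85), training_songs))
print(knn_predict((0.28, 0.22), training_songs))
```

In practice, scikit-learn's `KNeighborsClassifier` would likely replace this hand-written version, and the choice of `k` would be tuned on held-out songs.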