The finalized poster for EPIC Expo is attached here, on the addition of new training samples to a decision tree.
This past week I set up a git repo for my senior research, and am now working to create a system diagram for my project.
In addition, this week I will be working through the source code of scikit-learn to better understand how they create their decision trees, and identify which parts of the extensive source code are required for my project, so as to reduce the space required for the research.
For this week, I reviewed the source material on decision tree creation, attempting to further my understanding, and think about where to begin editing.
For next week, I will be setting up a git repo with the source code from sci-kit learn, and begin to work on the design of how to add more data.
In machine learning, there has been a shift away from focusing on the creation complex algorithms to solve problems. Instead, a large focus has been on the application of simpler algorithms which learn from datasets. This shift has been made possible through the ever-increasing computational power of modern computers, and the massive amounts of data generated and gathered through the internet of things. However, even given the power and storage cheaply available for creating these models, it can still be quite time and space intensive to make a useful machine learning model. Datasets can vary in size, but can range from hundreds of thousands, millions, or even billions of unique data points. Due to the copious amount of data, training even a relative fast machine learning model can take hours or days. Because of how time and resource intensive that process is, companies often wait to recreate the model, even though they may lose some performance due to it. Some companies even wait to do it on a monthly basis, such as CoverMyMeds, who update their models every 28 days.
Part of why updating models is so intensive is that many do not allow data to be added after they are initially trained. This means each time you want to add data, you must create a new version from scratch, using the old set and the new points. Other types of models do allows this though, so it is possible to add it. The aim of my research focuses on learning how to add data dynamically to the model from neural networks, a machine learning algorithm based off of how the brain works with neurons, and apply similar logic to classification decision trees. The hypothesis of my research is that the time intensity of updating a decision tree can be decreased by adding data incrementally, with little loss to the tree’s effectiveness.
For the paper associated with this research, I will focus on the theory behind neural networks and their dynamic data addition, how decision trees are created, and how I will be adapting their training to mimic the behavior of neural networks when it comes to training. However, it may not be the case that decision trees can be changed to act as neural networks do, but can be edited in some other manner. To confirm or discredit my hypothesis, the resulting software will be tested on a series of datasets which range in size, type, and topic, and recorded in the paper.
For the software component of this research, I will be reviewing, editing, and testing the sci-kit learn package in Python, which comes with well-tested and documented implementations of both decision trees and neural networks. These will be gathered into a Git repository, along with the relevant datasets, my edited version of the code, and the necessary files to run to test the results.
My senior research is on adapting how decision trees in Python add data, to allow them to grow further after their initial creation. By allowing data to be added later, we can greatly reduce the time required to update a model in production. Please find attached a copy of the proposal for this research.