Annotated Bibliography

with No Comments
  1. Pitch #1: “Online Child Safety: Using deep learning to detect inappropriate video content for children.”: 

In today’s digital world, where online content is easily accessible, my project aims to use deep learning and machine learning techniques, such as Neural Networks, to tackle the challenge of inappropriate content on video platforms like YouTube. Using advanced image processing and pattern recognition, my goal is to detect and flag unsuitable imagery and audio within videos. With a focus on creating a child-safe environment, I aim to build a comprehensive dataset of video file tags, making online platforms more secure for users of all ages and fostering a responsible and enjoyable digital experience. 

Research question: To improve the accuracy and efficiency of content classification on YouTube, can I compare and evaluate the performance of different deep learning architectures, such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) versus human judgment, in the context of detecting inappropriate content? What are the trade-offs in terms of computational resources and real-time processing?

Datasets:

https://zenodo.org/record/3632781 (Restricted)

https://figshare.com/articles/dataset/The_Image_and_video_dataset_for_adult_content_detection_in_image_and_video/14495367/1 (adult content detention)

YouTube Data API (metadata)

Good Resources:

https://cbw.sh/static/pdf/tahir-asonam19.pdf
  1. Bringing the Kid back into YouTube Kids: Detecting Inappropriate Content on Video Streaming Platforms
[1] Tahir, Rashid, Faizan Ahmed, Hammas Saeed, Shiza Ali, Fareed Zaffar, and Christo Wilson. 2019. “Bringing the Kid Back into YouTube Kids: Detecting Inappropriate Content on Video Streaming Platforms.” In Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, 464-469.

  • Nowadays, YouTube (YT) has become a very popular platform used by a lot of people, including children. YT has a section for children called YT Kids, but sometimes inappropriate children’s content may be leaked onto the platform. In this research, they collected data manually from a range of videos and used deep learning as a filter mechanism.
  • The researchers designed and implemented a deep learning model just for the task of content classification. They used a variety of deep learning models, such as Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN), to analyze elements of the content of the videos, including visual, audio, and motion. These model were chosen for their ability to learn complex patterns from raw data.
  • The management of video data is more complex because there are more elements to evaluate than a photo, for example, so they used a pre-trained CNN (VGG19) to extract relevant visual features from individual frames in a video. They used a type of RNN Bidirectional Long short-term memory (LSTM) to capture temporal relations in the motion features in videos, and finally, to process the audio of the videos, they used spectrograms to extract relevant audio and then deep learning models can analyze audio features to detect inappropriate language, words, etc. 
  • The way this research processed and classified their data was that the extracted motion, audio, and visual elements of each scene were put together into a single feature vector for that scene. Then, this single feature vector is fed into a fully connected neural network layer, followed by a softmax layer for classification. Deep learning models can effectively process and combine all these features to make decisions about the content of each scene. 
  • This paper is relevant to my pitch because my project aims to use deep learning and machine learning for video analysis, which is the research of this paper which also deals with the issue of content filtering and detection. 
  1.  A Deep Learning-Based Approach for Inappropriate Content Detection and Classification of YouTube Videos
[2] Yousaf, Kanwal, and Tabassam Nawaz. 2022. “A Deep Learning-Based Approach for Inappropriate Content Detection and Classification of YouTube Videos.” IEEE Access 10 (2022): 16283-16298.

  • This paper talks about YT being a platform that attracts malicious uploaders who share inappropriate content targeting children. To address this issue, this article proposes a deep learning-based architecture for detecting and classifying inappropriate content video by using an ImageNet pre-trained CNN model, specifically EfficientNet-B7 to extract video descriptors. This descriptor is then processed by a bidirectional long-short-term memory (BiLSTM) network to learn effective video representation and video classification. An attention mechanism is also incorporated into the network. 
  • Their proposed model uses a dataset of cartoon clips collected from YouTube videos, and the results are compared with traditional machine learning classifiers.
  • This paper’s methodology is processed in three steps. Video processing in which YouTube videos are split into small clips and labeled through manual annotation. Then, the CNN model and pre-trained on ImageNet is used to extract deep features from processed video frames. These features are also used to feed the BiLSTM network. The extracted videos are processed by BiLSTM Network to learn video representations. The output is then used for multiclass video classification.
  • The data set of this paper contained a total of 111,561 video clips. Out of that number, 57,908 clips were labeled as safe, 27,003 as sexual nudity, and 26,650 as fantasy violence. This distribution was made to ensure a balanced dataset for training and evaluation. 
  • The researchers compared different classifiers, such as Support Vector Machines (SVM), k-nearest Neighbors (KNN), and Random Forest that use EfficientNet. They found out that these machine learning classifiers were outperformed by their deep learning model.
  • This article mentions that for future work, combining temporal and spatial features and increasing the number of classifications, their model can be even better.
  • My project focuses on the same problem domain as the article’s. My pitch attempts to use machine learning and deep learning as well. This article could potentially provide me with guidance for dataset collection.
  • This article cited the previous article, “Bringing the Kid Back into YouTube Kids: Detecting Inappropriate Content on Video Streaming Platforms.”
  • This article is directly related to my area of research. I could use as a framework some of the methodologies from this paper because it is related to the use of the CNN model. 
  1. Evaluating YouTube videos for young children
[3] Neumann, Michelle M, and Christothea Herodotou. 2020. “Evaluating YouTube Videos for Young Children.” Education and Information Technologies 25 (2020): 4459-4475.

  • This article discusses the evaluation of YT videos for small children. The main concern of this article is about the quality of the appropriate quality of video content for children. This paper proposes to provide a framework for evaluating the quality of YT videos for young children, saying that this information can be used by educators and YouTube creators. I found this information relevant to my pitch because it can provide me guidance for video classification.
  • The article discussed YT official labeling for the classification of children’s content. The author proposes that these classifications do not necessarily study how children learn from screen media.
  • This study studies children’s interaction with screen media, develops a rubric for YT videos, and uses this rubric on YT videos.
  • The rubric they made is based on criteria related to age appropriateness, content quality, design features, and learning objectives.
  • They tested five videos with different content by human judgment. 
  • This paper can provide a framework to understand a clear understanding of what is appropriate for children on YT videos.
  • This paper discusses the use of Cohen’s Kappa to measure inter-rater reliability. In my project, I will probably be dealing with a large dataset, I can use a similar statistical method to asses between automated judgment and human judgment. 
  1. A Deep-Learning Framework For Accurate And Robust Detection Of Adult Content 
[4] Kusrini, Kusrini, Arief Setyanto, I Made Artha Agastya, Hartatik Hartatik, Krishna Chandramouli, and Ebroul Izquierdo. 2022. “A Deep-learning Framework for Accurate and Robust Detection of Adult Content.” Journal of Engineering Science and Technology. Engg Journals Publication.

  • This paper highlights the importance of filtering sensitive media, such as pornography and explicit content in general, on the internet. It talks about the negative consequences of exposing children to this content. 
  • The paper discusses old alternatives to solve this issue, such as IP-based blocking, textual analysis, and statistical color models for skin detection. However, this paper introduces the idea of using and transitioning to deep learning. To go from handcrafted features to deep neural networks. 
  • This paper proposes to use the deep-learning framework and spatial and temporal characteristics of video sequences for adult content detection. The framework is based on CNN architecture called Inception-v3 for spatial feature extraction. Temporal characteristics are modeled using long-term short memory (LSTM). The authors used different deep-learning network architectures such as VGGNET, RESNET, and Inception. Inception-v3 was selected to be the most efficient in the feature extraction. 
  • The Dataset is the NPDI dataset, which contains explicit (pornographic) content as well as non-explicit content. This dataset is used for training and classification. 
  • After extracting features from images or video frames using color distribution, shape information, skin likelihood, etc., clustering techniques (k-means) and classifiers (Support Vector Machines, SVM, etc.) are used to separate explicit from non-explicit content. 
  • They used HueSIFT and space-time interest points with SVM and Random forest for better classification using statistical machine learning algorithms. 
  • The paper also mentions that deep learning models such as AlexNet, GoogleLeNet have higher accuracy compared to traditional machine learning methods. 
  • The results said that the deep leaning based approach outperforms previous methods and achieves high accuracy, mostly using LSTM networks. 
  • This paper has great visuals that may help me better orient when creating the visuals and graphs for my paper. I probably won’t be using the dataset from this paper, but it is a great source to have a better understanding of how LSTM works and how could implemented in CNN architecture.
  1. Disturbed YouTube for Kids: Characterizing and Detecting Inappropriate Videos Targeting Young Children
[5] Papadamou, Kostantinos, Antonis Papasavva, Savvas Zannettou, Jeremy Blackburn, Nicolas Kourtellis, Ilias Leontiadis, Gianluca Stringhini, and Michael Sirivianos. 2020. “Disturbed YouTube for Kids: Characterizing and Detecting Inappropriate Videos Targeting Young Children.” In Proceedings of the International AAAI Conference on Web and Social Media, 14:522-533.

  • This paper mentions that many YT channels are for young children, but there is a significant amount of inappropriate content that is also targeted at young children.
  • YT recommendation system sometimes suggests not the best content for children.
  • This research collected a dataset of videos, including inappropriate and for children as well as random videos, and classified the videos as suitable, disturbing, restricted, or irrelevant.
  • A deep learning classifier is developed to detect disturbing videos automatically. 
  • The dataset is collected using YT Data APU and multiple approaches to obtain seed videos. Manual annotation is performed to label videos. 
  • This project starts with the seed videos as a starting point. These seed videos are videos that are appropriate for young children. Then, randomly choose from the recommended videos and choose the videos recommended by the platform. Then, a trained binary classifier is used to predict if the recommendation is appropriate or not. Keep track of whether the next video is appropriate or not. They analyze these random walks and product statistics. 
  • The authors trained a deep-learning classifier to classify the videos automatically. To train the classifier, the authors used a labeled dataset of videos.
  1. Very Deep Convolutional Networks For Large-Scale Image Recognition
[6] Simonyan, Karen, and Andrew Zisserman. 2014. “Very Deep Convolutional Networks for Large-Scale Image Recognition.” arXiv preprint arXiv:1409.1556.

  • This paper talks about the impact of convolutional neural network (ConvNet) depth in the context of large-scale image recognition. It explores the use of deep networks with small (3×3) convolutional filters and shows that increasing network depth improves performance.
  • They found that using small (3×3) convolutional filters and pushing the network depth to 16-19 layers can lead to a huge improvement. 
  • This paper focuses on ConvNets architecture and explores increasing the depth. 
  • During the paper, the authors argue their decisions about input size, convolutional layers, etc. The training process involves optimizing the multinomial logistics regression objective using mini-batch gradient descent with momentum. Regularization techniques such as weight decay and dropout are used.
  • To test, they used a trained ConvNet as input images for classification. The input is rescaled to a predefined smallest side as Q, which is a test scale.
  • The network is applied densely over the rescaled test image. The fully connected layers are converted to convolutional layers, and the result is a fully convolutional network is applied to the whole uncropped image.
  • This paper, very similar to other papers, will provide me with a better understanding of using CNN models on a big data set of videos. I want to review this paper to study how they worked increasing the depth of the layers.

2. Pitch #2: Machine learning for music recognition: 

This project aims to use machine learning for music genre recognition. In a world where music is everywhere, music recognition can be a bit of a challenge. I will use machine learning techniques to teach the system to recognize unique patterns and characteristic that defines the music genre.  I may try to duplicate already well-known models for music recognition and compare two of them to see which one is more efficient. 

Research question: To optimize music genre recognition using machine learning, can we compare and evaluate the performance of traditional machine learning classifiers, such as logistic regression and support vector machines (SVMs), with deep learning architectures like convolutional neural networks (CNNs) using a diverse dataset? How do these algorithms perform in terms of accuracy?

Data sets:

http://millionsongdataset.com/
https://www.kaggle.com/datasets/saurabhshahane/music-dataset-1950-to-2019
https://www.aicrowd.com/challenges/spotify-million-playlist-dataset-challenge
  1. A Machine Learning Approach to Musical Style Recognition
[1] Dannenberg, Roger B, Belinda Thom, and David Watson. 1997. “A Machine Learning Approach to Musical Style Recognition.” Carnegie Mellon University.

  • This article is about the applications of machine learning for music recognition. The main problem of this article is the challenge of making computers understand and perceive music beyond low-level features like low-level and tempo.
  • There is a lot of avoidance in research about high-level inference, so it is a challenge to build music-style classifiers. 
  • This article proposes the development of a machine learning classifier for music style recognition. 
  • The data collection in this research is by recording trumpet performances in different styles and then labeling them according to the style. 
  • The machine learning techniques used are naive Bayesian classifier, linear classifier, and neural networks to build the style classifier. 
  • They trained the classifier a portion with the data and then the rest of the data.
  • There are a lot of challenges because music can be multifaceted. Also, selecting features by trying to capture the essence is not easy.
  • There are a lot of overlapping styles in music.
  • The training of the data can be time-consuming and requires a lot of human effort.
  • This article could be relevant to my work to provide context and framework. 
  1. Music Genre Classification using Machine Learning Algorithms: A comparison
[1] Chillara, Snigdha, AS Kavitha, Shwetha A Neginhal, Shreya Haldia, and KS Vidyullatha. 2019. “Music Genre Classification Using Machine Learning Algorithms: A Comparison.” International Research Journal of Engineering and Technology 6, no. 5 (2019): 851-858.

  • They built multiple classification models and trained them over the Free Music Archive (FMA) dataset without hand-crafted features. 
  • To be able to classify the genre of a song, previous work had used both Neural Network (NN) and machine learning. To use NN, this has to be trained end to end using spectrograms of the audio signals. Machine Learning algorithms, like logistic regression and random forest, use hand-crafted features from the time and frequency domains. The manually extracted features like Mel-Frequency Cepstral Coefficients (MFCC), Chroma Features, Spectral Centroid, etc., are used to classify the music into its genres using ML algorithms like Logistic Regression, Random Forest, Gradient Boosting (XGB), Support Vector Machines (SVM). The VGG-16 CNN model gave the highest accuracy.
  • With deep learning algorithms, we can achieve the task of music genre classification without hand-crafted features. 
  • Convolutional neural networks (CNNs) are a great choice for classifying images. The 3-channel (R-G-B) matrix of an image is given to a CNN, which then trains itself on those images. 
  • In this project, the sound wave can be represented as a spectrogram, which can be treated as an image as spectrograms using Short Time Fourier Transform (STFT). CNN will process the spectrograms by capturing patterns. There are a lot of elements that will be extracted: statistical moments, zero-crossing rate, root mean square energy, tempo, etc. 
  • CNN accuracy was about 88%, but models such as LR and ANN may have higher accuracy.
  • The used dataset is from a Free Music Archive dataset. 
  • This paper is relevant to my project because provides a very detailed methodology in the research that I may use as guidance. 
  1. Music instrument recognition using deep convolutional neural networks
[1] Solanki, Arun, and Sachin Pandey. 2022. “Music Instrument Recognition Using Deep Convolutional Neural Networks.” International Journal of Information Technology 14, no. 3 (2022): 1659-1668.

  • Identifying musical instruments within many instruments is very complicated. The research uses a deep CNN to try to achieve this task. The music data set of the instrument is labeled and entered into the network and can estimate multiple instruments from audio signals of many lengths. 
  • Mel spectrogram representation is used to convert audio data into matrix format.
  • The neural network in this research is formed of 8 layers. The softmax function is also used to provide higher chances of identification. 
  • They use CNN for its convolutional layers, pooling layer, etc.
  • This article could be relevant to my project because the techniques for the extraction of relevant data from the music data set could help me in editing my data. Also, the information about deep learning, including CNN could be relevant as a guide. I could explore a similar research but using a different model.
  • The conversion of audio data into spectrograms is also very needed information for my project. 
  1. Music genre classification and music recommendation by using deep learning
[1] Elbir, Ahmet, and Nizamettin Aydin. 2020. “Music Genre Classification and Music Recommendation by Using Deep Learning.” Electronics Letters 56, no. 12 (2020): 627-629.

  • This paper talks about the importance of music in people’s lives and the need to classify music by genre. 
  • This paper reviews preview work on music classification, including Time-Frequency analysis, Mel Frequency Cepstral Coefficients, wavelet transformations, and support vector machines but the authors introduce a convolutional neural network for extracting features from raw music spectogram and mel scpectogram. They compared the performance of CNN-based methods with traditional processing methods.  
  • In this paper, the author proposes MusicRecNet, a new CNN-based model for music genre classification.
  • They claimed that this model outperformed their previous classifier. 
  • The dataset used in this research is the GTZAN dataset, which contains 1000 music samples from ten genres to evaluate the model.
  • Each music sample is divided into six 5-second parts and generates Mel Spectograms. MusicRecNet, with three layers and additional features such as dropout, is trained on these spectrograms. 
  • They used various classification algorithms such as MLP, logistic regression, random forest, LDA, KNN, and SVM applied to the vectors. 
  • The accuracy of MusicRecNet is 81.8%, but when used with SVM, it is 97.6%. 
  1. Music Genre Classification: A Comparative Study between Deep-Learning and Traditional Machine Learning Approaches
[1] Lau, Dhevan S, and Ritesh Ajoodha. 2022. “Music Genre Classification: A Comparative Study Between Deep Learning and Traditional Machine Learning Approaches.” In Proceedings of the Sixth International Congress on Information and Communication Technology: ICICT 2021, London, Volume 4, 239-247. Springer.

  • This paper compares the deep learning convolutional neural network approach with five traditional off-the-self classifiers using spectrograms and content-based features. This experiment uses GTZAN dataset, and the result is of 66% accuracy.
  • The paper introduces the importance of music and the role of genres in categorizing music. Music genre classification is identified as an Automatic Music Classification problem and part of Automatic Music Retrieval. 
  • The study uses automatic music genre classification using spectrogram images and content-based features extracted from audio signals. It uses deep learning CNN and traditional classifiers such as logistic regression, k-nearest neighbors, support vector machines, random forest, and multilayer perceptrons. 
  • The dataset has 1000 songs and is 30 seconds long. It includes raw audio files, extracted mel frequency cepstral coefficients spectrograms, and content-based features. 
  • To train the data, the data set was duplicated and divided into 10,000 3-second song pieces. 
  • The spectrogram size was 217×315 pixels, and 57 features were selected, such as chroma short time Fourier, root mean square error, spectral centroid, Harmony, Tempo, zero crossing rate, etc. 
  • Then CNN was used for deep learning, which consists of an input layer followed by five convolutional blocks. Each block had specific layers. They used a 2D matrix in a 1D array. 
  • The traditional machine learning models were implemented by using Scikit Learn library. 
  1. Multimodal Deep Learning for Music Genre Classification
[1] Oramas, Sergio, Francesco Barbieri, Oriol Nieto Caballero, and Xavier Serra. 2018. “Multimodal Deep Learning for Music Genre Classification.” Transactions of the International Society for Music Information Retrieval 1 (1): 4-21. Ubiquity Press.

  • This article discusses music genre classification using a multimodal deep-learning approach. They aim to develop a system that can automatically assign genre labels to music based on different types of data, including audio tracks, text reviews, and cover art. 
  • The authors proposed a 2 step approach: 1) to train a neural network for each modality (audio, text, etc.) on the genre classification and extract intermediate representations from each network and combine them in a multimodal approach.
  • Audio representation is learned from audio spectrograms using CNN,  text from music-related texts using a feedforward network, and visual uses a residual network. 
  • The model uses weight matrices and hyperbolic tangent functions to embed audio and visual representations into the shared space. 
  • The dataset used is The million song dataset, which consists of metadata. The data is split into training, validation, and test sets.
  • According to the authors, combining the three modalities outperforms individual modalities. 

Leave a Reply