Weekly update: Annotated Bibliography (18 Annotated)

Idea 1: Masked Face Detection:

Introduction: The virus that causes COVID-19 spreads mainly from person to person through respiratory droplets produced when an infected person coughs, sneezes, or talks, so it is important that people wear masks in public places. However, it is difficult to monitor a large number of people at the same time. Hence, my idea is to use machine learning to detect whether a person is wearing a mask. Hopefully, this idea can help reduce the spread of the coronavirus.

Citation 1: Detecting Masked Faces in the Wild with LLE-CNNs

  • S. Ge, J. Li, Q. Ye and Z. Luo, “Detecting Masked Faces in the Wild with LLE-CNNs,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, 2017, pp. 426-434, doi: 10.1109/CVPR.2017.53.
  • Link: https://openaccess.thecvf.com/content_cvpr_2017/papers/Ge_Detecting_Masked_Faces_CVPR_2017_paper.pdf?fbclid=IwAR2UcTzeJsOAI6wPzdlkuMG4NaHMc-b1Gwmf-zl5hD3ueIEfBH-3HOgpMIE
    • Includes the MAFA dataset, with 30,811 Internet images and 35,806 masked faces. We can use the dataset to train or test our deep learning model.
    • Proposes LLE-CNNs for masked face detection, which we can use as a starting point and as a baseline to match or beat.
    • To look up: Convolutional Neural Network (CNN)
    • The authors show that on the MAFA dataset, the proposed approach outperforms six state-of-the-art methods by at least 15.6%.
    • Check whether the authors have published code to reproduce the experimental results.

The paper introduces a new dataset for masked face detection as well as a model named LLE-CNNs, which the authors claim outperforms six state-of-the-art methods by at least 15.6%. Fortunately, the dataset is publicly available and is exactly what we need for the problem we are proposing.

Citation 2: FDDB: A Benchmark for Face Detection in Unconstrained Settings

The linked GitHub repository contains the MAFA dataset, whose images are divided into three main categories: faces with a mask, faces without a mask, and faces without a mask that are blocked by a phone, a hand, or other people. This dataset fits the goal of the research exactly.

Citation 3: Object-oriented Image Classification of Individual Trees Using Erdas Imagine Objective: Case Study of Wanjohi Area, Lake Naivasha Basin, Kenya

  • Link: https://pdfs.semanticscholar.org/67b5/21baf2b8828e13b7fd73ab0108d2cbfa6f8c.pdf
    • The author provides a method named Object-Oriented Image Classification, together with the Imagine Objective tool, which will help us understand more about the method we are going to use for the research.

Although this research focuses on object classification, it provides good background on image classification.

Citation 4: Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks

  • K. Zhang, Z. Zhang, Z. Li and Y. Qiao, “Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks,” in IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499-1503, Oct. 2016, doi: 10.1109/LSP.2016.2603342.
  • Link: https://ieeexplore.ieee.org/abstract/document/7553523
    • The authors propose a cascaded CNN-based multi-task framework that performs face detection and alignment jointly.
    • The model achieves real-time performance on 640×480 VGA images with a 20×20 minimum face size.
    • The approach relies on three main components to predict face and landmark locations in a coarse-to-fine manner: a carefully designed cascaded CNN architecture, an online hard sample mining strategy, and joint face alignment learning.

This paper provides a model that detects and aligns faces in difficult environments with varied poses, illumination, and occlusions. It gives us a bigger picture of what face detection is, how it differs from face alignment, and how this method can help in detecting a person’s face.

Citation 5: RefineFace: Refinement Neural Network for High Performance Face Detection

  • S. Zhang, C. Chi, Z. Lei and S. Z. Li, “RefineFace: Refinement Neural Network for High Performance Face Detection,” in IEEE Transactions on Pattern Analysis and Machine Intelligence, doi: 10.1109/TPAMI.2020.2997456.
  • Link: https://arxiv.org/pdf/1909.04376.pdf
    • The authors propose a face detector named RefineFace that can detect faces in extreme poses as well as small faces in the background.
    • Extensive experiments on WIDER FACE, AFW, PASCAL Face, FDDB, and MAFA demonstrate that the method achieves state-of-the-art results and runs at 37.3 FPS with ResNet-18 for VGA-resolution images.

The paper provides a model that can detect faces in extreme poses or at small sizes. This is helpful to us since the first step of our problem is to detect faces.

Citation 6: Very Deep Convolutional Networks for Large-Scale Image Recognition

  • Simonyan, Karen & Zisserman, Andrew. (2014). Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 1409.1556. 
  • Link: https://arxiv.org/pdf/1409.1556.pdf
    • The authors propose a very deep Convolutional Neural Network architecture (VGG), which took first place in localisation and second place in classification at the Large Scale Visual Recognition Challenge 2014 (ILSVRC2014).

The architecture can be used for our problem as well if we train the model on our own training set with our own training loss.

Idea 2: Speaker Recognition:

Introduction: BookTubeSpeech is a newly released dataset for speech analysis problems. The dataset contains 8,450 YouTube videos (7.74 min per video on average) that each contains a single unique speaker. Not much work on speaker recognition has been done using this dataset. My work is to provide one of the first baselines on this dataset for speaker recognition / speaker verification.

Citation 1: Deep Speaker: an End-to-End Neural Speaker Embedding System

  • Li, Chao & Ma, Xiaokong & Jiang, Bing & Li, Xiangang & Zhang, Xuewei & Liu, Xiao & Cao, Ying & Kannan, Ajay & Zhu, Zhenyao. (2017). Deep Speaker: an End-to-End Neural Speaker Embedding System.
  • https://arxiv.org/pdf/1705.02304.pdf
    • The authors propose Deep Speaker, a neural embedding system that maps utterances of speakers to a hypersphere where speaker similarity is measured by cosine similarity.
    • To look up: i-vector paper, equal error rate (EER)
    • Through experiments on three distinct datasets, the authors show that Deep Speaker is able to outperform a DNN-based i-vector baseline. They claim that Deep Speaker reduces the verification EER by 50% (relative) and improves the identification accuracy by 60% (relative).
    • Make sure that the datasets the authors used are publicly available.
    • Fortunately, the authors do publish their code, so we can train and test on the BookTubeSpeech dataset.

The paper presents a novel end-to-end speaker embedding model named Deep Speaker. Although the paper is not new, it is definitely something we can use for our problem since the authors do publish their code, which is readable and runnable.
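To make the cosine-similarity scoring and the EER metric from the notes above more concrete, here is a minimal sketch in plain Python. The embeddings and trial scores are made-up toy values, not Deep Speaker outputs, and the function names are hypothetical. The EER is only approximated here by sweeping a threshold over the observed scores.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def equal_error_rate(scores, labels):
    """Approximate EER for verification trials.

    scores: similarity score per trial (higher = more likely same speaker).
    labels: 1 for a same-speaker trial, 0 for a different-speaker trial.
    Returns the mean of the false accept and false reject rates at the
    threshold where the two rates are closest.
    """
    positives = [s for s, l in zip(scores, labels) if l == 1]
    negatives = [s for s, l in zip(scores, labels) if l == 0]
    best_gap, eer = float("inf"), None
    for threshold in sorted(set(scores)):
        # A trial is accepted when its score reaches the threshold.
        far = sum(s >= threshold for s in negatives) / len(negatives)
        frr = sum(s < threshold for s in positives) / len(positives)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


# Toy 3-dimensional "embeddings" (assumed values, for illustration only).
enrolled = [0.9, 0.1, 0.2]
same_speaker = [0.8, 0.2, 0.1]
other_speaker = [0.1, 0.9, 0.3]

trial_scores = [
    cosine_similarity(enrolled, same_speaker),
    cosine_similarity(enrolled, other_speaker),
]
trial_labels = [1, 0]
print("scores:", trial_scores)
print("approximate EER:", equal_error_rate(trial_scores, trial_labels))
```

In a real experiment the scores would come from embeddings of BookTubeSpeech utterances, and a finer threshold sweep or interpolation would give a more precise EER.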

Citation 2: VoxCeleb: Large-scale Speaker Verification in the Wild

  • Nagrani, Arsha & Chung, Joon Son & Xie, Weidi & Zisserman, Andrew. (2019). VoxCeleb: Large-scale Speaker Verification in the Wild. Computer Speech & Language. 60. 101027. 10.1016/j.csl.2019.101027. 
  • Link: https://www.robots.ox.ac.uk/~vgg/publications/2019/Nagrani19/nagrani19.pdf
    • The authors introduce a dataset named VoxCeleb, which contains over a million real-world utterances from over 6,000 speakers.
    • Proposes a pipeline based on computer vision techniques to create the dataset from open-source media, including YouTube.
    • Investigates CNN architectures, with different aggregation methods and training loss functions, that can still identify speakers’ voices under varying conditions.

This research contributes a varied dataset along with CNN architectures and a CNN-based face recognition method used to identify speakers. These methods would be beneficial for our research since BookTubeSpeech is also a dataset collected from YouTube that contains both images and voice. The methods might also help in harder cases, such as when the speaker’s voice is affected by other sounds, for example background noise or other human voices.

Citation 3: X-Vectors: Robust DNN Embeddings For Speaker Recognition

  • D. Snyder, D. Garcia-Romero, G. Sell, D. Povey and S. Khudanpur, “X-Vectors: Robust DNN Embeddings for Speaker Recognition,” 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, 2018, pp. 5329-5333, doi: 10.1109/ICASSP.2018.8461375.
  • Link: https://www.danielpovey.com/files/2018_icassp_xvectors.pdf
    • The authors propose a speaker recognition system using a Deep Neural Network.
    • X-vectors are still considered state-of-the-art for speaker recognition.
    • The paper also proposes using PLDA on top of the x-vector embeddings to increase discriminability.

The authors of the paper propose a DNN-based speaker embedding model that is currently state-of-the-art for speaker recognition and speaker verification problems. Hence, it goes without saying that we should use this as one of our models to report results on the BookTubeSpeech dataset.

Citation 4: Generalized End-to-End Loss for Speaker Verification

  • L. Wan, Q. Wang, A. Papir and I. L. Moreno, “Generalized End-to-End Loss for Speaker Verification,” 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, 2018, pp. 4879-4883, doi: 10.1109/ICASSP.2018.8462665.
  • Link: https://arxiv.org/abs/1710.10467
    • The authors propose a new loss function called GE2E that does not require an initial stage of example selection.
    • The new loss function makes the model faster to train while still achieving competitive performance.

The paper proposes a new loss function that the authors claim yields competitive performance while being fast to train.
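As a rough illustration of why no example-selection stage is needed, here is a small NumPy sketch of the softmax variant of the GE2E loss as I understand it from the paper. Every utterance in a batch of N speakers × M utterances is compared against every speaker centroid at once. The function name, the toy embeddings, and the fixed scale and offset (w = 10, b = -5, which are learnable in the actual method) are assumptions for illustration.

```python
import numpy as np


def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def ge2e_softmax_loss(embeddings, w=10.0, b=-5.0):
    """Sketch of the GE2E softmax loss.

    embeddings: array of shape (n_speakers, n_utterances, dim),
                with at least 2 utterances per speaker.
    w, b: similarity scale and offset (fixed here, learnable in practice).
    """
    n_spk, n_utt, _ = embeddings.shape
    # L2-normalise each utterance embedding.
    e = embeddings / np.linalg.norm(embeddings, axis=-1, keepdims=True)
    centroids = e.mean(axis=1)  # one centroid per speaker

    total = 0.0
    for j in range(n_spk):
        for i in range(n_utt):
            sims = np.empty(n_spk)
            for k in range(n_spk):
                if k == j:
                    # Exclude the utterance itself from its own centroid.
                    c = (e[j].sum(axis=0) - e[j, i]) / (n_utt - 1)
                else:
                    c = centroids[k]
                sims[k] = w * cosine(e[j, i], c) + b
            # Pull towards the true speaker, push away from all others.
            total += -sims[j] + np.log(np.exp(sims).sum())
    return total


# Toy batch: 2 speakers x 2 utterances x 2 dims (made-up values).
separated = np.array([[[1.0, 0.1], [1.0, -0.1]],
                      [[0.1, 1.0], [-0.1, 1.0]]])
mixed = np.array([[[1.0, 0.1], [0.1, 1.0]],
                  [[1.0, -0.1], [-0.1, 1.0]]])
print("loss, well-separated speakers:", ge2e_softmax_loss(separated))
print("loss, mixed-up speakers:", ge2e_softmax_loss(mixed))
```

The loss should be much smaller when each speaker's utterances cluster together than when they are mixed, which is the behaviour the training procedure exploits.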

Citation 5: Toward Better Speaker Embeddings: Automated Collection of Speech Samples From Unknown Distinct Speakers

  • M. Pham, Z. Li and J. Whitehill, “Toward Better Speaker Embeddings: Automated Collection of Speech Samples From Unknown Distinct Speakers,” ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 7089-7093, doi: 10.1109/ICASSP40776.2020.9053258.
  • Link: https://users.wpi.edu/~jrwhitehill/PhamLiWhitehill_ICASSP2020.pdf 
    • The paper proposes a pipeline for automatic data collection to train speaker embedding models. Using the pipeline, the authors also collected a dataset named BookTubeSpeech containing speech audio from 8,450 speakers.
    • The dataset contains mostly clean speech, i.e. no background noise.

The paper proposes a pipeline for large-scale data collection to train speaker embedding models. The authors also contribute a dataset named BookTubeSpeech, which is the main dataset we are going to use for our experiments.

Citation 6: Probabilistic Linear Discriminant Analysis

  • Ioffe S. (2006) Probabilistic Linear Discriminant Analysis. In: Leonardis A., Bischof H., Pinz A. (eds) Computer Vision – ECCV 2006. ECCV 2006. Lecture Notes in Computer Science, vol 3954. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11744085_41
  • Link: https://link.springer.com/content/pdf/10.1007%2F11744085_41.pdf 
    • PLDA is often used on top of the output of speaker embedding models to increase speaker discriminability.
    • Normally, papers related to speaker recognition or speaker verification report results both with and without PLDA.

The author proposes Probabilistic LDA, a generative probability model with which we can both extract the features and combine them for recognition. We can use PLDA on top of our models’ outputs to gain an increase in performance.

Idea 3: Sports player result prediction using machine learning:

Introduction: “How many yards will an NFL player gain after receiving a handoff?” I will be taking part in a Kaggle competition built around this question. Kaggle provides a dataset covering players from different teams, the teams, the plays, and players’ stats including position and speed, which we can analyze to build a model of how far an NFL player can run after receiving the ball.

Citation 1: A machine learning framework for sport result prediction

  • Bunker, Rory & Thabtah, Fadi. (2017). A Machine Learning Framework for Sport Result Prediction. Applied Computing and Informatics. 15. 10.1016/j.aci.2017.09.005. 
  • Link: https://www.sciencedirect.com/science/article/pii/S2210832717301485
    • Even though the paper is about sport result prediction not player performance prediction, it does provide good insights on how to tackle our problem. In particular, the authors provide a framework that we can apply to our problem. 
    • Moreover, each step of the framework is clearly explained with detailed examples. The framework can be used for both traditional ML models as well as for artificial neural networks (ANN).

The paper provides not only a critical survey of the literature on Machine Learning for sport result prediction but also a framework that we can apply to our problem. While the survey can help us get a sense of which method works best, the framework will let us know what to do next after we have picked our model.

Citation 2: Scikit-learn: Machine Learning in Python

  • Pedregosa, Fabian, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, et al. “Scikit-Learn: Machine Learning in Python.” arXiv.org, June 5, 2018.
  • Link: https://arxiv.org/abs/1201.0490
    • The authors introduce a good framework for training traditional Machine Learning models as well as artificial neural networks.
    • The library is in Python, which is one of the languages that I am most familiar with.

Citation 3: Using machine learning to predict sport scores — a Rugby World Cup example

Although this is not formal research, it walks through the steps of a project on this topic in detail. It also lists all of the tools that are necessary and fit the topic of the research.

Citation 4: Long Short-Term Memory

  • Hochreiter, Sepp, and Jürgen Schmidhuber. “Long short-term memory.” Neural computation 9.8 (1997): 1735-1780.
  • Link: http://citeseerx.ist.psu.edu/viewdoc/download?doi=
    • The authors introduce Long Short-Term Memory (LSTM), a model that can handle sequence data.
    • We can definitely expect sequence data in sports data. For example, the number of touchdowns per game for an NFL player over a season.

For our problem, we will definitely try different deep learning architectures. LSTM is one of the architectures that we are going to try.
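To get a feel for how an LSTM processes a sequence one step at a time, here is a minimal NumPy sketch of a single LSTM cell. It uses the commonly taught variant with a forget gate, which was added after the original 1997 formulation. The weights are random, the per-game input values are made up, and all names are hypothetical, so this only shows the mechanics and is not a trained model.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    x: input vector (D,), h_prev / c_prev: previous hidden and cell state (H,).
    W: (4H, D), U: (4H, H), b: (4H,), stacked in the gate order
    input, forget, candidate, output.
    """
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:H])            # input gate
    f = sigmoid(z[H:2 * H])        # forget gate
    g = np.tanh(z[2 * H:3 * H])    # candidate cell values
    o = sigmoid(z[3 * H:4 * H])    # output gate
    c = f * c_prev + i * g         # new cell state
    h = o * np.tanh(c)             # new hidden state
    return h, c


# Toy sequence: one feature per game (made-up values).
sequence = np.array([[1.0], [0.0], [2.0], [1.0]])
D, H = 1, 3
rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(4 * H, D))
U = rng.normal(scale=0.5, size=(4 * H, H))
b = np.zeros(4 * H)

h, c = np.zeros(H), np.zeros(H)
for x in sequence:
    h, c = lstm_step(x, h, c, W, U, b)
print("final hidden state:", h)
```

In practice we would use a deep learning library's LSTM layer and learn the weights from data; the final hidden state would then feed a prediction layer.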

Citation 5: XGBoost: A Scalable Tree Boosting System

Not every problem requires deep learning models; we should try traditional Machine Learning techniques as well. Hence, we should try XGBoost.

Citation 6: Principal component analysis

  • Abdi, Hervé, and Lynne J. Williams. “Principal component analysis.” Wiley interdisciplinary reviews: computational statistics 2.4 (2010): 433-459.
  • Link: https://onlinelibrary.wiley.com/doi/abs/10.1002/wics.101
    • Principal Component Analysis (PCA) is among the most effective dimensionality reduction algorithms. 
    • When faced with a large set of correlated variables, principal components allow us to summarize this set with a smaller number of representative variables that explain most of the variability in the original set.

The PCA features can also be used as new features that we can feed into our machine learning models.
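As a small illustration of summarizing correlated variables with fewer components, here is a sketch of PCA computed with NumPy's SVD. The three synthetic features are made up (they are all noisy multiples of one hidden variable), and the helper name is hypothetical. In practice a library implementation could be used instead.

```python
import numpy as np


def pca_fit_transform(X, n_components):
    """Project X (n_samples, n_features) onto its top principal components.

    Returns the projected data, the component directions, and the
    fraction of total variance explained by each kept component.
    """
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = (S ** 2) / (X.shape[0] - 1)
    ratio = explained_variance[:n_components] / explained_variance.sum()
    return X_centered @ components.T, components, ratio


# Synthetic data: three strongly correlated features (assumed, for illustration).
rng = np.random.default_rng(0)
hidden = rng.normal(size=200)
X = np.column_stack([
    hidden,
    2.0 * hidden + 0.1 * rng.normal(size=200),
    -hidden + 0.1 * rng.normal(size=200),
])

Z, components, ratio = pca_fit_transform(X, n_components=1)
print("reduced shape:", Z.shape)
print("explained variance ratio:", ratio)
```

Because the three features are nearly redundant, a single component should capture almost all of the variance, and `Z` could then be fed to a model in place of the original columns.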
