I finished my presentation. My next step is to add an abstract and a longer introduction to my proposal paper and finish the final version. I did more research in the past week and plan to change my modeling method from GMM-UBM to a Convolutional Neural Network (CNN) or Deep Neural Network (DNN). GMM-UBM is a classic approach but also “old-fashioned”; CNN- and DNN-based methods are newer and generally reported to perform better. GMM-UBM’s performance also drops as the number of speakers increases. However, I do not have enough time to change methods this semester. I will do more research during winter break and will probably switch next semester.
I read some new papers about different modeling algorithms and started to worry about the accuracy of my system. Accuracy depends not only on the modeling but also on the training dataset and the quality of the acoustic input (the speaking environment). Still, selecting a suitable modeling algorithm is important. The popular models now are HMM, VQ, DTW, GMM, UBM, and i-vector. I have temporarily chosen a hybrid GMM-UBM. I might change it in the future or combine it with other models to improve accuracy. My goal is to reach an accuracy of at least 90%.
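To make the GMM-UBM idea concrete, here is a minimal sketch of how a verification score could be computed: the average log-likelihood ratio between a speaker model and the universal background model. All model parameters, frames, and the threshold are made-up toy values, and the names `llr_score` and `THRESHOLD` are my own. In a real system the speaker model would typically be adapted from a UBM trained on many speakers, and the frames would be MFCC vectors.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(x, gmm):
    """Log-likelihood of one frame under a GMM given as (weight, mean, var) tuples."""
    terms = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
    peak = max(terms)  # log-sum-exp trick for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def llr_score(frames, speaker_gmm, ubm):
    """Average per-frame log-likelihood ratio: speaker model vs. UBM."""
    total = sum(
        gmm_log_likelihood(f, speaker_gmm) - gmm_log_likelihood(f, ubm)
        for f in frames
    )
    return total / len(frames)

# Toy 2-D models with invented parameters (not trained on real data).
ubm = [(0.5, [0.0, 0.0], [4.0, 4.0]), (0.5, [3.0, 3.0], [4.0, 4.0])]
speaker = [(0.5, [1.0, 1.0], [0.5, 0.5]), (0.5, [2.0, 2.0], [0.5, 0.5])]

genuine = [[1.1, 0.9], [1.9, 2.1], [1.0, 1.2]]      # frames near the speaker model
impostor = [[-2.0, -1.5], [5.0, 4.5], [-1.0, 4.0]]  # frames far from it

THRESHOLD = 0.0  # hypothetical decision threshold; would be tuned on held-out data
accept_genuine = llr_score(genuine, speaker, ubm) > THRESHOLD
accept_impostor = llr_score(impostor, speaker, ubm) > THRESHOLD
```

A score above the threshold means the frames fit the claimed speaker better than the general population. Moving the threshold trades false accepts against false rejects, which is what the 90% accuracy goal would be measured against.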
I discussed my proposal draft with my advisor. I got her feedback and suggestions, and now know how to revise and improve my proposal. In the past week, I read more papers about the GMM-UBM modeling method that I plan to use for my project. I understand the specific procedure now, but it is still hard to fully understand the underlying principle. My other problem now is to find a suitable dataset and decide whether my system will be text-dependent. There are three primary approaches to speaker verification: text-dependent, mixed, and text-independent. The text-independent approach is very difficult and complicated to implement because the user can say anything to pass the verification. The text-dependent approach is restricted and vulnerable to spoofing attacks; for example, someone can replay a pre-recorded voice to pass the verification. Therefore, the mixed approach is better. It restricts the text to some extent but is safer against spoofing attacks. For example, the user can only speak the numbers one to ten, but the sequence is random every time. However, it is hard to find a dataset consisting entirely of audio files of spoken numbers in English. Now I need to decide which approach my system will use.
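The mixed approach could be sketched as follows: the system picks a fresh random sequence of number words for every attempt, so a replayed recording of an earlier session would not match. The helper names `make_prompt` and `transcript_matches` are hypothetical, and the list of spoken words is assumed to come from a speech recognizer that is not shown here.

```python
import secrets

# The restricted vocabulary: the numbers one to ten.
DIGIT_WORDS = ["one", "two", "three", "four", "five",
               "six", "seven", "eight", "nine", "ten"]

def make_prompt(length=4):
    """Pick a random sequence of number words for this verification attempt."""
    return [secrets.choice(DIGIT_WORDS) for _ in range(length)]

def transcript_matches(prompt, spoken_words):
    """Check that the recognized words equal the prompt, in order."""
    return [w.lower() for w in spoken_words] == prompt

prompt = make_prompt(5)
print("Please say:", " ".join(prompt))
```

The text check is only half of the verification. The voiceprint comparison still has to confirm that the person saying the prompt is the enrolled speaker.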
I finished the first draft of my proposal. I read some blogs about speaker verification technology and found out that I was wrong about some aspects (actually, I was confused). Those blogs helped me understand speaker verification more deeply. So I revised my framework and flowcharts: take voice input -> feature extraction -> modeling -> database. Modeling is the most difficult part of speaker verification. The most popular models are the Hidden Markov Model, Gaussian Mixture Model, Vector Quantization, etc. I am not yet sure which one I will use. It all depends on my dataset and the customers' needs. I need to experiment with several models to find out which one works best. For now, I chose GMM in my proposal.
I found an MFCC library on GitHub and explored it a little. It takes a wav file directly as input and returns one N*1 array (a sequence of acoustic vectors). I recorded my voice, converted it to a wav file, and briefly tested the code. It took my wav file and returned an array containing a sequence of vectors. I will use this library in my project, but there are many related factors that I still need to study. I also wrote the timeline for the rest of this semester and next semester. My next step is to keep working with this MFCC library and explore the Dynamic Time Warping library on GitHub.
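Before exploring a DTW library, it helps me to see the idea in a few lines. This is a minimal sketch of textbook Dynamic Time Warping between two sequences of feature vectors, assuming each sequence is a plain list of equal-length numeric vectors (for example, one MFCC vector per frame). It is not the GitHub library's API, and the function names are my own.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_distance(seq_a, seq_b):
    """Accumulated cost of the best warping path between two sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = best cost of aligning the first i frames of seq_a
    # with the first j frames of seq_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # seq_a advances alone
                                 cost[i][j - 1],      # seq_b advances alone
                                 cost[i - 1][j - 1])  # both advance together
    return cost[n][m]

# The same "utterance" spoken slowly aligns perfectly with the original.
fast = [[0.0], [1.0], [2.0]]
slow = [[0.0], [0.0], [1.0], [2.0], [2.0]]
other = [[5.0], [6.0], [7.0]]
```

Because DTW stretches and compresses the time axis, `fast` and `slow` get a distance of zero even though their lengths differ, while `other` stays far away. That is why DTW suits comparing recordings spoken at different speeds.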
I finished my proposal outline. The next step is to write my proposal draft. I also discussed it with Xunfei, and she helped me draw a better flowchart. I gained a clearer understanding of the flow of my project. I downloaded an SDK for iFlytek’s voiceprint recognizer product for reference.
I finalized my proposal topic as “Applying Voiceprint Recognition Technology to Identity Verification”. The keywords are voice recognition, voiceprint, feature extraction, voice detection, and voice verification. One difficulty I might encounter is background noise in the voice input. If the noise is loud, it may affect feature extraction and voice recognition. I will probably need to explore methods for removing noise.
My project uses voiceprint recognition technology to check whether the voiceprint of the input matches the corresponding one in the database. This technology can be used in many identity verification scenarios, such as bank customer service, door locks, and business transactions. The main steps will be approximately: take voice input -> (remove noise ->) extract voice features -> build models with selected algorithms -> compare voice features -> check if voiceprint matches. The possible algorithms are VQ, MFCC, and DTW.
I haven’t decided on my topic yet, but I have been reading papers related to my three ideas to gain a deeper understanding of them.
My first idea is an AI technology for voiceprint recognition. It can be used to prevent voice spoofing attacks on businesses, banks (such as phone customer service), etc. The main steps for voice recognition are: take vocal input -> identity-feature analysis -> deviating feature selection -> deviating feature comparison -> distance to reference pattern estimate -> check if voice matches. The main modeling algorithms are GMM, JFA, GMM-SVM, etc. In the paper “Vulnerability of speaker verification systems against voice conversion spoofing attacks: The case of telephone speech”, the authors experimented with several algorithms and concluded that although JFA has a high error rate, the converted samples with JFA sound very mechanical, so humans can easily distinguish them. The authors of the paper “Voice command recognition system based on MFCC and VQ algorithms” discuss and examine two significant modules: MFCC and DTW. Their results were good, so I will consider using these two modules.
For my second idea, creating an AI tool for safe driving, the key technology is 3D dynamic facial recognition. I learned that most facial recognition technology can be divided into two main parts: facial detection and facial recognition. I can use open-source libraries like OpenCV and dlib to do facial detection. There are three factors I need to care about: detection rate, misdetection rate, and false alarm rate. The authors of the paper “BP4D-Spontaneous: A high-resolution spontaneous 3D dynamic facial expression database” report a newly developed spontaneous 3D dynamic facial expression database. I am not sure if I can or should use their new database. Although the paper “Real time facial expression recognition in video using support vector machines” primarily discusses detecting emotion from facial expressions, it gives me some useful information about facial recognition technology.
My third idea is creating a smart tool to grade algebra on handwritten homework. The app takes a photo of the handwritten homework and uses OCR technology to extract the text and grade it. The main technology is just OCR. Although the two papers I read both describe their own apps and OCR systems, I can borrow some of the techniques they used, like matrix matching and fuzzy logic for feature extraction.
(I was attending a CS conference in California this week, so my post is late.) I have read some papers related to my three topics and gained a clear understanding of the technologies I need for them. I also explored the new ideas from Xunfei’s feedback. There are already apps available that can scan a printed music sheet and play the music. Most of them are not free. I only found two free apps, PlayScore2 and iSeeNotes. PlayScore2 works much better than iSeeNotes. I tested the app with my printed music sheet, and the result was not as good as I expected. It couldn’t read all the notes. If I go with this topic, my goal will be to improve the accuracy. But scanning and reading handwritten music sheets would be very challenging. Even I cannot read those old music sheets very well.
I am still exploring my new ideas.
Idea 1 Title: AI Assistant for Safety Driving
Description: My idea is to use facial recognition to detect fatigued or dangerous driving. A small camera will be placed in front of the driver, and the AI Assistant will be a mobile phone app. For fatigued driving, the camera detects behaviors like closing the eyes for a long time, yawning, frequently rubbing the eyes, etc. For other dangerous driving behaviors, the camera can detect behaviors like using a mobile phone, turning around to chat, smoking, etc. After detecting a dangerous driving behavior, the app will respond according to the behavior: sounding an alarm, playing refreshing songs, or reporting the location of the nearest rest stop. There are already research projects and real products available, but most of them are commercial, expensive, and bulky. What makes my idea different is that I am going to develop it as a cheap and handy everyday tool. I can probably also improve the detection accuracy in low light or other special environments. The main technology is facial recognition, and there are many open-source resources online.
Idea 2 Title: Voice Print Recognition Identifier
Description: Facial recognition has become very popular recently, but many people overlook voiceprint recognition (VPR). VPR is cheaper and more convenient because it only needs one microphone. A voice can reveal a lot of information, such as age, gender, emotion, and environment. Wearing glasses or makeup might affect the performance of facial recognition, but an adult’s voice is relatively stable. There are already research projects and real applications of voiceprint recognition. For example, WeChat users can log in with their voice. But I can innovate in how it is applied, for example by combining it with bank customer service. If someone steals another person’s password and calls Chase to make a money transfer, VPR can detect whether the caller’s voiceprint matches the account owner’s. It could also be used for fighting crime, in home robots, etc. VPR technology already exists, so I need to innovate in other aspects. The two primary uses are speaker identification and speaker verification. A voiceprint is a spectrogram of a voice, formed from wavelength, frequency, intensity, and hundreds of other characteristics. I can differentiate people by differences in voice purity, resonance, average pitch, and vocal range.
The commonly used spectrogram features are: MFCC, PLP, FBank, D-vector (by Google), deep features, bottleneck features, and tandem features. Existing models: GMM-UBM, JFA, GMM-UBM i-vector, DNN i-vector, Supervised-UBM i-vector. Scoring algorithms: SVM, PLDA, LDA, cosine distance.
(But I don’t know if the scope is too big.)
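Of the scoring algorithms listed for this idea, cosine scoring is the simplest to illustrate. Below is a minimal sketch that assumes each utterance has already been reduced to one fixed-length vector (for example an i-vector or d-vector). The threshold of 0.8 and the function names are hypothetical choices of mine.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

def verify(enrolled, test, threshold=0.8):
    """Accept the speaker when the two voice vectors are similar enough."""
    return cosine_similarity(enrolled, test) >= threshold

# Toy 3-D vectors; real speaker vectors have hundreds of dimensions.
enrolled = [1.0, 2.0, 3.0]
same_speaker = [1.1, 1.9, 3.2]
different_speaker = [3.0, -1.0, 0.2]
```

For speaker verification, the test vector is compared against the single enrolled vector of the claimed identity. For speaker identification, it would be compared against every enrolled speaker and the best match taken.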
Idea 3 Title: Using OCR to grade hand-written homework
Description: Take a picture of the handwritten homework and use OCR technology to extract the content. (OCR is a technology that converts text in images into editable text.) Then check whether the algebra calculations are correct, and mark the result in the picture next to each problem. There is already an app like that available online, but it only checks algebra calculations, and it is not that accurate. It requires good handwriting. I can expand my project to improve the OCR, or make it check simple English homework as well. This would especially help parents in non-English-speaking countries who cannot speak English well enough to help their children. I could probably also expand the idea into a mobile phone app that summarizes and analyzes the grade data: collecting incorrect problems, analyzing which category has the most mistakes, etc.
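The grading step after OCR could be sketched like this. It assumes the OCR stage (not shown) has already produced one text line per problem in the form `a op b = answer` with integer operands, which is a simplification. The pattern and the function names `grade_line` and `grade_page` are my own.

```python
import re
from fractions import Fraction

# One problem per line, e.g. "3 + 4 = 7" or "5 / 2 = 2.5".
PATTERN = re.compile(
    r"^\s*(-?\d+)\s*([+\-*/])\s*(-?\d+)\s*=\s*(-?\d+(?:\.\d+)?)\s*$"
)

OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}

def grade_line(line):
    """Return True/False for a readable problem, None if it cannot be parsed."""
    match = PATTERN.match(line)
    if match is None:
        return None  # OCR output did not look like a simple equation
    left, op, right, answer = match.groups()
    a, b = Fraction(left), Fraction(right)
    if op == "/" and b == 0:
        return False  # division by zero has no valid answer
    # Fractions keep the comparison exact, so 5 / 2 equals 2.5 with no rounding.
    return OPS[op](a, b) == Fraction(answer)

def grade_page(lines):
    """Grade every line and pair it with its result."""
    return [(line, grade_line(line)) for line in lines]

page = ["3 + 4 = 7", "6 * 7 = 41", "5 / 2 = 2.5", "smudged text"]
results = grade_page(page)
```

Returning `None` for unreadable lines keeps OCR failures separate from wrong answers, so bad handwriting does not count against the student.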
Buying tickets for popular concerts: the application imitates a real user buying concert tickets on a website. Users can set it up before the ticket sale opens. As soon as the tickets go on sale, the application immediately buys them. If the tickets are sold out, it keeps reloading the webpage until new tickets become available.
One time I wanted to buy a concert ticket, but the concert was very popular and the tickets sold out immediately. I kept reloading the website, hoping other people would cancel their orders. But I couldn’t keep reloading the page every second for the whole day, so I wondered whether software could do that for me. I did some research and found that web browser automation tools such as Selenium can imitate the actions of a real user in a web browser. I searched online and found that most software of this kind buys train tickets instead of concert tickets. The primary technology I will use is Selenium, and I will also need to learn how to read page source code. I will primarily write a Python script with Selenium to imitate a real user buying a ticket online. Selenium provides a web driver for Google Chrome to go to the target website. The driver can locate elements by tag name, element ID, XPath, class name, etc., and it can click and type automatically.
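The reload-until-available idea could be sketched as below. The polling loop is plain Python and takes two callbacks, so it can be tried without a browser. `make_selenium_hooks` shows one way those callbacks might be built with Selenium's Chrome driver. The URL and the button ID are placeholders I made up; a real ticket site would need its own locators found by reading its page source.

```python
import time

def wait_for_tickets(check_available, refresh, interval_seconds=1.0,
                     max_attempts=None, sleep=time.sleep):
    """Poll until tickets appear; return the attempt number, or None on giving up."""
    attempts = 0
    while max_attempts is None or attempts < max_attempts:
        attempts += 1
        if check_available():
            return attempts
        refresh()                # reload the page and try again
        sleep(interval_seconds)  # pause so the site is not flooded with requests
    return None

def make_selenium_hooks(url, buy_button_id):
    """Build the two callbacks from a Chrome WebDriver (needs the selenium package)."""
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    driver.get(url)

    def check_available():
        # find_elements returns an empty list when the button is missing.
        buttons = driver.find_elements(By.ID, buy_button_id)
        return bool(buttons) and buttons[0].is_enabled()

    return driver, check_available, driver.refresh

# Hypothetical usage; the URL and element ID are placeholders:
# driver, check, refresh = make_selenium_hooks("https://tickets.example.com/show", "buy")
# wait_for_tickets(check, refresh, interval_seconds=5.0)
```

Keeping the loop separate from the Selenium calls also helps with the website-specific problem: each ticket site would only need its own pair of callbacks.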
But there are several difficulties in my project. The first is payment security. Some ticket websites require payment immediately when the order is placed, so users need to provide their website account and bank card information beforehand. I need to be careful about bank card security. The second problem is that this software is website-specific. Different websites have different processes for buying tickets, so I need to write different code for each site.
The third problem is that some websites require complicated verification at user login. I don’t know if web browser automation can deal with very complicated verification. The last issue is that I don’t know if the scope of this project is big enough. I could have my software work with several well-known ticket websites and possibly build a mobile app version.