Senior Capstone


Paper

Code

Software overview

Abstract

Writer identification based on handwriting plays an important role in forensic document analysis. Convolutional Neural Networks (CNNs) have been successfully applied to this problem over the last decade. Most research in this area has concentrated on extracting local features from handwriting samples and then combining them into global descriptors for writer retrieval. Extracting local features from small patches of handwriting is a reasonable choice given the lack of large training datasets. However, the methods for aggregating local features are imperfect and do not take into account the spatial relationships between patches. This research aims to train a CNN with a triplet loss function to extract global feature vectors directly from images of handwritten text, eliminating the intermediate step involving local features. Extracting global features from handwriting samples is not a novel idea, but this approach has never been combined with a triplet architecture. Because training a CNN to learn global descriptors requires a large amount of training data, a data augmentation method is employed. The model is trained and tested on the CVL handwriting dataset, using leave-one-out cross-validation to measure soft top-N and hard top-N performance.

Software Architecture

Workflow

I use the CVL writer database to train the model. All handwriting samples go through a data augmentation and pre-processing step that standardizes the input for the CNN. Samples in the training set are augmented, whereas only one page is produced per sample in the test set. Triplets of samples are chosen from each batch to train the CNN. The output of the CNN is a 256-dimensional vector. To evaluate the model, I build a writer database for the samples in the test set.
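The evaluation step can be sketched as follows. This is a minimal illustration, not the project's actual evaluation code: it assumes retrieval ranks database entries by Euclidean distance between embeddings, and the function name `top_n_scores` and the toy 2D vectors are hypothetical. Soft top-N counts a query as a hit when at least one of its N nearest neighbors was written by the same writer; hard top-N requires all N neighbors to match.

```python
import numpy as np

def top_n_scores(embeddings, writer_ids, n):
    """Leave-one-out soft and hard top-N retrieval scores (hypothetical helper)."""
    embeddings = np.asarray(embeddings, dtype=float)
    writer_ids = np.asarray(writer_ids)
    # Pairwise Euclidean distances between all embeddings.
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Leave-one-out: a query must never retrieve itself.
    np.fill_diagonal(dist, np.inf)
    # Indices of the n nearest database entries for every query.
    nearest = np.argsort(dist, axis=1)[:, :n]
    same_writer = writer_ids[nearest] == writer_ids[:, None]
    soft = same_writer.any(axis=1).mean()  # at least one correct neighbor
    hard = same_writer.all(axis=1).mean()  # all n neighbors correct
    return float(soft), float(hard)

# Toy data: three samples by writer "A", two by writer "B".
toy_embeddings = [[0, 0], [0, 1], [0, 2], [5, 5], [5, 6]]
toy_writers = ["A", "A", "A", "B", "B"]
soft, hard = top_n_scores(toy_embeddings, toy_writers, n=2)
print(soft, hard)
```

In the toy data, writer "B" has only one other sample in the database, so its queries can never satisfy hard top-2 even though they satisfy soft top-2.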

Data Augmentation

Each handwriting sample goes through the same set of steps:
1. Original handwriting sample.

2. Sample is segmented into words.

3. The words from a single sample are randomly permuted into a line of handwriting. The words are centered vertically.

4. Step 3 is repeated L times to get L lines of handwriting. These lines are concatenated vertically to produce a page.

5. A page is then broken up into non-overlapping square patches. The remainder of the page is discarded. The resulting patches are resized to 224×224 pixels.

6. Steps (4) and (5) are repeated N times.

7. Finally, binarization is applied. The patches are thresholded using adaptive Gaussian thresholding with a 37×37 kernel.
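The permutation, page-building, and patch-cutting steps above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the project's code: word images are assumed to be grayscale NumPy arrays that are already segmented, every line reuses all words of the sample (so all lines have equal width), and the patch side length `side` is a free parameter because the text does not specify it. All function names are hypothetical. Resizing and binarization are left out.

```python
import numpy as np

def make_line(words, rng, background=255):
    """Permute word images and join them into one line, centered vertically."""
    order = rng.permutation(len(words))
    height = max(w.shape[0] for w in words)
    padded = []
    for i in order:
        word = words[i]
        top = (height - word.shape[0]) // 2  # vertical centering offset
        canvas = np.full((height, word.shape[1]), background, dtype=np.uint8)
        canvas[top:top + word.shape[0], :] = word
        padded.append(canvas)
    return np.hstack(padded)

def make_page(words, num_lines, rng, background=255):
    """Stack num_lines independently permuted lines into one page."""
    lines = [make_line(words, rng, background) for _ in range(num_lines)]
    # Every line holds the same words, so all lines share the same width.
    return np.vstack(lines)

def square_patches(page, side):
    """Cut non-overlapping side x side patches; the remainder is discarded."""
    rows, cols = page.shape[0] // side, page.shape[1] // side
    return [page[r * side:(r + 1) * side, c * side:(c + 1) * side]
            for r in range(rows) for c in range(cols)]

def augment(words, num_lines, num_pages, side, seed=0):
    """Repeat page generation and patch cutting num_pages times."""
    rng = np.random.default_rng(seed)
    patches = []
    for _ in range(num_pages):
        patches.extend(square_patches(make_page(words, num_lines, rng), side))
    return patches

# Toy "word images": blank dark rectangles of different sizes.
toy_words = [np.zeros((10, 20), np.uint8),
             np.zeros((6, 30), np.uint8),
             np.zeros((8, 14), np.uint8)]
toy_patches = augment(toy_words, num_lines=4, num_pages=3, side=32)
print(len(toy_patches), toy_patches[0].shape)
```

For the remaining two steps, one option would be OpenCV's `cv2.resize` to bring each patch to 224×224 and `cv2.adaptiveThreshold` with `cv2.ADAPTIVE_THRESH_GAUSSIAN_C` and a block size of 37; whether the original pipeline used these exact calls is not stated here.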

CNN framework

The CNN model consists of three convolutional blocks followed by a single fully connected layer. Each convolutional block includes a 2D convolutional layer, batch normalization, max pooling, and dropout. The final 256-dimensional output vector is normalized. I implemented this CNN in Keras with the TensorFlow backend.

The model was trained for 15 epochs with batch gradient descent and the Adam optimizer, with an initial learning rate of 3e-4. Ten epochs of training with semi-hard negative triplet mining were followed by 5 epochs of hard negative triplet mining.
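The two mining strategies can be illustrated with a small NumPy sketch. It is not the training code used here: it assumes the common formulation in which the triplet loss is max(d(a, p) - d(a, n) + margin, 0) with squared Euclidean distances, a semi-hard negative is one that is farther from the anchor than the positive but still within the margin, and a hard negative is simply the negative closest to the anchor. The margin value, the function names, and the toy embeddings are all hypothetical.

```python
import numpy as np

def pairwise_sq_dist(emb):
    """Squared Euclidean distance between every pair of embeddings."""
    diff = emb[:, None, :] - emb[None, :, :]
    return (diff ** 2).sum(axis=-1)

def mine_triplets(emb, labels, margin, mode):
    """Pick (anchor, positive, negative) index triplets from one batch.

    mode="semi-hard": negative farther than the positive, but within the margin.
    mode="hard":      the negative closest to the anchor.
    """
    labels = np.asarray(labels)
    dist = pairwise_sq_dist(np.asarray(emb, dtype=float))
    triplets = []
    for a in range(len(labels)):
        negatives = np.flatnonzero(labels != labels[a])
        if negatives.size == 0:
            continue
        for p in np.flatnonzero(labels == labels[a]):
            if p == a:
                continue
            if mode == "hard":
                candidates = negatives
            else:
                d_neg = dist[a, negatives]
                mask = (d_neg > dist[a, p]) & (d_neg < dist[a, p] + margin)
                candidates = negatives[mask]
            if candidates.size:
                # Among the candidates, take the one closest to the anchor.
                n = candidates[np.argmin(dist[a, candidates])]
                triplets.append((a, int(p), int(n)))
    return triplets

def triplet_loss(emb, triplets, margin):
    """Mean of max(d(a, p) - d(a, n) + margin, 0) over the given triplets."""
    dist = pairwise_sq_dist(np.asarray(emb, dtype=float))
    losses = [max(dist[a, p] - dist[a, n] + margin, 0.0) for a, p, n in triplets]
    return float(np.mean(losses)) if losses else 0.0

# Toy batch: two writers, two samples each (values are not unit length).
toy_emb = [[0.0, 0.0], [1.0, 0.0], [1.2, 0.0], [3.0, 0.0]]
toy_labels = [0, 0, 1, 1]
for mode in ("semi-hard", "hard"):
    chosen = mine_triplets(toy_emb, toy_labels, margin=0.5, mode=mode)
    print(mode, chosen, triplet_loss(toy_emb, chosen, margin=0.5))
```

Hard mining yields more triplets with larger losses on the toy batch, which is one common reason to start with semi-hard negatives and switch to hard negatives once the embedding has stabilized.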

Pitches


  • Use deep reinforcement learning (RL) to tune the hyperparameters of a neural network (NN): the learning rate, the regularization parameter lambda, the number of layers, the number of units in each layer, and the activation functions. The RL agent's overall cost function will include metrics such as the NN's accuracy (or F1 score) on the training and validation sets, the time taken to learn, and measures of over- and underfitting. This network would be trained on different types of problems.
  • For this idea, I'm using the game of Pong (Atari) as a test environment. My plan is to introduce a specific pipeline for training the AI agent to play the game. Instead of applying Policy Gradients directly, I will first train the agent to predict the next frames of the game. I will use a Recurrent Neural Network (RNN) to learn (approximate) the transition function of an unknown environment. This transition function will take the previous n states of the game (in raw pixel form) and the agent's action, and output a representation of the future state of the environment. The intuition is that the agent will first learn the 'laws of physics' of the environment (exploration), and that this will help it learn to play the game more efficiently. After learning the weights of the transition function, I will implement a Reinforcement Learning algorithm (Policy Gradients) that reuses the learned weights (transfer learning), and train this deep neural network by letting it play a number of games and learn from experience.
  • I will train a CNN to verify, given images of handwritten text, whether two handwriting samples belong to the same person. To generate more labeled data, I will take a dataset of handwritten text images and break up each image into windows containing a few words. I will assume that all words in a single image were written by one person.