Learning Speech Recognition from TF Speech Commands
At the end of 2017, Google launched a competition on Kaggle using its Speech Commands dataset. In this competition we were challenged to predict simple commands from input user speech; each utterance is around one second long.
This competition attracted me as a chance to gain more experience in speech recognition, which I believe is much like computer vision image problems; the only difference is that the input is audio, which is then transformed into a 2D feature set.
Introduction
When I joined this competition, I started learning more about speech recognition under the supervision of Prof. Mohamed Afify.
My first attempt in this contest was to learn about the features extracted from the input audio (MFCC, mel, and log-mel). I also started reading about how to make the model robust to noise in the input audio, so I worked on the data augmentation part, adding white noise to the input training utterances.
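As a rough illustration, here is a minimal NumPy sketch of the white-noise idea; the function name and noise factor are my own choices, not the repository's API:

```python
# A minimal white-noise augmentation sketch, assuming utterances are
# loaded as float arrays in [-1.0, 1.0] (e.g. 16 kHz mono wav files).
import numpy as np

def add_white_noise(samples: np.ndarray, noise_factor: float = 0.005) -> np.ndarray:
    """Mix Gaussian white noise into an utterance for noise-robust training."""
    noise = np.random.randn(len(samples))
    augmented = samples + noise_factor * noise
    # Keep the mixed signal inside the valid amplitude range.
    return np.clip(augmented, -1.0, 1.0)
```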
To make my experimentation easier, I created Python code driven by a config file, making it easy to choose the input data feature type and size, the model parameters, and the data augmentation techniques.
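A sketch of what loading such a config can look like, assuming PyYAML and a file like baseline.yml; the key names here are illustrative assumptions, not the repository's actual schema:

```python
# Load experiment settings from a YAML config (key names are assumptions).
import yaml

with open("baseline.yml") as f:
    cfg = yaml.safe_load(f)

sample_rate = cfg["sample_rate"]        # e.g. 16000 Hz
fingerprint = cfg["fingerprint_type"]   # one of: mfcc, mel, log_mel
use_ctc = cfg.get("use_ctc", False)     # switch to CTC-based models
```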
Dataset
The Speech Commands dataset is organized as folders, each containing a set of wav files, where the parent folder name specifies the keyword.
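A minimal sketch of walking that layout to pair each wav file with its keyword label:

```python
# Walk the Speech Commands layout: each sub-folder name is the keyword
# label for the wav files it contains.
import os

def list_utterances(root: str):
    for label in sorted(os.listdir(root)):
        folder = os.path.join(root, label)
        if not os.path.isdir(folder):
            continue
        for name in os.listdir(folder):
            if name.endswith(".wav"):
                yield os.path.join(folder, name), label
```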
Parameters Used for Wav Reading
These parameters specify the sampling rate of the input wav files, the time shift in milliseconds, the training clip duration in milliseconds, the window size and window stride in milliseconds, the fingerprint type (mfcc, mel, or log_mel), and a CTC flag denoting whether CTC-based models are used.
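As a worked example of how these parameters determine the fingerprint shape, with typical values that are assumptions rather than the repository's defaults:

```python
# How the wav-reading parameters map to the 2D fingerprint's time axis.
sample_rate = 16000        # Hz
clip_duration_ms = 1000    # training clip length
window_size_ms = 30        # analysis window
window_stride_ms = 10      # hop between windows

clip_samples = sample_rate * clip_duration_ms // 1000    # 16000
window_samples = sample_rate * window_size_ms // 1000    # 480
stride_samples = sample_rate * window_stride_ms // 1000  # 160

# Number of frames (rows) in the fingerprint:
num_frames = 1 + (clip_samples - window_samples) // stride_samples
print(num_frames)  # 98 for these settings
```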
Models used
- Baseline model as in Convolutional Neural Networks for Small-footprint Keyword Spotting (a minimal sketch appears after this list)
- VGGNet/ResNet models working on 2D fingerprints
- Character-based CTC speech models, as a learning playground; I didn't expect these to get good results given the length of the input utterances and the small size of the dataset
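For the baseline, here is a minimal Keras sketch of a small-footprint conv model over a 2D fingerprint, in the spirit of the paper; the layer sizes and class count are illustrative assumptions, not the paper's exact configuration:

```python
# A small conv net over (frames, coefficients, 1) fingerprints.
import tensorflow as tf

num_frames, num_coeffs, num_classes = 98, 40, 12  # assumed shapes

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(num_frames, num_coeffs, 1)),
    tf.keras.layers.Conv2D(64, (20, 8), activation="relu"),
    tf.keras.layers.MaxPooling2D(pool_size=(1, 3)),  # pool in frequency only
    tf.keras.layers.Conv2D(64, (10, 4), activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```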
Post Processing
After predicting the command found in the utterance, I tried using a language model to fix the words predicted by the previously trained CTC acoustic model.
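For context, here is a minimal sketch of greedy CTC decoding, which turns per-frame character probabilities into the string that a language model can then correct; the alphabet and blank index are illustrative assumptions:

```python
# Greedy CTC decoding: pick the best character per frame, collapse
# repeats, and drop blanks.
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz "  # indices 0..26; blank is 27
BLANK = len(ALPHABET)

def greedy_ctc_decode(frame_probs: np.ndarray) -> str:
    """frame_probs: (num_frames, len(ALPHABET) + 1) softmax outputs."""
    best = frame_probs.argmax(axis=1)
    chars, prev = [], BLANK
    for idx in best:
        if idx != prev and idx != BLANK:
            chars.append(ALPHABET[idx])
        prev = idx
    return "".join(chars)
```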
The other trained models, in contrast, predict the command class directly from each input utterance, with an unknown class for any other command.
Data Augmentation
I used a data augmentation technique to generalize the model to speech utterances with speeds not found in the training set (this technique successfully improved the model). The config controls:
- the operations to be used, speed and stretch (sketched below)
- the percentage of augmentation for each available class
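A minimal sketch of the speed and stretch operations, assuming librosa (with its keyword-only API); the rates and fixed clip length are illustrative:

```python
# Speed/stretch augmentation sketch using librosa.
import librosa
import numpy as np

def stretch(samples: np.ndarray, rate: float) -> np.ndarray:
    """Time-stretch without changing pitch (rate > 1 speeds up)."""
    return librosa.effects.time_stretch(samples, rate=rate)

def speed(samples: np.ndarray, sr: int, factor: float) -> np.ndarray:
    """Resample so playback at the original rate changes speed and pitch."""
    return librosa.resample(samples, orig_sr=sr, target_sr=int(sr / factor))

def fix_length(samples: np.ndarray, length: int) -> np.ndarray:
    """Pad or trim back to the fixed training clip length."""
    return librosa.util.fix_length(samples, size=length)
```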
Results
Using the Speech Commands test set data:
- baseline.yml config: test accuracy 90.4%
- vggnet.yml config: test accuracy 92.3%
Kaggle Results
My main focus in this contest was to get into speech recognition, so all the experimentation was for learning purposes; still, I was able to finish in the top 15% of this large contest of 1,315 teams.
References
- The code used for this contest is on GitHub: Tensorflow-Keyword-Spotting.
- One of the resources used to learn speech recognition: Stanford CS224S, Spring 2017.