Investigating Modulation Spectrogram Features for Deep Neural Network-Based Automatic Speech Recognition


Deep neural network (DNN) based acoustic modelling has been shown to yield significant improvements over Gaussian Mixture Models (GMM) for a variety of automatic speech recognition (ASR) tasks. In addition, it is also becoming popular to use rich speech representations, such as full-resolution spectrograms and perceptually motivated features, as input to the DNNs as they are less sensitive to the increase in the input dimensionality. In this work, we evaluate the performance of a DNN trained on the perceptually motivated modulation envelope spectrogram features that model the temporal amplitude modulations within sub-band speech signals. The proposed approach is shown to outperform DNNs trained on a variety of conventional features such as Mel, PLP and STFT features on both TIMIT phone recognition and the AURORA-4 word recognition tasks. It is also shown that the approach outperforms a sophisticated auditory model based on Gabor filter bank features on TIMIT and the channel matched conditions of the AURORA-4 database.

INTERSPEECH, Dresden, Germany
Deepak Baby
Deepak Baby
Applied Scientist

My research interests include speech recognition, enhancement and deep learning.

Hugo Van hamme
Hugo Van hamme

Professor at KU Leuven, Belgium

comments powered by Disqus