Editors: Javier Ramírez, Juan Manuel Górriz

Recent Advances in Robust Speech Recognition Technology

eBook: US $21
Special Offer (PDF + Printed Copy): US $136
Printed Copy: US $119
Library License: US $84
ISBN: 978-1-60805-389-6 (Print)
ISBN: 978-1-60805-172-4 (Online)
Year of Publication: 2011
DOI: 10.2174/9781608051724111010


This E-book is a collection of articles that describe advances in speech recognition technology. Robustness in speech recognition refers to the need to maintain high recognition accuracy even when the quality of the input speech is degraded, or when the acoustical, articulatory, or phonetic characteristics of speech in the training and testing environments differ. Obstacles to robust recognition include acoustical degradations produced by additive noise, linear filtering, nonlinearities in transduction or transmission, and impulsive interfering sources, as well as diminished accuracy caused by changes in articulation produced by the presence of high-intensity noise sources. Although progress over the past decade has been impressive, there are significant obstacles to overcome before speech recognition systems can reach their full potential. Automatic speech recognition (ASR) systems must be robust at all levels, so that they can handle background or channel noise, the occurrence of unfamiliar words, new accents, new users, or unanticipated inputs. They must exhibit more 'intelligence' and integrate speech with other modalities, deriving the user's intent by combining speech with facial expressions, eye movements, gestures, and other input features, and communicating back to the user through multimedia responses. Therefore, as speech recognition technology is transferred from the laboratory to the marketplace, robustness in recognition becomes increasingly significant. This E-book should be useful to computer engineers interested in recent developments in speech recognition technology.


Speech recognition is becoming part of our everyday lives. Voice dialing in automobiles allows users to use their phones while keeping their eyes on the road and their hands on the steering wheel. Many mobile phones also have voice dialing, and some smartphones even let users dictate text, which can be faster than typing on a small keyboard. The automobile and the mobile phone scenarios have in common that there is often background noise, so it is important that the speech recognizer works well in its presence. This book presents some interesting statistics on the impact of speech recognition technology when driving.

Modern speech recognition systems are statistical systems trained with many hours of speech samples. A mismatch occurs when the system has to recognize speech that is significantly different from the speech samples used to train it. This can happen if a child wants to use the system but no children's speech was used to train it, or if a person with an accent tries to use the system but only samples from native speakers were used in training. It can also happen if the system was trained with noise-free speech and has to be used in a noisy cafeteria. A speech recognition system is called robust when its error rate does not increase significantly across the various conditions in which it is tested. This book describes techniques to build speech recognition systems that are robust to background noise.
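To make the notion of "error rate" concrete, the sketch below computes the commonly used word error rate (word-level edit distance divided by the number of reference words) for one reference transcript under two test conditions. The function name `word_error_rate` and the sample transcripts are illustrative assumptions, not taken from the book.

```python
from typing import Sequence


def word_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words."""
    if not reference:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(reference)


# Hypothetical recognizer outputs for the same utterance in two conditions.
reference = "call john smith at home".split()
quiet_output = "call john smith at home".split()
noisy_output = "call joan smith home".split()

wer_quiet = word_error_rate(reference, quiet_output)
wer_noisy = word_error_rate(reference, noisy_output)
print(f"quiet: {wer_quiet:.2f}, noisy: {wer_noisy:.2f}")
```

In this made-up example the noisy output contains one substitution and one deletion, so its error rate is well above the quiet one; a robust system would keep the two figures close.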

The user interface in both the automobile and the mobile phone scenarios above often follows the so-called “push-to-talk” method: the user presses a button on the steering wheel or on the phone and then speaks. The system then needs to determine when the user is finished, typically when there is a long enough pause. The problem of detecting the end of speech is not trivial. If the system waits for too short a silence, the user may not be done speaking and the command gets chopped, resulting in a recognition error. On the other hand, if the system is programmed to “hear” too long a silence, then the perceived system latency increases, and so does the chance that speech from another user leaks in. Voice Activity Detection, as it is often called, is a difficult problem in the presence of background noise, especially non-stationary noises such as other speakers. This book devotes the first four chapters to this problem.
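The trade-off between a short and a long silence timeout can be illustrated with a deliberately simple energy-threshold endpointer. This is only a sketch under stated assumptions: the function `find_end_of_speech`, the fixed energy threshold, and the frame energies are hypothetical, and it is not one of the methods described in the book's chapters.

```python
from typing import Optional, Sequence


def find_end_of_speech(
    frame_energies: Sequence[float],
    energy_threshold: float,
    min_trailing_silence_frames: int,
) -> Optional[int]:
    """Return the frame index at which the end of speech is declared.

    A frame counts as speech when its energy exceeds energy_threshold.
    The endpoint is declared once speech has been seen and is followed by
    min_trailing_silence_frames consecutive non-speech frames.
    Returns None if that never happens.
    """
    seen_speech = False
    silence_run = 0
    for i, energy in enumerate(frame_energies):
        if energy > energy_threshold:
            seen_speech = True
            silence_run = 0
        elif seen_speech:
            silence_run += 1
            if silence_run >= min_trailing_silence_frames:
                return i
    return None


# Made-up frame energies: two bursts of speech separated by a short pause.
energies = [0.1, 0.9, 0.8, 0.1, 0.1, 0.9, 0.7, 0.1, 0.1, 0.1, 0.1]

print(find_end_of_speech(energies, 0.5, 2))  # 4: fires during the pause, chopping the command
print(find_end_of_speech(energies, 0.5, 4))  # 10: keeps both bursts, at the cost of extra latency
```

A fixed energy threshold like this one is exactly what breaks down in non-stationary noise, since a competing speaker also produces high-energy frames.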

To reduce the mismatch between training and test conditions, the speech samples used to train the speech recognizer often contain a large variation in background noises, with the hope that the noise of the test utterance is similar to the noise encountered during training. While that happens sometimes, it is very hard to cover in training all the possible types of noise conditions. The rest of the book describes techniques that attempt to either denoise the speech signal prior to recognition, or modify the recognizer to be more robust to such noise conditions.
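One common way to obtain training data with varied background noise is to add recorded noise to clean speech at a chosen signal-to-noise ratio (SNR). The sketch below shows the scaling arithmetic only; the function name `mix_at_snr` and the toy signals are assumptions for illustration, and the book's chapters may build their training sets differently.

```python
import math
from typing import List, Sequence


def mix_at_snr(clean: Sequence[float], noise: Sequence[float], snr_db: float) -> List[float]:
    """Add noise to clean speech, scaled so that the mixture has the given SNR in dB."""
    if not clean or len(clean) != len(noise):
        raise ValueError("clean and noise must be non-empty and of equal length")
    clean_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum(n * n for n in noise) / len(noise)
    if noise_power == 0.0:
        raise ValueError("noise signal has zero power")
    # Solve SNR(dB) = 10 * log10(clean_power / (gain**2 * noise_power)) for gain.
    gain = math.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return [x + gain * n for x, n in zip(clean, noise)]


# Toy signals standing in for a clean utterance and a noise recording.
clean = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]

# One noisy copy of the utterance per SNR level.
noisy_copies = {snr: mix_at_snr(clean, noise, snr) for snr in (20.0, 10.0, 0.0)}
```

Covering several SNR levels and noise types in this way widens the range of conditions seen in training, but, as noted above, it cannot anticipate every noise encountered at test time.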

The book edited by Prof. Ramírez and Prof. Górriz provides a broad overview of the problem of noise-robust speech recognition. The chapters, written by experts in their respective fields, will acquaint the reader with a number of topics in this space and provide researchers and practitioners with a set of useful techniques that have been developed in the last few years.

Alex Acero
Microsoft Research
Redmond, WA

