Recent Advances in Robust Speech Recognition Technology

Speech recognition is becoming part of our everyday lives. Voice dialing in automobiles allows users to use their phones while keeping their eyes on the road and their hands on the steering wheel. Many mobile phones also have voice dialing, and some smartphones even let users dictate text, which can be faster than typing on a small keyboard. The automobile and mobile phone scenarios have in common that there’s often background noise, so it is important that the speech recognizer works well in its presence. This book presents some interesting statistics on the impact of speech recognition technology when driving.

Modern speech recognition systems are statistical systems trained with many hours of speech samples. A mismatch occurs when the system has to recognize speech that is significantly different from the speech samples used to train it. This can happen if a child wants to use the system but no children’s speech was used to train it, or if a person with an accent tries to use the system but only samples from native speakers were used in training. It can also happen if the system was trained with noise-free speech and has to be used in a noisy cafeteria. A speech recognition system is called robust when its error rate does not increase significantly across varied test conditions. This book describes techniques to build speech recognition systems that are robust to background noise.

The user interface in both the automobile and the mobile phone scenarios above often follows the so-called “push-to-talk” method: the user presses a button on the steering wheel or on the phone and then speaks. The system then needs to determine when the user is finished, typically when there’s a long enough pause. The problem of detecting the end of speech is not trivial. If the system looks for too short a silence, the user may not be done speaking and the command gets chopped, resulting in a recognition error. On the other hand, if the system is programmed to “hear” too long a silence, the perceived system latency increases, and so does the chance that speech from another user leaks in. Voice Activity Detection, as it is often called, is particularly difficult in the presence of background noise, especially non-stationary noises such as other speakers. This book devotes the first four chapters to this problem.
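The pause-length trade-off described above can be illustrated with a minimal sketch. It assumes the audio has already been split into frames with one energy value per frame, and it uses a fixed energy threshold to separate speech from silence. The function name, the parameters, and the sample values are hypothetical and chosen only for illustration; this is not the method of any chapter in the book, and a fixed threshold of this kind is exactly what tends to break down in noisy conditions.

```python
def find_end_of_speech(frame_energies, energy_threshold, min_trailing_silence_frames):
    """Return the index of the frame at which the utterance is declared
    finished, or None if no long-enough pause follows any speech."""
    speech_seen = False
    silent_run = 0
    for i, energy in enumerate(frame_energies):
        if energy >= energy_threshold:
            # Frame is treated as speech: reset the count of trailing silence.
            speech_seen = True
            silent_run = 0
        elif speech_seen:
            # Frame is treated as silence that follows some speech.
            silent_run += 1
            if silent_run >= min_trailing_silence_frames:
                return i
    return None


# Hypothetical utterance: two words separated by a short pause.
energies = [0, 0, 5, 6, 0, 0, 7, 8, 0, 0, 0, 0]

short_pause = find_end_of_speech(energies, 1.0, 2)  # stops inside the gap between words
long_pause = find_end_of_speech(energies, 1.0, 4)   # waits until after the second word
```

With the short setting the detector stops at the pause between the two words, so the second word is chopped off. With the longer setting it captures both words but only reports the end several frames after the speech has finished, which is the added latency the text refers to.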

To reduce the mismatch between training and test conditions, the speech samples used to train the speech recognizer often contain a wide variety of background noises, in the hope that the noise in the test utterance is similar to noise encountered during training. While that happens sometimes, it is very hard to cover all possible noise conditions in training. The rest of the book describes techniques that attempt either to denoise the speech signal prior to recognition or to modify the recognizer to be more robust to such noise conditions.

The book edited by Prof. Ramirez and Prof. Gorriz provides a broad overview of the problem of noise-robust speech recognition. The chapters, written by experts in their respective fields, will acquaint the reader with a number of topics in this space and provide researchers and practitioners with a set of useful techniques that have been developed in the last few years.

Alex Acero
Microsoft Research
Redmond, WA
USA
