Publications

Retrieval and browsing of spoken content

Published in:
IEEE Signal Process. Mag., Vol. 25, No. 3, May 2008, pp. 39-49.

Summary

Ever-increasing computing power and connectivity bandwidth, together with falling storage costs, are resulting in an overwhelming amount of data of various types being produced, exchanged, and stored. Consequently, information search and retrieval has emerged as a key application area. Text-based search is the most active area, with applications that range from Web and local network search to searching for personal information residing on one's own hard-drive. Speech search has received less attention perhaps because large collections of spoken material have previously not been available. However, with cheaper storage and increased broadband access, there has been a subsequent increase in the availability of online spoken audio content such as news broadcasts, podcasts, and academic lectures. A variety of personal and commercial uses also exist. As data availability increases, the lack of adequate technology for processing spoken documents becomes the limiting factor to large-scale access to spoken content. In this article, we strive to discuss the technical issues involved in the development of information retrieval systems for spoken audio documents, concentrating on the issue of handling the errorful or incomplete output provided by ASR systems. We focus on the usage case where a user enters search terms into a search engine and is returned a collection of spoken document hits.
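The core problem the article names, retrieving documents from errorful ASR output, is often handled by indexing recognition alternatives with their posterior weights rather than a single 1-best transcript. As a minimal illustrative sketch (not the article's system; the n-best representation and function name here are assumptions for illustration), a posterior-weighted inverted index might look like:

```python
from collections import defaultdict

def build_posterior_index(nbest_lists):
    """Build an inverted index over n-best ASR hypotheses.

    nbest_lists maps doc_id -> [(hypothesis_text, posterior), ...].
    Each term is credited with the summed posterior of the hypotheses
    that contain it, so uncertain recognitions still contribute
    (down-weighted) hits instead of being lost entirely.
    """
    index = defaultdict(dict)
    for doc_id, hyps in nbest_lists.items():
        for text, posterior in hyps:
            for term in set(text.lower().split()):
                index[term][doc_id] = index[term].get(doc_id, 0.0) + posterior
    return index

# A query term is then scored against each document by its accumulated
# posterior mass, rather than a hard 0/1 occurrence from the 1-best output.
nbest = {"doc1": [("the cat sat", 0.6), ("the hat sat", 0.4)]}
idx = build_posterior_index(nbest)
```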

Measuring the readability of automatic speech-to-text transcripts

Summary

This paper reports initial results from a novel psycholinguistic study that measures the readability of several types of speech transcripts. We define a four-part figure of merit to measure readability: accuracy of answers to comprehension questions, reaction-time for passage reading, reaction-time for question answering and a subjective rating of passage difficulty. We present results from an experiment with 28 test subjects reading transcripts in four experimental conditions.

High-performance low-complexity wordspotting using neural networks

Published in:
IEEE Trans. Signal Process., Vol. 45, No. 11, November 1997, pp. 2864-2870.

Summary

A high-performance low-complexity neural network wordspotter was developed using radial basis function (RBF) neural networks in a hidden Markov model (HMM) framework. Two new complementary approaches substantially improve performance on the talker-independent Switchboard corpus. Figure of Merit (FOM) training adapts wordspotter parameters to directly improve the FOM performance metric, and voice transformations generate additional training examples by warping the spectra of training data to mimic across-talker vocal tract variability.

Speech recognition by machines and humans

Published in:
Speech Commun., Vol. 22, No. 1, July 1997, pp. 1-15.

Summary

This paper reviews past work comparing modern speech recognition systems and humans to determine how far recent dramatic advances in technology have progressed towards the goal of human-like performance. Comparisons use six modern speech corpora with vocabularies ranging from 10 to more than 65,000 words and content ranging from read isolated words to spontaneous conversations. Error rates of machines are often more than an order of magnitude greater than those of humans for quiet, wideband, read speech. Machine performance degrades further below that of humans in noise, with channel variability, and for spontaneous speech. Humans can also recognize quiet, clearly spoken nonsense syllables and nonsense sentences with little high-level grammatical information. These comparisons suggest that the human-machine performance gap can be reduced by basic research on improving low-level acoustic-phonetic modeling, on improving robustness with noise and channel variability, and on more accurately modeling spontaneous speech.

Speech recognition by humans and machines under conditions with severe channel variability and noise

Published in:
SPIE, Vol. 3077, Applications and Science of Artificial Neural Networks III, 21-24 April 1997, pp. 46-57.

Summary

Despite dramatic recent advances in speech recognition technology, speech recognizers still perform much worse than humans. The difference in performance between humans and machines is most dramatic when variable amounts and types of filtering and noise are present during testing. For example, humans readily understand speech that is low-pass filtered below 3 kHz or high-pass filtered above 1 kHz. Machines trained with wide-band speech, however, degrade dramatically under these conditions. An approach to compensate for variable unknown sharp filtering and noise is presented which uses mel-filter-bank magnitudes as input features, estimates the signal-to-noise ratio (SNR) for each filter, and uses missing feature theory to dynamically modify the probability computations performed using Gaussian mixture or radial basis function neural network classifiers embedded within hidden Markov model (HMM) recognizers. The approach was successfully demonstrated using a talker-independent digit recognition task. It was found that recognition accuracy across many conditions rises from below 50% to above 95% with this approach. These promising results suggest future work to dynamically estimate SNRs and to explore the dynamics of human adaptation to channel and noise variability.
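For a diagonal-covariance Gaussian, the missing-feature idea of marginalizing out unreliable dimensions reduces to dropping their terms from the log-likelihood. The following is a minimal sketch under that assumption (the function name and the hard SNR threshold are illustrative; the paper's system estimates per-filter SNRs dynamically rather than using a fixed cutoff):

```python
import numpy as np

def masked_log_likelihood(x, mean, var, snr_db, snr_threshold_db=0.0):
    """Diagonal-Gaussian log-likelihood over 'reliable' features only.

    Features whose estimated per-filter SNR falls below the threshold
    are treated as missing and marginalized out, which for a diagonal
    Gaussian simply means omitting their terms from the sum.
    """
    reliable = snr_db >= snr_threshold_db
    d = x[reliable] - mean[reliable]
    v = var[reliable]
    return -0.5 * np.sum(np.log(2 * np.pi * v) + d * d / v)
```

Inside an HMM recognizer, each state's observation log-probability would be computed this way per frame, so heavily corrupted filter channels stop dragging down the scores of otherwise well-matched models.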

Recognition by humans and machines: miles to go before we sleep

Published in:
Speech Commun., Vol. 18, No. 3, 1996, pp. 247-248.

Summary

Bourlard and his colleagues note that much effort over the past few years has focused on creating large-vocabulary speech recognition systems and reducing error rates measured using clean speech materials. This has led to experimental talker-independent systems with vocabularies of 65,000 words capable of transcribing sentences on a limited set of topics. Instead of comparing these systems to each other, or to last year's performance, this short comment compares them to human listeners, who are the ultimate users and judges of speech recognizers.

Military and government applications of human-machine communication by voice

Published in:
Proc. Natl. Acad. Sci., Vol. 92, October 1995, pp. 10011-10016.

Summary

This paper describes a range of opportunities for military and government applications of human-machine communication by voice, based on visits and contacts with numerous user organizations in the United States. The applications include some that appear to be feasible by careful integration of current state-of-the-art technology and others that will require a varying mix of advances in speech technology and in integration of the technology into applications environments. Applications that are described include (1) speech recognition and synthesis for mobile command and control; (2) speech processing for a portable multifunction soldier's computer; (3) speech- and language-based technology for naval combat team tactical training; (4) speech technology for command and control on a carrier flight deck; (5) control of auxiliary systems, and alert and warning generation, in fighter aircraft and helicopters; and (6) voice check-in, report entry, and communication for law enforcement agents or special forces. A phased approach for transfer of the technology into applications is advocated, where integration of applications systems is pursued in parallel with advanced research to meet future needs.

A comparison of signal processing front ends for automatic word recognition

Published in:
IEEE Trans. Speech Audio Process., Vol. 3, No. 4, July 1995, pp. 286-293.

Summary

This paper compares the word error rate of a speech recognizer using several signal processing front ends based on auditory properties. Front ends were compared with a control mel filter bank (MFB) based cepstral front end in clean speech and with speech degraded by noise and spectral variability, using the TI-105 isolated word database. MFB recognition error rates ranged from 0.5% to 3.1%, and the reduction in error rates provided by auditory models was less than 0.5 percentage points. Some earlier studies that demonstrated considerably more improvement with auditory models used linear predictive coding (LPC) based control front ends. This paper shows that MFB cepstra significantly outperform LPC cepstra under noisy conditions. Techniques using an optimal linear combination of features for data reduction were also evaluated.
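The MFB cepstral front end referenced here follows the standard recipe: power spectrum, triangular mel-spaced filter bank, log, then a DCT of the log energies. The sketch below illustrates that pipeline under assumed parameter values (filter count, cepstral order, and sample rate are illustrative, not taken from the paper):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    # Triangular filters with center frequencies evenly spaced on the mel scale
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sample_rate).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):           # rising edge
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):          # falling edge
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mfb_cepstra(frame, n_filters=20, n_cep=12, sample_rate=8000):
    """Mel-filter-bank cepstra for one windowed frame of speech."""
    n_fft = len(frame)
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    fb = mel_filterbank(n_filters, n_fft, sample_rate)
    log_energies = np.log(fb @ spectrum + 1e-10)
    # DCT-II of the log filter-bank energies yields the cepstral coefficients
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_cep), n + 0.5) / n_filters)
    return dct @ log_energies
```

The DCT decorrelates the log filter-bank energies, which is what makes the resulting cepstra a good fit for the diagonal-covariance Gaussian models typical of HMM recognizers.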

Wordspotter training using figure-of-merit back propagation

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, Vol. 1, Speech Processing, 19-22 April 1994, pp. 389-392.

Summary

A new approach to wordspotter training is presented which directly maximizes the Figure of Merit (FOM), defined as the average detection rate over a specified range of false alarm rates. This systematic approach to discriminant training for wordspotters eliminates the necessity of ad hoc thresholds and tuning. It improves the FOM of wordspotters tested using cross-validation on the credit-card speech corpus training conversations by 4 to 5 percentage points to roughly 70%. This improved performance requires little extra complexity during wordspotting and only two extra passes through the training data during training. The FOM gradient is computed analytically for each putative hit, back-propagated through HMM word models using the Viterbi alignment, and used to adjust RBF hidden node centers and state-weights associated with every node in HMM keyword models.
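The FOM metric itself, before any gradient computation, can be sketched as follows: sort putative hits by score and average the detection rate observed at each of the first few false alarms. This is a simplified illustration (the function name is assumed, and the standard FOM is measured over false alarms per keyword per hour, which this sketch approximates with a fixed false-alarm count):

```python
import numpy as np

def figure_of_merit(scores, labels, n_fa_points=10):
    """Average detection rate over the first n_fa_points false alarms.

    scores: detector outputs for putative hits.
    labels: 1 for a true keyword occurrence, 0 for a false alarm.
    Putative hits are ranked by score; the fraction of true keywords
    detected so far is recorded at each false alarm and averaged.
    """
    order = np.argsort(scores)[::-1]
    ranked = np.asarray(labels)[order]
    n_true = ranked.sum()
    hits = false_alarms = 0
    rates = []
    for lab in ranked:
        if lab == 1:
            hits += 1
        else:
            false_alarms += 1
            rates.append(hits / n_true)
            if false_alarms == n_fa_points:
                break
    return float(np.mean(rates)) if rates else 1.0
```

Because this quantity is a function of the ranked scores, the paper's contribution is to differentiate a smoothed version of it with respect to the model parameters and back-propagate that gradient through the HMM/RBF wordspotter.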

Demonstrations and applications of spoken language technology: highlights and perspectives from the 1993 ARPA Spoken Language Technology and Applications Day

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, Vol. 1, Speech Processing, 19-22 April 1994, pp. 337-340.

Summary

The ARPA Spoken Language Technology and Applications Day (SLTA'93) was a special workshop which presented a set of live, state-of-the-art demonstrations of speech recognition and spoken language understanding systems. The purpose of this paper is to provide perspective on these demonstrations and the applications they can enable, and to review the application opportunities and needs cited by panelists and other members of the user community.