Publications

Blind clustering of speech utterances based on speaker and language characteristics

Published in:
5th Int. Conf. Spoken Language Processing (ICSLP), 30 November - 4 December 1998.

Summary

Classical speaker and language recognition techniques can be applied to the classification of unknown utterances by computing the likelihoods of the utterances given a set of well-trained target models. This paper addresses the problem of grouping unknown utterances when no information is available regarding the speaker or language classes or even the total number of classes. Approaches to blind message clustering are presented based on conventional hierarchical clustering techniques and an integrated cluster generation and selection method called the d* algorithm. Results are presented using message sets derived from the Switchboard and CallFriend corpora. Potential applications include automatic indexing of recorded speech corpora by speaker/language tags and automatic or semiautomatic selection of speaker-specific speech utterances for speaker recognition adaptation.
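
A rough illustration of the likelihood-based clustering idea described above: the sketch below trains a small Gaussian mixture model on each utterance, turns cross-utterance likelihoods into a symmetric distance, and applies a conventional agglomerative (hierarchical) clustering. The feature extraction, model sizes, and distance definition are illustrative assumptions, not the configuration or the d* algorithm reported in the paper.

    # Sketch: hierarchical clustering of utterances from cross-model likelihoods.
    # Model sizes and the distance definition are illustrative assumptions.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform
    from sklearn.mixture import GaussianMixture

    def cluster_utterances(features, n_clusters=2, n_components=4):
        """features: list of (n_frames, n_dims) arrays, one per utterance."""
        # Train one small GMM per utterance.
        models = [GaussianMixture(n_components, covariance_type="diag",
                                  random_state=0).fit(f) for f in features]
        n = len(features)
        dist = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                # Symmetric cross-likelihood distance: how much worse each
                # utterance is scored by the other's model than by its own.
                d = (models[i].score(features[i]) - models[j].score(features[i])
                     + models[j].score(features[j]) - models[i].score(features[j]))
                dist[i, j] = dist[j, i] = max(d, 0.0)
        # Conventional average-link hierarchical clustering on the distances.
        tree = linkage(squareform(dist), method="average")
        return fcluster(tree, t=n_clusters, criterion="maxclust")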

Speech recognition by humans and machines under conditions with severe channel variability and noise

Published in:
SPIE, Vol. 3077, Applications and Science of Artificial Neural Networks III, 21-24 April 1997, pp. 46-57.

Summary

Despite dramatic recent advances in speech recognition technology, speech recognizers still perform much worse than humans. The difference in performance between humans and machines is most dramatic when variable amounts and types of filtering and noise are present during testing. For example, humans readily understand speech that is low-pass filtered below 3 kHz or high-pass filtered above 1 kHz. Machines trained with wide-band speech, however, degrade dramatically under these conditions. An approach to compensate for variable unknown sharp filtering and noise is presented which uses mel-filter-bank magnitudes as input features, estimates the signal-to-noise ratio (SNR) for each filter, and uses missing feature theory to dynamically modify the probability computations performed using Gaussian mixture or radial basis function neural network classifiers embedded within hidden Markov model (HMM) recognizers. The approach was successfully demonstrated using a talker-independent digit recognition task. It was found that recognition accuracy across many conditions rises from below 50% to above 95% with this approach. These promising results suggest future work to dynamically estimate SNRs and to explore the dynamics of human adaptation to channel and noise variability.
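
The missing-feature computation described above can be illustrated, under simplifying assumptions, for a single diagonal-covariance Gaussian mixture: mel-filter-bank channels whose estimated SNR falls below a threshold are treated as unreliable and dropped (marginalized out) from the likelihood. The SNR threshold, parameter shapes, and the fallback for an all-unreliable frame are assumptions for illustration; in the paper this computation sits inside HMM-based recognizers.

    # Sketch: per-frame GMM log-likelihood with low-SNR filter-bank channels
    # marginalized out. Threshold and fallback value are illustrative assumptions.
    import numpy as np

    def masked_gmm_loglik(x, snr_db, weights, means, variances, snr_floor_db=0.0):
        """x, snr_db: (n_dims,) mel-filter-bank features and per-filter SNRs.
        weights: (n_mix,); means, variances: (n_mix, n_dims) diagonal GMM."""
        reliable = snr_db >= snr_floor_db        # mask of usable filter channels
        if not reliable.any():
            return 0.0                           # no evidence: neutral score
        xr = x[reliable]
        m = means[:, reliable]
        v = variances[:, reliable]
        # log N(xr; m, diag(v)) per mixture component, reliable dimensions only.
        log_comp = -0.5 * (np.sum(np.log(2.0 * np.pi * v), axis=1)
                           + np.sum((xr - m) ** 2 / v, axis=1))
        # Log-sum-exp over mixture components.
        return float(np.logaddexp.reduce(np.log(weights) + log_comp))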

Automated English-Korean translation for enhanced coalition communications

Summary

This article describes our progress on automated, two-way English-Korean translation of text and speech for enhanced military coalition communications. Our goal is to improve multilingual communications by producing accurate translations across a number of languages. Therefore, we have chosen an interlingua-based approach to machine translation that readily extends to multiple languages. In this approach, a natural-language-understanding system transforms the input into an intermediate-meaning representation called a semantic frame, which serves as the basis for generating output in multiple languages. To produce useful, accurate, and effective translation systems in the short term, we have focused on limited military-task domains, and have configured our system as a translator's aid so that the human translator can confirm or edit the machine translation. We have obtained promising results in translation of telegraphic military messages in a naval domain, and have successfully extended the system to additional military domains. The system has been demonstrated in a coalition exercise and at Combined Forces Command in the Republic of Korea. From these demonstrations we learned that the system must be robust enough to handle new inputs, which is why we have developed a multistage robust translation strategy, including a part-of-speech tagging technique to handle new words, and a fragmentation strategy for handling complex sentences. Our current work emphasizes ongoing development of these robust translation techniques and extending the translation system to application domains of interest to users in the military coalition environment in the Republic of Korea.
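
As a purely illustrative sketch of the interlingua idea, the snippet below stores one sentence as a toy semantic frame and generates English and Korean surface forms from that single language-neutral structure. The frame fields and one-rule generators are hypothetical and far simpler than the article's natural-language-understanding and generation components.

    # Sketch: a toy semantic frame and per-language generators. The fields and
    # templates below are hypothetical, not the article's representation.
    from dataclasses import dataclass

    @dataclass
    class SemanticFrame:
        predicate: str    # e.g. "request"
        agent: str        # who performs the action
        theme: str        # what the action concerns

    def generate(frame: SemanticFrame, language: str) -> str:
        """Produce a surface sentence in the requested language from one frame."""
        if language == "en":
            return f"{frame.agent} requests {frame.theme}."
        if language == "ko":  # toy subject-object-verb ordering only
            return f"{frame.agent}이(가) {frame.theme}을(를) 요청한다."
        raise ValueError(f"no generator for {language}")

    frame = SemanticFrame(predicate="request", agent="HQ", theme="the status report")
    print(generate(frame, "en"))
    print(generate(frame, "ko"))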

Unsupervised topic clustering of Switchboard speech messages

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, Vol. 1, 7-10 May 1996, pp. 315-318.

Summary

This paper presents a statistical technique which can be used to automatically group speech data records based on the similarity of their content. A tree-based clustering algorithm is used to generate a hierarchical structure for the corpus. This structure can then be used to guide the search for similar material in data from other corpora. The SWITCHBOARD Speech Corpus was used to demonstrate these techniques, since it provides sets of speech files which are nominally on the same topic. Excellent automatic clustering was achieved on the truth text transcripts provided with the SWITCHBOARD corpus, with an average cluster purity of 97.3%. Degraded clustering was achieved using the output transcriptions of a speech recognizer, with a clustering purity of 61.4%.
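
A minimal sketch of the clustering-and-scoring setup described above: transcripts are vectorized, grouped by an agglomerative (tree-based) clustering, and scored with the cluster-purity measure quoted in the abstract. TF-IDF features and Ward linkage are stand-in assumptions, not the paper's statistical technique.

    # Sketch: topic clustering of transcripts plus the cluster-purity score.
    # TF-IDF features and Ward linkage are illustrative stand-ins.
    import numpy as np
    from sklearn.cluster import AgglomerativeClustering
    from sklearn.feature_extraction.text import TfidfVectorizer

    def cluster_transcripts(transcripts, n_topics):
        """Group transcript strings into n_topics clusters by word content."""
        X = TfidfVectorizer(stop_words="english").fit_transform(transcripts)
        return AgglomerativeClustering(n_clusters=n_topics).fit_predict(X.toarray())

    def purity(assigned, true_topics):
        """Fraction of messages whose cluster's majority topic is their own topic."""
        assigned = np.asarray(assigned)
        true_topics = np.asarray(true_topics)
        correct = 0
        for c in np.unique(assigned):
            _, counts = np.unique(true_topics[assigned == c], return_counts=True)
            correct += counts.max()
        return correct / len(assigned)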

The effects of telephone transmission degradations on speaker recognition performance

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, Vol. 1, Speech, 9-12 May 1995, pp. 329-332.

Summary

The two largest factors affecting automatic speaker identification performance are the size of the population and the degradations introduced by noisy communication channels (e.g., telephone transmission). To examine these two factors experimentally, this paper presents text-independent speaker identification results for varying speaker population sizes up to 630 speakers for both clean, wideband speech and telephone speech. A system based on Gaussian mixture speaker identification is used, and experiments are conducted on the TIMIT and NTIMIT databases. These are believed to be the first speaker identification experiments on the complete 630-speaker TIMIT and NTIMIT databases and the largest text-independent speaker identification task reported to date. Identification accuracies of 99.5% and 60.7% are achieved on the TIMIT and NTIMIT databases, respectively. This paper also presents experiments which examine and attempt to quantify the performance loss associated with various telephone degradations by systematically degrading the TIMIT speech in a manner consistent with measured NTIMIT degradations and measuring the performance loss at each step. It is found that the standard degradations of filtering and additive noise do not account for all of the performance gap between the TIMIT and NTIMIT data. Measurements of nonlinear microphone distortions are also...
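
A minimal sketch of Gaussian mixture speaker identification as described above: one GMM is trained per enrolled speaker, and a test utterance is assigned to the speaker whose model gives the highest log-likelihood. Feature extraction and mixture sizes are illustrative assumptions.

    # Sketch: closed-set, text-independent speaker identification with one GMM
    # per speaker. Mixture sizes and features are illustrative assumptions.
    from sklearn.mixture import GaussianMixture

    def enroll(train_features, n_components=8):
        """train_features: dict speaker_id -> (n_frames, n_dims) cepstral features."""
        return {spk: GaussianMixture(n_components, covariance_type="diag",
                                     random_state=0).fit(feats)
                for spk, feats in train_features.items()}

    def identify(models, test_features):
        """Return the enrolled speaker whose GMM best explains the test utterance."""
        scores = {spk: model.score(test_features) for spk, model in models.items()}
        return max(scores, key=scores.get)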
