Publications

Speaker recognition from coded speech and the effects of score normalization

Published in:
Proc. Thirty-Fifth Asilomar Conf. on Signals, Systems and Computers, Vol. 2, 4-7 November 2001, pp. 1562-1567.

Summary

We investigate the effect of speech coding on automatic speaker recognition when training and testing conditions are matched and mismatched. Experiments used standard speech coding algorithms (GSM, G.729, G.723, MELP) and a speaker recognition system based on Gaussian mixture models adapted from a universal background model. There is little loss in recognition performance for toll-quality speech coders and slightly more loss when lower-quality speech coders are used. Speaker recognition from coded speech using handset-dependent score normalization and test score normalization is examined. Both types of score normalization significantly improve performance and can eliminate the performance loss that occurs when there is a mismatch between training and testing conditions.
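As a concrete illustration of the test score normalization examined above, a raw verification score can be standardized against the scores the same test utterance produces on a cohort of impostor models. A minimal sketch, with purely illustrative cohort values (not from the paper):

```python
import numpy as np

def tnorm(raw_score, cohort_scores):
    """Test score normalization: standardize a trial's raw score by the
    mean and standard deviation of the same test utterance scored
    against a cohort of impostor models."""
    cohort_scores = np.asarray(cohort_scores, dtype=float)
    return (raw_score - cohort_scores.mean()) / cohort_scores.std()

# Illustrative numbers: a target trial that scores well above the cohort.
cohort = [-0.8, -1.1, -0.9, -1.0, -1.2]
normalized = tnorm(0.5, cohort)  # large positive value -> accept
```

Handset-dependent normalization (hnorm) follows the same standardization idea, but estimates the mean and standard deviation per handset type from impostor utterances at model-training time rather than per test utterance.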

Speaker recognition from coded speech in matched and mismatched conditions

Published in:
Proc. 2001: A Speaker Odyssey, The Speaker Recognition Workshop, 18-22 June 2001, pp. 115-120.

Summary

We investigate the effect of speech coding on automatic speaker recognition when training and testing conditions are matched and mismatched. Experiments use standard speech coding algorithms (GSM, G.729, G.723, MELP) and a speaker recognition system based on Gaussian mixture models adapted from a universal background model. There is little loss in recognition performance for toll-quality speech coders and slightly more loss when lower-quality speech coders are used. Speaker recognition from coded speech using handset-dependent score normalization is examined, and we find that it significantly improves performance, particularly when there is a mismatch between training and testing conditions.

Estimation of handset nonlinearity with application to speaker recognition

Published in:
IEEE Trans. Speech Audio Process., Vol. 8, No. 5, September 2000, pp. 567-584.

Summary

A method is described for estimating telephone handset nonlinearity by matching the spectral magnitude of the distorted signal to the output of a nonlinear channel model, driven by an undistorted reference. This "magnitude-only" representation allows the model to directly match unwanted speech formants that arise over nonlinear channels and that are a potential source of degradation in speaker and speech recognition algorithms. As such, the method is particularly suited to algorithms that use only spectral magnitude information. The distortion model consists of a memoryless nonlinearity sandwiched between two finite-length linear filters. Nonlinearities considered include arbitrary finite-order polynomials and parametric sigmoidal functionals derived from a carbon-button handset model. Minimization of a mean-squared spectral magnitude distance with respect to model parameters relies on iterative estimation via a gradient descent technique. Initial work has demonstrated the importance of addressing handset nonlinearity, in addition to linear distortion, in speaker recognition over telephone channels. A nonlinear handset "mapping," applied to training or testing data to reduce mismatch between different types of handset microphone outputs, improves speaker verification performance relative to linear compensation only. Finally, a method is proposed to merge the mapper strategy with a method of likelihood score normalization (hnorm) for further mismatch reduction and speaker verification performance improvement.
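The forward direction of the distortion model described above can be sketched as follows. The filter taps and polynomial coefficients here are illustrative placeholders; in the paper they are estimated by gradient descent on the mean-squared spectral magnitude distance.

```python
import numpy as np

def handset_channel(x, pre_taps, poly_coeffs, post_taps):
    """Forward distortion model: an FIR pre-filter, then a memoryless
    polynomial nonlinearity a0 + a1*y + a2*y^2 + ..., then an FIR
    post-filter. Coefficient values here are placeholders."""
    y = np.convolve(x, pre_taps, mode="same")       # input linear filter
    y = np.polyval(poly_coeffs[::-1], y)            # memoryless nonlinearity
    return np.convolve(y, post_taps, mode="same")   # output linear filter

# With identity filters and a mild quadratic term, the channel bends the
# waveform sample by sample: y + 0.2*y^2.
x = np.array([0.5, -0.5, 1.0])
y = handset_channel(x, [1.0], [0.0, 1.0, 0.2], [1.0])
```

Setting the polynomial to the identity (a0 = 0, a1 = 1) reduces the channel to purely linear distortion, the baseline the paper compares against.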

Speaker recognition using G.729 speech codec parameters

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, Vol. II, 5-9 June 2000, pp. 1089-1092.

Summary

Experiments in Gaussian-mixture-model speaker recognition from mel-filter bank energies (MFBs) of the G.729 codec all-pole spectral envelope showed significant performance loss relative to the standard mel-cepstral coefficients of G.729 synthesized (coded) speech. In this paper, we investigate two approaches to recover speaker recognition performance from G.729 parameters, rather than deriving cepstra from MFBs of an all-pole spectrum. Specifically, the G.729 LSFs are converted to "direct" cepstral coefficients, for which there exists a one-to-one correspondence with the LSFs. The G.729 residual is also considered; in particular, appending G.729 pitch as a single parameter to the direct cepstral coefficients gives further performance gain. The second, nonparametric approach uses the original MFB paradigm, but adds harmonic striations to the G.729 all-pole spectral envelope. Although these methods obtain considerable performance gains, we have yet to match the performance of G.729 synthesized speech, motivating the need to represent additional fine structure of the G.729 residual.
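For context, "direct" cepstral coefficients are obtained from LPC coefficients (to which the G.729 LSFs correspond one-to-one) via the standard LPC-to-cepstrum recursion. A minimal sketch, checked on a toy single-pole model rather than actual G.729 parameters:

```python
import numpy as np

def lpc_to_cepstrum(a, n_cep):
    """Standard recursion from LPC coefficients to the cepstrum of the
    all-pole model 1/A(z), with A(z) = 1 + sum_k a[k-1] z^-k."""
    p = len(a)
    c = np.zeros(n_cep)
    for n in range(1, n_cep + 1):
        acc = -a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc -= (k / n) * c[k - 1] * a[n - k - 1]
        c[n - 1] = acc
    return c

# Toy check: a single pole at r = 0.5, i.e., A(z) = 1 - 0.5 z^-1.
c = lpc_to_cepstrum(np.array([-0.5]), 3)
```

For the single pole 1/(1 - r z^-1) the recursion reproduces the known cepstrum c_n = r^n / n, which is a convenient correctness check.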

Approaches to speaker detection and tracking in conversational speech

Published in:
Digit. Signal Process., Vol. 10, No. 1-3, January/April/July 2000, pp. 93-112. (Fifth Annual NIST Speaker Recognition Workshop, 3-4 June 1999.)

Summary

Two approaches to detecting and tracking speakers in multispeaker audio are described. Both approaches use an adapted Gaussian mixture model, universal background model (GMM-UBM) speaker detection system as the core speaker recognition engine. In one approach, the individual log-likelihood ratio scores, which are produced on a frame-by-frame basis by the GMM-UBM system, are used first to partition the speech file into speaker-homogeneous regions and then to create scores for these regions. We refer to this approach as internal segmentation. The other approach uses an external segmentation algorithm, based on blind clustering, to partition the speech file into speaker-homogeneous regions. The adapted GMM-UBM system then scores each of these regions as in the single-speaker recognition case. We show that the external segmentation system outperforms the internal segmentation system for both detection and tracking. In addition, we show how different components of the detection and tracking algorithms contribute to the overall system performance.
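The internal-segmentation idea can be sketched as smoothing the frame-by-frame log-likelihood ratio scores and thresholding the result into speaker-homogeneous regions; the window length, threshold, and synthetic scores below are hypothetical, not the paper's settings:

```python
import numpy as np

def segment_by_llr(frame_llrs, win=50, thresh=0.0):
    """Sketch of internal segmentation: moving-average the per-frame
    GMM-UBM log-likelihood ratios, then mark frames whose smoothed
    score exceeds a threshold as target-speaker frames."""
    kernel = np.ones(win) / win
    smoothed = np.convolve(frame_llrs, kernel, mode="same")
    return smoothed > thresh

# Synthetic scores: 200 target-like frames followed by 200 impostor-like
# frames, each with additive noise.
rng = np.random.default_rng(0)
llrs = np.concatenate([1.0 + 0.5 * rng.standard_normal(200),
                       -1.0 + 0.5 * rng.standard_normal(200)])
mask = segment_by_llr(llrs)
```

Contiguous runs of True frames form the speaker-homogeneous regions that are then rescored as in the single-speaker case.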

Speaker verification using adapted Gaussian mixture models

Published in:
Digit. Signal Process., Vol. 10, No. 1-3, January/April/July 2000, pp. 19-41. (Fifth Annual NIST Speaker Recognition Workshop, 3-4 June 1999.)

Summary

In this paper we describe the major elements of MIT Lincoln Laboratory's Gaussian mixture model (GMM)-based speaker verification system used successfully in several NIST Speaker Recognition Evaluations (SREs). The system is built around the likelihood ratio test for verification, using simple but effective GMMs for likelihood functions, a universal background model (UBM) for alternative speaker representation, and a form of Bayesian adaptation to derive speaker models from the UBM. The development and use of a handset detector and score normalization to greatly improve verification performance are also described and discussed. Finally, representative performance benchmarks and system behavior experiments on NIST SRE corpora are presented.
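The Bayesian adaptation step that derives speaker models from the UBM is, in its most common form, relevance-MAP adaptation of the mixture means. A minimal sketch under that assumption (means-only adaptation, diagonal covariances, an illustrative relevance factor):

```python
import numpy as np

def map_adapt_means(ubm_means, ubm_weights, ubm_vars, data, r=16.0):
    """Relevance-MAP adaptation of GMM means from a UBM: compute soft
    counts and first moments per mixture, then interpolate each UBM
    mean toward the data mean with a data-dependent coefficient."""
    # Per-frame posterior responsibility of each mixture (diagonal Gaussians).
    log_probs = []
    for w, m, v in zip(ubm_weights, ubm_means, ubm_vars):
        ll = -0.5 * (np.sum(np.log(2 * np.pi * v))
                     + np.sum((data - m) ** 2 / v, axis=1))
        log_probs.append(np.log(w) + ll)
    log_probs = np.array(log_probs)            # (mixtures, frames)
    log_probs -= log_probs.max(axis=0)
    post = np.exp(log_probs)
    post /= post.sum(axis=0)
    n = post.sum(axis=1)                        # soft counts per mixture
    Ex = (post @ data) / np.maximum(n[:, None], 1e-10)  # first moments
    alpha = n / (n + r)                         # adaptation coefficients
    return alpha[:, None] * Ex + (1 - alpha[:, None]) * ubm_means

# Toy UBM with two mixtures; enrollment frames cluster near the first.
ubm_means = np.array([[0.0, 0.0], [5.0, 5.0]])
ubm_weights = np.array([0.5, 0.5])
ubm_vars = np.ones((2, 2))
frames = np.full((100, 2), 1.0)
adapted = map_adapt_means(ubm_means, ubm_weights, ubm_vars, frames)
```

Mixtures with many soft counts move toward the enrollment data, while rarely visited mixtures stay at their UBM values, which is what makes the adapted model robust to limited enrollment speech.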

Estimation of modulation based on FM-to-AM transduction: two-sinusoid case

Published in:
IEEE Trans. Signal Process., Vol. 47, No. 11, November 1999, pp. 3084-3097.

Summary

A method is described for estimating the amplitude modulation (AM) and the frequency modulation (FM) of the components of a signal that consists of two AM-FM sinusoids. The approach is based on the transduction of FM to AM that occurs whenever a signal of varying frequency passes through a filter with a nonflat frequency response. The objective is to separate the AM and FM of the sinusoids from the amplitude envelopes of the output of two transduction filters, in which the AM and FM are nonlinearly combined. An existing scheme for AM-FM estimation of a single AM-FM sinusoid is first refined by iteratively inverting the AM and FM estimates to reduce error introduced in transduction. The transduction filter pair is designed relying on both a time- and frequency-domain characterization of transduction error. The approach is then extended to the case of two AM-FM sinusoids by essentially reducing the problem to two single-component AM-FM estimation problems. By exploiting the beating in the amplitude envelope of each filter output due to the two-sinusoidal input, a closed-form solution is obtained. This solution is further improved by iterative refinement. The AM-FM estimation methods are evaluated through an error analysis and are illustrated for a wide range of AM-FM signals.
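The transduction principle itself is easy to demonstrate: passing an FM sinusoid through a filter whose magnitude response slopes with frequency converts the frequency variation into an amplitude envelope. A toy sketch with an ideal linear-slope filter (not the paper's designed filter pair):

```python
import numpy as np

def analytic_envelope(x):
    """Magnitude of the analytic signal via an FFT-based Hilbert transform."""
    N = len(x)
    X = np.fft.fft(x)
    h = np.zeros(N)
    h[0] = 1
    h[1:N // 2] = 2
    h[N // 2] = 1
    return np.abs(np.fft.ifft(X * h))

fs = 8000
t = np.arange(fs) / fs
# FM sinusoid: 1000 Hz carrier with a slow 2 Hz, +/-200 Hz frequency sweep.
f_inst = 1000 + 200 * np.sin(2 * np.pi * 2 * t)
phase = 2 * np.pi * np.cumsum(f_inst) / fs
x = np.cos(phase)
# Transduction filter whose magnitude rises linearly with frequency:
# the FM is converted into AM of the output.
X = np.fft.rfft(x)
freqs = np.fft.rfftfreq(len(x), 1 / fs)
H = freqs / freqs.max()
y = np.fft.irfft(X * H, len(x))
env = analytic_envelope(y)
```

The envelope of the filtered signal rises and falls with the instantaneous frequency; recovering the AM and FM separately from two such envelopes is the separation problem the paper addresses.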

Shunting networks for multi-band AM-FM decomposition

Published in:
Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 17-20 October 1999.

Summary

We describe a transduction-based, neurodynamic approach to estimating the amplitude-modulated (AM) and frequency-modulated (FM) components of a signal. We show that the transduction approach can be realized as a bank of constant-Q bandpass filters followed by envelope detectors and shunting neural networks, and the resulting dynamical system is capable of robust AM-FM estimation. Our model is consistent with recent psychophysical experiments that indicate AM and FM components of acoustic signals may be transformed into a common neural code in the brain stem via FM-to-AM transduction. The shunting network for AM-FM decomposition is followed by a contrast enhancement shunting network that provides a mechanism for robustly selecting auditory filter channels as the FM of an input stimulus sweeps across the multiple filters. The AM-FM output of the shunting networks may provide a robust feature representation and is being considered for applications in signal recognition and multi-component decomposition problems.
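A shunting network node obeys a membrane-style equation whose activity remains bounded however strong the inputs. A minimal sketch with illustrative constants (the multi-band, contrast-enhancement architecture is not reproduced here):

```python
def shunting_step(x, excit, inhib, A=1.0, B=1.0, C=0.5, dt=0.01):
    """One Euler step of the shunting equation
         dx/dt = -A*x + (B - x)*excit - (x + C)*inhib,
    whose activity x stays within [-C, B]; at equilibrium
    x = (B*excit - C*inhib) / (A + excit + inhib)."""
    return x + dt * (-A * x + (B - x) * excit - (x + C) * inhib)

# Drive a node with constant excitation and inhibition until it settles.
x = 0.0
for _ in range(5000):
    x = shunting_step(x, excit=2.0, inhib=0.5)
```

Because the excitatory and inhibitory inputs multiply (B - x) and (x + C), the equilibrium is a ratio of inputs rather than a sum, which is what gives the network its automatic gain control.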

Speaker and language recognition using speech codec parameters

Summary

In this paper, we investigate the effect of speech coding on speaker and language recognition tasks. Three coders were selected to cover a wide range of quality and bit rates: GSM at 12.2 kb/s, G.729 at 8 kb/s, and G.723.1 at 5.3 kb/s. Our objective is to measure recognition performance from either the synthesized speech or directly from the coder parameters themselves. We show that, using speech synthesized from the three codecs, GMM-based speaker verification and phone-based language recognition performance generally degrades with coder bit rate, i.e., from GSM to G.729 to G.723.1, relative to an uncoded baseline. In addition, speaker verification for all codecs shows a performance decrease as the degree of mismatch between training and testing conditions increases, while language recognition exhibits no such decrease. We also present initial results in determining the relative importance of codec system components in their direct use for recognition tasks. For the G.729 codec, it is shown that removal of the postfilter in the decoder helps speaker verification performance under the mismatched condition. On the other hand, with use of G.729 LSF-based mel-cepstra, performance decreases under all conditions, indicating the need for a residual contribution to the feature representation.

Modeling of the glottal flow derivative waveform with application to speaker identification

Published in:
IEEE Trans. Speech Audio Process., Vol. 7, No. 5, September 1999, pp. 569-586.

Summary

An automatic technique for estimating and modeling the glottal flow derivative source waveform from speech, and applying the model parameters to speaker identification, is presented. The estimate of the glottal flow derivative is decomposed into coarse structure, representing the general flow shape, and fine structure, comprising aspiration and other perturbations in the flow, from which model parameters are obtained. The glottal flow derivative is estimated using an inverse filter determined within a time interval of vocal-fold closure that is identified through differences in formant frequency modulation during the open and closed phases of the glottal cycle. This formant motion is predicted by Ananthapadmanabha and Fant to be a result of time-varying and nonlinear source/vocal tract coupling within a glottal cycle. The glottal flow derivative estimate is modeled using the Liljencrants-Fant model to capture its coarse structure, while the fine structure of the flow derivative is represented through energy and perturbation measures. The model parameters are used in a Gaussian mixture model speaker identification (SID) system. Both coarse- and fine-structure glottal features are shown to contain significant speaker-dependent information. For a large TIMIT database subset, averaging over male and female SID scores, the coarse-structure parameters achieve about 60% accuracy, the fine-structure parameters give about 40% accuracy, and their combination yields about 70% correct identification. Finally, in preliminary experiments on the counterpart telephone-degraded NTIMIT database, about a 5% error reduction in SID scores is obtained when source features are combined with traditional mel-cepstral measures.
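The coarse structure of the glottal flow derivative is captured by the Liljencrants-Fant (LF) model. A simplified sketch of its two-phase shape, with illustrative timing parameters; a full LF fit also enforces the zero-net-flow (area balance) constraint, which is omitted here:

```python
import numpy as np

def lf_flow_derivative(fs=8000, T0=0.01, Tp=0.004, Te=0.006, Ta=0.0005,
                       Ee=1.0, alpha=600.0):
    """Simplified LF glottal flow derivative over one cycle: a growing
    exponential times a sinusoid up to the main excitation at Te (peak
    flow at Tp), then an exponential return phase of time constant Ta.
    Parameter values are illustrative, not fitted to speech."""
    t = np.arange(int(T0 * fs)) / fs
    wg = np.pi / Tp                                      # shape frequency
    E0 = -Ee / (np.exp(alpha * Te) * np.sin(wg * Te))    # so E(Te) = -Ee
    open_phase = E0 * np.exp(alpha * t) * np.sin(wg * t)
    return_phase = -Ee * np.exp(-(t - Te) / Ta)
    return t, np.where(t < Te, open_phase, return_phase)

t, e = lf_flow_derivative()
```

The Tp/Te/Ta timing and the excitation strength Ee are the kind of coarse-structure parameters summarized for each glottal cycle, while aspiration and perturbation measures capture the fine structure.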