Publications


Exploring the impact of advanced front-end processing on NIST speaker recognition microphone tasks

Summary

The NIST speaker recognition evaluation (SRE) featured microphone data in the 2005-2010 evaluations. The preprocessing and use of this data have typically been performed with telephone bandwidth and quantization. Although this approach is viable, it ignores the richer properties of the microphone data: multiple channels, high-rate sampling, linear encoding, and ambient noise properties. In this paper, we explore alternate choices of preprocessing and examine their effects on speaker recognition performance. Specifically, we consider the effects of quantization, sampling rate, enhancement, and two-channel speech activity detection. Experiments on the NIST 2010 SRE interview microphone corpus demonstrate that performance can be dramatically improved with a different preprocessing chain.
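A small sketch of the quantization point above (signal and parameters are illustrative, not from the paper): telephone-style 8-bit mu-law companding discards fidelity that 16-bit linear microphone encoding retains.

```python
import numpy as np

def mu_law_roundtrip(x, mu=255.0, bits=8):
    """Compand with the mu-law curve, quantize uniformly, expand back."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)      # compress to [-1, 1]
    levels = 2 ** (bits - 1) - 1
    q = np.round(y * levels) / levels                             # uniform quantizer
    return np.sign(q) * np.expm1(np.abs(q) * np.log1p(mu)) / mu   # expand

def snr_db(ref, test):
    """Signal-to-quantization-noise ratio in dB."""
    return 10 * np.log10(np.sum(ref ** 2) / np.sum((ref - test) ** 2))

rng = np.random.default_rng(0)
x = 0.5 * np.sin(2 * np.pi * 200 * np.arange(8000) / 8000) + 0.01 * rng.standard_normal(8000)
x /= np.max(np.abs(x))

telephone = mu_law_roundtrip(x, bits=8)    # 8-bit mu-law, telephone-style
linear16 = np.round(x * 32767) / 32767     # 16-bit linear, microphone-style

snr_tel = snr_db(x, telephone)
snr_lin = snr_db(x, linear16)
```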

FY11 Line-Supported Bio-Next Program - Multi-modal Early Detection Interactive Classifier (MEDIC) for mild traumatic brain injury (mTBI) triage

Summary

The Multi-modal Early Detection Interactive Classifier (MEDIC) is a triage system designed to enable rapid assessment of mild traumatic brain injury (mTBI) when access to expert diagnosis is limited, as in a battlefield setting. MEDIC is based on supervised classification, which requires three fundamental components to function correctly: data, features, and truth. The MEDIC system can act as a data collection device in addition to being an assessment tool. Therefore, it enables a solution to one of the fundamental challenges in understanding mTBI: the lack of useful data. The vision of MEDIC is to fuse results from stimulus tests in each of four modalities (auditory, ocular, vocal, and intracranial pressure) and provide them to a classifier. With appropriate data for training, the MEDIC classifier is expected to provide an immediate decision of whether the subject has a strong likelihood of having sustained an mTBI and therefore requires an expert diagnosis from a neurologist. The tests within each modality were designed to balance the capacity of objective assessment and the maturity of the underlying technology against the ability to distinguish injured from non-injured subjects according to published results. Selection of existing modalities and underlying features represents the best available, low-cost, portable technology with a reasonable chance of success.

Investigating acoustic correlates of human vocal fold vibratory phase asymmetry through modeling and laryngeal high-speed videoendoscopy

Published in:
J. Acoust. Soc. Am., Vol. 130, No. 6, December 2011, pp. 3999-4009.

Summary

Vocal fold vibratory asymmetry is often associated with inefficient sound production through its impact on source spectral tilt. This association is investigated in both a computational voice production model and a group of 47 human subjects. The model provides indirect control over the degree of left-right phase asymmetry within a nonlinear source-filter framework, and high-speed videoendoscopy provides in vivo measures of vocal fold vibratory asymmetry. Source spectral tilt measures are estimated from the inverse-filtered spectrum of the simulated and recorded radiated acoustic pressure. As expected, model simulations indicate that increasing left-right phase asymmetry induces steeper spectral tilt. Subject data, however, reveal that none of the vibratory asymmetry measures correlates with spectral tilt measures. Probing further into physiological correlates of spectral tilt that might be affected by asymmetry, the glottal area waveform is parameterized to obtain measures of the open phase (open/plateau quotient) and closing phase (speed/closing quotient). Subjects' left-right phase asymmetry exhibits low, but statistically significant, correlations with speed quotient (r = 0.45) and closing quotient (r = -0.39). Results call for future studies into the effect of asymmetric vocal fold vibration on glottal airflow and the associated impact on voice source spectral properties and vocal efficiency.

Automatic detection of depression in speech using Gaussian mixture modeling with factor analysis

Summary

Of increasing importance in the civilian and military population is the recognition of Major Depressive Disorder at its earliest stages and intervention before the onset of severe symptoms. Toward the goal of more effective monitoring of depression severity, we investigate automatic classifiers of depression state that have the important property of mitigating nuisances due to data variability, such as speaker and channel effects, unrelated to levels of depression. To assess our measures, we use a 35-speaker free-response speech database of subjects treated for depression over a six-week duration, along with standard clinical HAMD depression ratings. Preliminary experiments indicate that by mitigating nuisances, thus focusing on depression severity as a class, we can significantly improve classification accuracy over baseline Gaussian-mixture-model-based classifiers.
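A minimal baseline sketch of the Gaussian-mixture-model classifier mentioned above, without the factor-analysis nuisance compensation (the synthetic 2-D features stand in for real speech features; the component counts and class locations are arbitrary assumptions):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic 2-D "features" for two classes (stand-ins for speech features)
X_a = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X_b = rng.normal(loc=3.0, scale=1.0, size=(200, 2))

# Baseline approach: one GMM per class, classify by which model
# assigns the higher log-likelihood to the test feature vector
gmm_a = GaussianMixture(n_components=2, random_state=0).fit(X_a)
gmm_b = GaussianMixture(n_components=2, random_state=0).fit(X_b)

def classify(x):
    x = np.atleast_2d(x)
    return "a" if gmm_a.score_samples(x)[0] > gmm_b.score_samples(x)[0] else "b"
```

Nuisance compensation along the lines the abstract describes would additionally remove speaker- and channel-dependent subspaces from the features before (or within) the GMM modeling.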

Sinewave representations of nonmodality

Summary

Regions of nonmodal phonation, exhibiting deviations from uniform glottal-pulse periods and amplitudes, occur often and convey information about speaker- and linguistic-dependent factors. Such waveforms pose challenges for speech modeling, analysis/synthesis, and processing. In this paper, we investigate the representation of nonmodal pulse trains as a sum of harmonically-related sinewaves with time-varying amplitudes, phases, and frequencies. We show that a sinewave representation of any impulsive signal is not unique and also the converse, i.e., frame-based measurements of the underlying sinewave representation can yield different impulse trains. Finally, we argue how this ambiguity may explain addition, deletion, and movement of pulses in sinewave synthesis and a specific illustrative example of time-scale modification of a nonmodal case of diplophonia.
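The harmonically-related sinewave view described above can be illustrated with a short sketch (synthetic values, not from the paper): summing equal-amplitude, zero-phase harmonics of a fixed fundamental converges toward a periodic (modal) impulse train; nonmodal pulse trains correspond to letting those amplitudes, phases, and frequencies vary over time.

```python
import numpy as np

fs = 8000            # sample rate (Hz)
f0 = 100.0           # fundamental (Hz); perfectly periodic, i.e. modal, case
dur = 0.05
t = np.arange(int(fs * dur)) / fs
n_harm = int((fs / 2) // f0)   # harmonics up to Nyquist

# Sum of harmonically related cosines with equal amplitudes and zero phases;
# this converges toward a periodic impulse train as harmonics are added.
pulse_train = sum(np.cos(2 * np.pi * k * f0 * t) for k in range(1, n_harm + 1))
pulse_train /= n_harm

# Energy concentrates at the pulse times, every 1/f0 seconds (80 samples here)
peak_samples = np.flatnonzero(pulse_train > 0.9 * pulse_train.max())
```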

Phonologically-based biomarkers for major depressive disorder

Summary

Of increasing importance in the civilian and military population is the recognition of major depressive disorder at its earliest stages and intervention before the onset of severe symptoms. Toward the goal of more effective monitoring of depression severity, we introduce vocal biomarkers that are derived automatically from phonologically-based measures of speech rate. To assess our measures, we use a 35-speaker free-response speech database of subjects treated for depression over a 6-week duration. We find that dissecting average measures of speech rate into phone-specific characteristics and, in particular, combined phone-duration measures uncovers stronger relationships between speech rate and depression severity than global measures previously reported for a speech-rate biomarker. Results of this study are supported by correlation of our measures with depression severity and classification of depression state with these vocal measures. Our approach provides a general framework for analyzing individual symptom categories through phonological units, and supports the premise that speaking rate can be an indicator of psychomotor retardation severity.
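As a toy illustration of dissecting a global speech-rate measure into phone-specific duration measures (the phone labels and durations below are invented, as if produced by a forced aligner; they are not data from the study):

```python
from collections import defaultdict

# Hypothetical phone-duration record: (phone label, duration in seconds)
phones = [("ah", 0.09), ("t", 0.05), ("iy", 0.12), ("n", 0.07), ("s", 0.10)]

total_dur = sum(d for _, d in phones)
global_rate = len(phones) / total_dur          # phones per second (global measure)

# Phone-specific mean durations: the per-phone dissection described above,
# from which combined phone-duration measures can be built
by_phone = defaultdict(list)
for label, d in phones:
    by_phone[label].append(d)
mean_dur = {label: sum(ds) / len(ds) for label, ds in by_phone.items()}
```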

A time-warping framework for speech turbulence-noise component estimation during aperiodic phonation

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, ICASSP, 22-27 May 2011, pp. 5404-5407.

Summary

The accurate estimation of turbulence noise affects many areas of speech processing, including separate modification of the noise component, analysis of the degree of aspiration for treating pathological voice, automatic labeling of speech voicing, and speaker characterization and recognition. Previous work in the literature has provided methods by which a high-quality noise-component estimate may be obtained in near-periodic speech, but these methods are known to leak aperiodic phonation (with even slight deviations from periodicity) into the noise-component estimate. In this paper, we improve upon existing algorithms in conditions of aperiodicity by introducing a time-warping-based approach to speech noise-component estimation, demonstrating the results on both natural and synthetic speech examples.
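One way to sketch the time-warping idea (a simplified stand-in, not the paper's algorithm; the cycle shapes, periods, and noise level are invented): resample each glottal cycle onto a common length so the signal becomes nearly periodic, average the aligned cycles to estimate the periodic component, warp that estimate back, and subtract to obtain the noise component.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic "aperiodic phonation": glottal-like cycles with drifting periods,
# plus additive turbulence-like noise (all values illustrative)
periods = [76, 80, 84, 79, 82, 77, 81, 83]            # samples per cycle
cycles = [np.sin(np.pi * np.arange(p) / p) ** 2 for p in periods]
clean = np.concatenate(cycles)
signal = clean + 0.05 * rng.standard_normal(len(clean))

L = 80                                                # common warped cycle length
starts = np.cumsum([0] + periods[:-1])

# Warp every cycle to length L, then average to estimate the periodic part
warped = np.stack([
    np.interp(np.linspace(0, p - 1, L), np.arange(p), signal[s:s + p])
    for s, p in zip(starts, periods)
])
template = warped.mean(axis=0)

# Warp the periodic estimate back to each cycle's length and subtract
noise_est = np.concatenate([
    signal[s:s + p] - np.interp(np.arange(p), np.linspace(0, p - 1, L), template)
    for s, p in zip(starts, periods)
])
```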

Multi-pitch estimation by a joint 2-D representation of pitch and pitch dynamics

Published in:
INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, 26-30 September 2010, pp. 645-648.

Summary

Multi-pitch estimation of co-channel speech is especially challenging when the underlying pitch tracks are close in pitch value (e.g., when pitch tracks cross). Building on our previous work, we demonstrate the utility of a two-dimensional (2-D) analysis method of speech for this problem by exploiting its joint representation of pitch and pitch-derivative information from distinct speakers. Specifically, we propose a novel multi-pitch estimation method consisting of 1) a data-driven classifier for pitch candidate selection, 2) local pitch and pitch-derivative estimation by k-means clustering, and 3) a Kalman filtering mechanism for pitch tracking and assignment. We evaluate our method on a database of all-voiced speech mixtures and illustrate its capability to estimate pitch tracks in cases where pitch tracks are separate and when they are close in pitch value (e.g., at crossings).
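A minimal sketch of step 3 above, a Kalman filter over a [pitch, pitch-derivative] state under a constant-velocity model (the noise parameters and measurement values below are illustrative assumptions, not the paper's settings):

```python
import numpy as np

dt = 0.01                                  # frame step (s)
F = np.array([[1.0, dt], [0.0, 1.0]])      # constant-velocity state transition
H = np.array([[1.0, 0.0]])                 # observe pitch only
Q = np.diag([1.0, 25.0])                   # process noise covariance
R = np.array([[4.0]])                      # measurement noise variance (Hz^2)

x = np.array([200.0, 0.0])                 # initial state: [pitch Hz, Hz/s]
P = np.eye(2) * 100.0                      # initial state covariance

track = []
for z in [201.0, 203.0, 204.5, 207.0, 208.5]:   # noisy pitch measurements
    # Predict step
    x = F @ x
    P = F @ P @ F.T + Q
    # Update step with the current pitch measurement
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + (K @ (np.array([z]) - H @ x)).ravel()
    P = (np.eye(2) - K @ H) @ P
    track.append(x[0])
```

Because the state carries the pitch derivative, two crossing tracks with different slopes remain distinguishable at the crossing point, which is the property the joint 2-D representation exploits.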

Voice production mechanisms following phonosurgical treatment of early glottic cancer

Published in:
Ann. Otol. Rhinol. Laryngol., Vol. 119, No. 1, 2010, pp. 1-9.

Summary

Although near-normal conversational voices can be achieved with the phonosurgical management of early glottic cancer, there are still acoustic and aerodynamic deficits in vocal function that must be better understood to help further optimize phonosurgical interventions. Stroboscopic assessment is inadequate for this purpose. A newly developed color high-speed videoendoscopy (HSV) system that included time-synchronized recordings of the acoustic signal was used to perform a detailed examination of voice production mechanisms in 14 subjects. Digital image processing techniques were used to quantify glottal phonatory function and to delineate relationships between vocal fold vibratory properties and acoustic perturbation measures. [not complete]

Preserving the character of perturbations in scaled pitch contours

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, ICASSP, 5 March 2010, pp. 417-420.

Summary

The global and fine dynamic components of a pitch contour in voice production, as in the speaking and singing voice, are important for both the meaning and character of an utterance. In speech, for example, slow pitch inflections, rapid pitch accents, and irregular regions all comprise the pitch contour. In applications where all components of a pitch contour are stretched or compressed in the same way, as for example in time-scale modification, an unnatural scaled contour may result. In this paper, we develop a framework for scaling pitch contours, motivated by the goal of maintaining naturalness in time-scale modification of voice. Specifically, we develop a multi-band algorithm to independently modify the slow trajectory and fast perturbation components of a contour for a more natural synthesis, and we present examples where pitch contours representative of speaking and singing voice are lengthened. In the speaking voice, the frequency content of flutter or irregularity is maintained, while slow pitch inflection is simply stretched or compressed. In the singing voice, rapid vibrato is preserved while slower note-to-note variation is scaled as desired.
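The multi-band idea can be sketched as follows: low-pass filter the contour to separate a slow trajectory from a fast perturbation, stretch only the slow part, and re-add the perturbation at its original rate. The cutoff, rates, and the simple tiling of the perturbation below are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs_c = 100.0                     # contour sampling rate (frames/s)
t = np.arange(200) / fs_c
slow = 120 + 20 * np.sin(2 * np.pi * 0.5 * t)     # slow inflection (0.5 Hz)
fast = 3 * np.sin(2 * np.pi * 6.0 * t)            # flutter/vibrato-like (6 Hz)
contour = slow + fast

# Split into slow trajectory and fast perturbation with a low-pass filter
b, a = butter(4, 2.0 / (fs_c / 2))                # ~2 Hz cutoff
slow_est = filtfilt(b, a, contour)
fast_est = contour - slow_est

# Lengthen by 2x: stretch only the slow component, then re-add the
# perturbation at its original rate so its frequency content is preserved
factor = 2.0
t_new = np.arange(int(len(t) * factor)) / fs_c
slow_stretch = np.interp(t_new / factor, t, slow_est)
fast_keep = np.resize(fast_est, len(t_new))       # tile perturbation, keeping its rate
scaled = slow_stretch + fast_keep
```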