Summary
We present the framework for a Scalable Phonetic Vocoder (SPV) capable of operating at bit rates from 300 to 1100 bps. The underlying system uses an HMM-based phonetic speech recognizer to estimate the parameters for MELP speech synthesis. We extend this baseline technique in three ways. First, we introduce the concept of predictive time evolution to generate a smoother path for the synthesizer parameters, and show that it improves speech quality. Second, since the quality of the phonetic vocoder's output speech is still limited at such low bit rates, we propose a scalable system in which the accuracy of the MELP parameters is increased by vector quantizing the error signal between the true and phonetic-estimated MELP parameters. Finally, we apply an extremely flexible technique for exploiting correlations in these parameters over time, which we call Joint Predictive Vector Quantization (JPVQ). We show that significant quality improvement can be attained by adding as few as 400 bps to the baseline phonetic vocoder using JPVQ. The resulting SPV system provides a flexible platform for adjusting the phonetic vocoder bit rate and speech quality.
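As a rough illustration of the scalable layer described in this summary, the sketch below vector quantizes the error between a true parameter vector and its phonetic estimate, then adds the chosen codeword back at the decoder. It is a plain memoryless VQ, not the JPVQ scheme itself, which additionally exploits correlation over time. The function names, the toy two-dimensional codebook, the 512-entry codebook size, and the 22.5 ms frame length are all assumptions made for illustration; none of them are specified in the summary.

```python
import math

def nearest_codeword(error, codebook):
    """Index of the codeword closest to `error` in squared Euclidean distance."""
    def dist2(codeword):
        return sum((e - c) ** 2 for e, c in zip(error, codeword))
    return min(range(len(codebook)), key=lambda i: dist2(codebook[i]))

def encode_frame(true_params, phonetic_estimate, codebook):
    """Quantize the error between true and phonetic-estimated parameters."""
    error = [t - p for t, p in zip(true_params, phonetic_estimate)]
    return nearest_codeword(error, codebook)

def decode_frame(phonetic_estimate, index, codebook):
    """Refine the phonetic estimate by adding the transmitted codeword."""
    return [p + c for p, c in zip(phonetic_estimate, codebook[index])]

def added_bit_rate(codebook_size, frame_seconds):
    """Extra bits per second needed to send one codebook index per frame."""
    return math.log2(codebook_size) / frame_seconds

# Hypothetical toy values: two parameters per frame, four codewords.
codebook = [[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25], [0.5, 0.5]]
true_params = [1.0, 2.0]
phonetic_estimate = [0.8, 2.3]

index = encode_frame(true_params, phonetic_estimate, codebook)
refined = decode_frame(phonetic_estimate, index, codebook)

# Assumed frame length of 22.5 ms and an assumed 512-entry codebook.
extra_bps = added_bit_rate(512, 0.0225)
```

Under these assumed numbers, a 9-bit index per frame costs roughly 400 bps, which shows how an enhancement layer of that size could be budgeted. The actual bit allocation used by the SPV system is not stated here.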