Summary
In this work, we address the question of generating accurate likelihood estimates from multi-frame observations in speaker and language recognition. Using a simple theoretical model, we extend the basic assumption of independent frames to include two refinements: a local correlation model across neighboring frames, and a global uncertainty due to train/test channel mismatch. We present an algorithm for discriminative training of the resulting duration model based on logistic regression combined with a bisection search. We show that using this model we can achieve state-of-the-art performance for the NIST LRE07 task. Finally, we show that these more accurate class likelihood estimates can be combined to solve multiple problems using Bayes' rule, so that we can expand our single parametric backend to replace all six separate back-ends used in our NIST LRE submission for both closed and open sets.