Latent topic modeling for audio corpus summarization
August 27, 2011
Conference Paper
Published in:
INTERSPEECH 2011, 27-31 August 2011, pp. 913-916.
Summary
This work presents techniques for automatically summarizing the topical content of an audio corpus. Probabilistic latent semantic analysis (PLSA) is used to learn a set of latent topics in an unsupervised fashion. These latent topics are ranked by their relative importance in the corpus, and a summary of each topic is generated from signature words that aptly describe its content. The paper describes techniques for producing a high-quality summarization, then presents and evaluates an example summarization of conversational data from the Fisher corpus that demonstrates the effectiveness of the approach.
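To make the pipeline in the summary concrete, the following is a minimal sketch of PLSA trained with expectation-maximization on a toy word-count matrix, followed by topic ranking and signature-word selection. It is an illustration under stated assumptions, not the paper's implementation. The summary does not give the ranking or word-scoring formulas, so ranking by the corpus-level topic prior P(z) and scoring words by P(w|z) log(P(w|z)/P(w)) are assumed stand-ins. The function names (`plsa`, `rank_topics`, `signature_words`) and the toy vocabulary are hypothetical, and real input would be counts derived from speech recognizer output rather than hand-written numbers.

```python
import math
import random


def normalize(row):
    """Scale a list of non-negative values so that it sums to 1."""
    total = sum(row) or 1.0
    return [v / total for v in row]


def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit P(w|z) and P(z|d) with EM on a documents-by-words count matrix."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(n_words)])
             for _ in range(n_topics)]
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(n_topics)])
             for _ in range(n_docs)]
    for _ in range(n_iter):
        # Accumulators start at a tiny constant to avoid exact zeros.
        new_w_z = [[1e-12] * n_words for _ in range(n_topics)]
        new_z_d = [[1e-12] * n_topics for _ in range(n_docs)]
        for d in range(n_docs):
            for w in range(n_words):
                if counts[d][w] == 0:
                    continue
                # E-step: posterior P(z | d, w) is proportional to P(z|d) P(w|z).
                post = normalize([p_z_d[d][z] * p_w_z[z][w]
                                  for z in range(n_topics)])
                # M-step accumulation: expected counts per topic.
                for z in range(n_topics):
                    new_w_z[z][w] += counts[d][w] * post[z]
                    new_z_d[d][z] += counts[d][w] * post[z]
        p_w_z = [normalize(row) for row in new_w_z]
        p_z_d = [normalize(row) for row in new_z_d]
    return p_w_z, p_z_d


def rank_topics(p_z_d, counts):
    """Assumed ranking: P(z) = sum_d P(d) P(z|d), with P(d) from document length."""
    total = sum(sum(row) for row in counts)
    n_topics = len(p_z_d[0])
    p_z = [sum(sum(counts[d]) / total * p_z_d[d][z] for d in range(len(counts)))
           for z in range(n_topics)]
    order = sorted(range(n_topics), key=lambda z: -p_z[z])
    return order, p_z


def signature_words(p_w_z, counts, vocab, topic, top_n=3):
    """Assumed scoring: P(w|z) * log(P(w|z) / P(w)), highest scores first."""
    total = sum(sum(row) for row in counts)
    p_w = [sum(row[w] for row in counts) / total for w in range(len(vocab))]
    score = [p * math.log((p + 1e-12) / (p_w[w] + 1e-12))
             for w, p in enumerate(p_w_z[topic])]
    best = sorted(range(len(vocab)), key=lambda w: -score[w])
    return [vocab[w] for w in best[:top_n]]


# Toy corpus: rows are documents, columns follow the order of `vocab`.
vocab = ["music", "guitar", "band", "dog", "cat", "pet"]
counts = [
    [4, 3, 2, 0, 0, 0],
    [3, 2, 4, 0, 1, 0],
    [0, 0, 0, 5, 3, 2],
    [0, 1, 0, 2, 4, 3],
]

p_w_z, p_z_d = plsa(counts, n_topics=2)
order, p_z = rank_topics(p_z_d, counts)
for z in order:
    print(z, round(p_z[z], 3), signature_words(p_w_z, counts, vocab, z))
```

Each printed line gives a topic index, its estimated share of the corpus, and its top-scoring words, which together form a short summary of that topic. Because EM converges to a local optimum, results can vary with the random seed and the number of topics.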