Publications


LLTools: machine learning for human language processing

December 5, 2016

Conference Paper

Author:

Cagri K. Dagli

…

Published in:

30th Conf. on Neural Info. Processing Syst., NIPS 2016, 5-10 December 2016.

Topic:

big data

R&D area:

Cyber Security and Information Sciences

R&D group:

Summary

Machine learning methods in Human Language Technology have reached a stage of maturity where widespread use is both possible and desirable. The MIT Lincoln Laboratory LLTools software suite takes a step toward this goal by offering a set of easily accessible frameworks for incorporating speech, text, and entity resolution components into larger applications. For the speech processing component, the pySLGR (Speaker, Language, Gender Recognition) tool provides signal processing, standard feature analysis, speech utterance embedding, and machine learning modeling methods in Python. The text processing component in LLTools extracts semantically meaningful insights from unstructured data via entity extraction, topic modeling, and document classification. The entity resolution component in LLTools provides approximate string matching, author recognition, and graph-based methods for identifying and linking different instances of the same real-world entity. We show through two applications that LLTools can be used to rapidly create and train research prototypes for human language processing.


Finding malicious cyber discussions in social media

January 1, 2016

Journal Article

Author:

Richard P. Lippmann

…

Published in:

Lincoln Laboratory Journal, Vol. 22, No. 1, 2016, pp. 46-59.

Topic:

topic identification

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

Today's analysts manually examine social media networks to find discussions concerning planned cyber attacks, attacker techniques and tools, and potential victims. Applying modern machine learning approaches, Lincoln Laboratory has demonstrated the ability to automatically discover such discussions from Stack Exchange, Reddit, and Twitter posts written in English.


Topic identification based extrinsic evaluation of summarization techniques applied to conversational speech

March 25, 2012

Conference Paper

Author:

David F. Harwath

…

Timothy J. Hazen

Published in:

Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, 25-30 March 2012, pp. 5073-5076.

Topic:

topic identification

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

Document summarization algorithms are most commonly evaluated according to the intrinsic quality of the summaries they produce. An alternate approach is to examine the extrinsic utility of a summary, measured by the ability of the summary to aid a human in the completion of a specific task. In this paper, we use topic identification as a proxy for relevancy determination in the context of an information retrieval task, and a summary is deemed effective if it enables a user to determine the topical content of a retrieved document. We utilize Amazon's Mechanical Turk service to perform a large-scale human study contrasting four different summarization systems applied to conversational speech from the Fisher Corpus. We show that these results appear to be correlated with the performance of an automated topic identification system, and argue that this automated system can act as a low-cost proxy for a human evaluation during the development stages of a summarization system.


Topic modeling for spoken documents using only phonetic information

December 15, 2011

Conference Paper

Author:

Timothy J. Hazen

…

Published in:

ASRU 2011, IEEE Workshop on Automatic Speech Recognition & Understanding, 11-15 December 2011, pp. 395-400.

Topic:

topic identification

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

This paper explores both supervised and unsupervised topic modeling for spoken audio documents using only phonetic information. In cases where word-based recognition is unavailable or infeasible, phonetic information can be used to indirectly learn and capture information provided by topically relevant lexical items. In some situations, a lack of transcribed data can prevent supervised training of a same-language phonetic recognition system. In these cases, phonetic recognition can use cross-language models or self-organizing units (SOUs) learned in a completely unsupervised fashion. This paper presents recent improvements in topic modeling using only phonetic information. We present new results using recently developed techniques for discriminative training for topic identification used in conjunction with recent improvements in SOU learning. A preliminary examination of the use of unsupervised latent topic modeling for unsupervised discovery of topics and topically relevant lexical items from phonetic information is also presented.


MCE training techniques for topic identification of spoken audio documents

November 1, 2011

Journal Article

Author:

Timothy J. Hazen

Published in:

IEEE Trans. Audio, Speech, Language Proc., Vol. 19, No. 8, November 2011, pp. 2451-2461.

Topic:

topic identification

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

In this paper, we discuss the use of minimum classification error (MCE) training as a means for improving traditional approaches to topic identification such as naive Bayes classifiers and support vector machines. A key element of our new MCE training techniques is their ability to efficiently apply jackknifing or leave-one-out training to yield improved models which generalize better to unseen data. Experiments were conducted using recorded human-human telephone conversations from the Fisher Corpus using feature vector representations from word-based automatic speech recognition lattices. Sizeable improvements in topic identification accuracy using the new MCE training techniques were observed.
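For orientation, the kind of naive Bayes topic classifier that the summary names as a traditional baseline can be sketched as follows. This is a minimal illustration only: it uses plain word tokens and add-alpha smoothing, and it does not implement the MCE training or the lattice-based features described in the paper. The function names, the `alpha` parameter, and the toy training data are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs, alpha=1.0):
    """Estimate a log prior and add-alpha smoothed log likelihoods per topic."""
    topic_counts = Counter(topic for topic, _ in labeled_docs)
    word_counts = defaultdict(Counter)
    for topic, words in labeled_docs:
        word_counts[topic].update(words)
    vocab = {w for counts in word_counts.values() for w in counts}
    model = {}
    for topic, counts in word_counts.items():
        total = sum(counts.values()) + alpha * len(vocab)
        model[topic] = (
            math.log(topic_counts[topic] / len(labeled_docs)),
            {w: math.log((counts[w] + alpha) / total) for w in vocab},
        )
    return model

def classify(model, words):
    """Return the topic with the highest log score; unseen words are skipped."""
    def score(topic):
        log_prior, log_like = model[topic]
        return log_prior + sum(log_like[w] for w in words if w in log_like)
    return max(model, key=score)

# Hypothetical toy data, standing in for labeled conversation transcripts.
training = [
    ("sports", "the team won the match".split()),
    ("sports", "the coach praised the team".split()),
    ("politics", "the senate passed the bill".split()),
    ("politics", "the vote on the bill failed".split()),
]
model = train_naive_bayes(training)
predicted = classify(model, "the team lost the match".split())
```

Discriminative training of the kind the paper describes would adjust such a model's parameters to reduce classification errors directly, rather than relying on the count-based estimates used here.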


Latent topic modeling for audio corpus summarization

August 27, 2011

Conference Paper

Author:

Timothy J. Hazen

Published in:

INTERSPEECH 2011, 27-31 August 2011, pp. 913-916.

Topic:

topic identification

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

This work presents techniques for automatically summarizing the topical content of an audio corpus. Probabilistic latent semantic analysis (PLSA) is used to learn a set of latent topics in an unsupervised fashion. These latent topics are ranked by their relative importance in the corpus and a summary of each topic is generated from signature words that aptly describe the content of that topic. This paper presents techniques for producing a high-quality summarization. An example summarization of conversational data from the Fisher corpus that demonstrates the effectiveness of our approach is presented and evaluated.


Topic identification

January 1, 2011

Book Chapter

Author:

Timothy J. Hazen

Published in:

Chapter 12, Spoken Language Understanding: Systems for Extracting Semantic Information from Speech, Gokhan Tur and Renato De Mori, eds., 2011, pp. 319-356.

Topic:

topic identification

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

In this chapter we discuss the problem of identifying the underlying topics being discussed in spoken audio recordings. We focus primarily on the issues related to supervised topic classification or detection tasks using labeled training data, but we also discuss approaches for other related tasks including novel topic detection and unsupervised topic clustering. The chapter provides an overview of the common tasks and data sets, evaluation metrics, and algorithms most commonly used in this area of study.


Direct and latent modeling techniques for computing spoken document similarity

December 12, 2010

Conference Paper

Author:

Timothy J. Hazen

Published in:

SLT 2010, IEEE Workshop on Spoken Language Technology, 12-15 December 2010.

Topic:

topic identification

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

Document similarity measures are required for a variety of data organization and retrieval tasks including document clustering, document link detection, and query-by-example document retrieval. In this paper we examine existing and novel document similarity measures for use with spoken document collections processed with automatic speech recognition (ASR) technology. We compare direct vector space approaches using the cosine similarity measure applied to feature vectors constructed with various forms of term frequency inverse document frequency (TF-IDF) normalization against latent topic modeling approaches based on latent Dirichlet allocation (LDA). In document link detection experiments on the Fisher Corpus, we find that an approach that applies bagging to models derived from LDA substantially outperforms the direct vector space approach.
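The direct vector space approach mentioned in the summary can be illustrated with a small sketch: build TF-IDF weighted vectors and compare them with cosine similarity. The weighting below (relative term frequency times log inverse document frequency) is only one of the "various forms" of TF-IDF and is an assumption, as are the function names and the toy documents; the paper's ASR-derived features and LDA-based models are not reproduced here.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero norm."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy documents standing in for ASR transcripts.
docs = [
    "the dog chased the ball in the park".split(),
    "the dog fetched the ball".split(),
    "the senate passed the budget bill".split(),
]
vecs = tfidf_vectors(docs)
sim_related = cosine(vecs[0], vecs[1])
sim_unrelated = cosine(vecs[0], vecs[2])
```

In this toy set the word "the" occurs in every document, so its IDF weight is zero and it contributes nothing to any similarity score; the first and third documents share no other term and therefore score zero.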


Multi-class SVM optimization using MCE training with application to topic identification

March 15, 2010

Conference Paper

Author:

Timothy J. Hazen

Published in:

Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, ICASSP, 15 March 2010, pp. 5350-5353.

Topic:

topic identification

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

This paper presents a minimum classification error (MCE) training approach for improving the accuracy of multi-class support vector machine (SVM) classifiers. We have applied this approach to topic identification (topic ID) for human-human telephone conversations from the Fisher corpus using ASR lattice output. The new approach yields improved performance over the traditional techniques for training multi-class SVM classifiers on this task.


A hybrid SVM/MCE training approach for vector space topic identification of spoken audio recordings

September 22, 2008

Conference Paper

Author:

Timothy J. Hazen

…

Frederick S. Richardson

Published in:

INTERSPEECH 2008, 22-26 September 2008, pp. 2542-2545.

Topic:

topic identification

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

The success of support vector machines (SVMs) for classification problems is often dependent on an appropriate normalization of the input feature space. This is particularly true in topic identification, where the relative contribution of the common but uninformative function words can overpower the contribution of the rare but informative content words in the SVM kernel function score if the feature space is not normalized properly. In this paper we apply the discriminative minimum classification error (MCE) training approach to the problem of learning an appropriate feature space normalization for use with an SVM classifier. Results are presented showing significant error rate reductions for an SVM-based system on a topic identification task using the Fisher corpus of audio recordings of human conversations.
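The normalization problem described in the summary can be made concrete with a small sketch. With raw word counts, a linear kernel score is dominated by frequent function words; after per-term scaling, content words decide the comparison. Note the substitution: the paper learns its normalization with MCE training, whereas this sketch uses fixed IDF-style weights purely as a stand-in. All names and the toy count vectors are assumptions.

```python
import math
from collections import Counter

def dot(u, v):
    """Linear kernel: dot product of two sparse vectors."""
    return sum(c * v.get(t, 0.0) for t, c in u.items())

def scale(vec, weights):
    """Apply a per-term feature space normalization to a sparse vector."""
    return {t: c * weights[t] for t, c in vec.items()}

# Hypothetical toy count vectors: two sports documents and one politics document.
sports_a = Counter("the the the of of and goal match".split())
sports_b = Counter("the of and and goal match referee".split())
politics = Counter("the the the of of and and vote senate".split())
docs = [sports_a, sports_b, politics]

# Raw counts: shared function words make the cross-topic pair look more similar.
raw_same_topic = dot(sports_a, sports_b)    # 9
raw_cross_topic = dot(sports_a, politics)   # 15

# IDF-style weights (an assumed stand-in for a learned normalization).
doc_freq = Counter(t for d in docs for t in d)
weights = {t: math.log(len(docs) / df) for t, df in doc_freq.items()}

norm_same_topic = dot(scale(sports_a, weights), scale(sports_b, weights))
norm_cross_topic = dot(scale(sports_a, weights), scale(politics, weights))
```

After scaling, the function words that appear in every document receive zero weight, so the same-topic pair scores higher than the cross-topic pair, reversing the ordering produced by the raw counts.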

