Publications

Implicitly-defined neural networks for sequence labeling

Published in:
Annual Meeting of the Association for Computational Linguistics, 31 July 2017.

Summary

In this work, we propose a novel, implicitly defined neural network architecture and describe a method to compute its components. The proposed architecture forgoes the causality assumption previously used to formulate recurrent neural networks and allows the hidden states of the network to be coupled together, enabling potential improvement on problems with complex, long-distance dependencies. Initial experiments demonstrate that the new architecture outperforms both the Stanford Parser and a baseline bidirectional network on the Penn Treebank part-of-speech tagging task, and a baseline bidirectional network on an additional artificial random biased walk task.
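
As a rough illustration of the idea, the sketch below replaces the causal update h[t] = f(W h[t-1] + U x[t]) with a bidirectionally coupled system solved by fixed-point iteration. The weight names (W_f, W_b, U), the tanh nonlinearity, and the plain iterative solver are assumptions for illustration, not the paper's exact formulation or solution method.

```python
# Hypothetical sketch of an implicitly defined recurrent layer. Every hidden
# state is coupled to BOTH neighbors, so the whole sequence of states must be
# found jointly as a fixed point rather than by a causal left-to-right scan.
import numpy as np

def implicit_hidden_states(x, W_f, W_b, U, n_iter=50):
    """x: (T, d_in); returns H: (T, d_h) approximately satisfying
    h[t] = tanh(W_f @ h[t-1] + W_b @ h[t+1] + U @ x[t]) with zero-padded ends."""
    T, d_h = x.shape[0], W_f.shape[0]
    H = np.zeros((T, d_h))
    for _ in range(n_iter):  # simple fixed-point iteration
        H_prev = np.vstack([np.zeros((1, d_h)), H[:-1]])   # h[t-1]
        H_next = np.vstack([H[1:], np.zeros((1, d_h))])    # h[t+1]
        H = np.tanh(H_prev @ W_f.T + H_next @ W_b.T + x @ U.T)
    return H

rng = np.random.default_rng(0)
T, d_in, d_h = 12, 8, 16
# small weight scales help the iteration converge
W_f = 0.3 * rng.standard_normal((d_h, d_h))
W_b = 0.3 * rng.standard_normal((d_h, d_h))
U = rng.standard_normal((d_h, d_in))
H = implicit_hidden_states(rng.standard_normal((T, d_in)), W_f, W_b, U)
```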

Automated provenance analytics: a regular grammar based approach with applications in security

Published in:
9th International Workshop on Theory and Practice of Provenance (TaPP), 22-23 June 2017.

Summary

Provenance collection techniques have been carefully studied in the literature, and there are now several systems to automatically capture provenance data. However, the analysis of provenance data is often left "as an exercise for the reader". The provenance community needs tools that allow users to quickly sort through large volumes of provenance data and identify records that require further investigation. By detecting anomalies in provenance data that deviate from established patterns, we hope to actively thwart security threats. In this paper, we discuss issues with current graph analysis techniques as applied to data provenance, particularly Frequent Subgraph Mining (FSM). Then we introduce Directed Acyclic Graph regular grammars (DAGr) as a model for provenance data and show how they can detect anomalies. These DAGr provide an expressive characterization of DAGs, and by using regular grammars as a formalism, we can apply results from formal language theory to learn the difference between "good" and "bad" provenance. We propose a restricted subclass of DAGr called deterministic Directed Acyclic Graph automata (dDAGa) that guarantees parsing in linear time. Finally, we propose a learning algorithm for dDAGa, inspired by Minimum Description Length for Grammar Induction.
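
The toy sketch below illustrates the flavor of linear-time, bottom-up checking of a DAG with a deterministic automaton: each node's state is determined by its label and its children's states, and a provenance graph is flagged as anomalous when no learned rule covers a node or a root ends in a non-accepting state. The node labels, transition-table format, and classify function are invented for illustration and do not reproduce the dDAGa definition from the paper.

```python
# Toy deterministic bottom-up DAG automaton over a provenance graph.
# State of each node = delta(label, sorted states of its children);
# undefined transitions or non-accepting roots signal an anomaly.
from graphlib import TopologicalSorter

def classify(dag, labels, delta, accepting):
    """dag: node -> set of child nodes; labels: node -> label;
    delta: (label, tuple of child states) -> state."""
    state = {}
    # Children are listed as predecessors, so they are processed first.
    order = TopologicalSorter({n: dag.get(n, set()) for n in labels}).static_order()
    for node in order:
        key = (labels[node], tuple(sorted(state[c] for c in dag.get(node, ()))))
        if key not in delta:
            return "anomalous"           # no learned rule covers this pattern
        state[node] = delta[key]
    roots = set(labels) - {c for cs in dag.values() for c in cs}
    return "ok" if all(state[r] in accepting for r in roots) else "anomalous"

# e.g. a process that reads a file and writes a socket
dag = {"proc": {"file", "sock"}}
labels = {"proc": "process", "file": "read", "sock": "write"}
delta = {("read", ()): "R", ("write", ()): "W",
         ("process", ("R", "W")): "OK"}
print(classify(dag, labels, delta, accepting={"OK"}))  # -> ok
```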

SoK: cryptographically protected database search

Summary

Protected database search systems cryptographically isolate the roles of reading from, writing to, and administering the database. This separation limits unnecessary administrator access and protects data in the case of system breaches. Since protected search was introduced in 2000, the area has grown rapidly; systems are offered by academia, start-ups, and established companies. However, there is no single best protected search system or set of techniques. Designing such systems is a balancing act between security, functionality, performance, and usability. This challenge is made more difficult by ongoing database specialization, as some users will want the functionality of SQL, NoSQL, or NewSQL databases. This database evolution will continue, and the protected search community should be able to quickly provide functionality consistent with newly invented databases. At the same time, the community must accurately and clearly characterize the tradeoffs between different approaches. To address these challenges, we provide the following contributions: (1) an identification of the important primitive operations across database paradigms, where we find there is a small number of base operations that can be used and combined to support a large number of database paradigms; (2) an evaluation of the current state of protected search systems in implementing these base operations, describing the main approaches and tradeoffs for each base operation and putting protected search in the context of unprotected search to identify key gaps in functionality; (3) an analysis of attacks against protected search for different base queries; and (4) a roadmap and tools for transforming a protected search system into a protected database, including an open-source performance evaluation platform and initial user opinions of protected search.
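
As a taste of what a "base operation" looks like, the sketch below implements an equality query over protected index keys using a keyed PRF (HMAC-SHA256), one of the classic building blocks in this space. It is a minimal toy under that assumption, not any particular system from the survey, and it ignores the leakage, update, and result-encryption issues real designs must handle.

```python
# Equality search over PRF-protected keywords: the server stores only HMAC
# outputs, so it can match records to a query token without seeing plaintext.
import hmac, hashlib

KEY = b"client-secret-key"  # held by the data owner, never by the server

def token(keyword: str) -> bytes:
    return hmac.new(KEY, keyword.encode(), hashlib.sha256).digest()

# Client-side indexing: upload (token, record id) pairs.
server_index = {}
for rid, kw in [(1, "alice"), (2, "bob"), (3, "alice")]:
    server_index.setdefault(token(kw), []).append(rid)

# Query: the server sees only an opaque token, not the keyword "alice".
print(server_index.get(token("alice"), []))  # -> [1, 3]
```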

Fabrication security and trust of domain-specific ASIC processors

Summary

Application-specific integrated circuits (ASICs) are commonly used to implement high-performance signal-processing systems for high-volume applications, but their high development costs and inflexible nature make ASICs inappropriate for algorithm development and low-volume DoD applications. In addition, the intellectual property (IP) embedded in the ASIC is at risk when fabricated in an untrusted foundry. Lincoln Laboratory has developed a flexible signal-processing architecture to implement a wide range of algorithms within one application domain, for example, radar signal processing. In this design methodology, common signal-processing kernels such as digital filters, fast Fourier transforms (FFTs), and matrix transformations are implemented as optimized modules, which are interconnected by a programmable wiring fabric similar to the interconnect in a field-programmable gate array (FPGA). One or more programmable microcontrollers are also embedded in the fabric to sequence the operations. This design methodology, termed a coarse-grained FPGA, has been shown to achieve a near-ASIC level of performance. In addition, since the signal-processing algorithms are expressed in firmware that is loaded at runtime, the important application details are protected from an unscrupulous foundry.
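
A conceptual software analogy of this methodology, with invented module names and firmware format: the kernels are fixed, optimized "modules", while the algorithm itself lives in a runtime-loaded sequence table, so the fabricated hardware reveals little about the application.

```python
# Toy model of the coarse-grained FPGA idea: fixed kernel modules wired
# together by data ("firmware") loaded at runtime, not by fabricated logic.
import numpy as np

MODULES = {
    "fir":  lambda x, taps: np.convolve(x, taps, mode="same"),
    "fft":  lambda x: np.fft.fft(x),
    "mag2": lambda x: (x * np.conj(x)).real,
}

# "Firmware": the processing chain is expressed as a sequence table.
firmware = [
    ("fir",  {"taps": np.ones(4) / 4}),  # moving-average filter
    ("fft",  {}),
    ("mag2", {}),                        # power spectrum
]

def run(signal, firmware):
    for name, params in firmware:        # microcontroller-style sequencing
        signal = MODULES[name](signal, **params)
    return signal

spectrum = run(np.random.randn(256), firmware)
```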

Twitter language identification of similar languages and dialects without ground truth

Published in:
Proc. 4th Workshop on NLP for Similar Languages, Varieties and Dialects, 3 April 2017, pp. 73-83.

Summary

We present a new method to bootstrap-filter Twitter language ID labels in our dataset for automatic language identification (LID). Our method combines geolocation, original Twitter LID labels, and Amazon Mechanical Turk to resolve missing and unreliable labels. We are the first to compare LID classification performance using the MIRA algorithm and langid.py. We show high classifier accuracy on different versions of our dataset using only Twitter data, without ground truth, and with very few training examples. We also show how Platt scaling can be used to calibrate MIRA classifier output values into a probability distribution over candidate classes, making the output more intuitive. Our method allows for fine-grained distinctions between similar languages and dialects and allows us to rediscover the language composition of our Twitter dataset.
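
For readers unfamiliar with the calibration step, the sketch below shows generic Platt scaling: a logistic model fitted on held-out margin scores turns raw classifier outputs into probabilities. A linear SVM stands in for MIRA here (an assumption for illustration), since the scaling procedure is the same for any margin-based classifier.

```python
# Generic Platt scaling: fit p = sigmoid(a*s + b) on held-out margin scores s.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_tr, X_cal, y_tr, y_cal = train_test_split(X, y, random_state=0)

clf = LinearSVC().fit(X_tr, y_tr)                 # uncalibrated margin scores
scores = clf.decision_function(X_cal).reshape(-1, 1)

platt = LogisticRegression().fit(scores, y_cal)   # the Platt scaling step
probs = platt.predict_proba(scores)[:, 1]         # calibrated probabilities
print(probs[:5])
```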

Bounded-collusion attribute-based encryption from minimal assumptions

Published in:
20th IACR International Conference on Practice and Theory of Public-Key Cryptography (PKC 2017), 28-31 March 2017.

Summary

Attribute-based encryption (ABE) enables encryption of messages under access policies so that only users with attributes satisfying the policy can decrypt the ciphertext. In standard ABE, an arbitrary number of colluding users, each without an authorized attribute set, cannot decrypt the ciphertext. However, all existing ABE schemes rely on concrete cryptographic assumptions such as the hardness of certain problems over bilinear maps or integer lattices. Furthermore, it is known that ABE cannot be constructed from generic assumptions such as public-key encryption using black-box techniques. In this work, we revisit the problem of constructing ABE that tolerates collusions of arbitrary but a priori bounded size. We present two ABE schemes secure against bounded collusions that require only semantically secure public-key encryption. Our schemes achieve significant improvement in the size of the public parameters, secret keys, and ciphertexts over the previous construction of bounded-collusion ABE from minimal assumptions by Gorbunov et al. (CRYPTO 2012). In fact, in our second scheme, the size of ABE secret keys does not grow at all with the collusion bound. As a building block, we introduce a multidimensional secret-sharing scheme that may be of independent interest. We also obtain bounded-collusion symmetric-key ABE (which requires the secret key for encryption) by replacing the public-key encryption with symmetric-key encryption, which can be built from the minimal assumption of one-way functions.
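
For background on the secret-sharing flavor of building block mentioned above, the sketch below shows plain additive secret sharing over a prime field. This is textbook material, not the paper's multidimensional scheme; it is included only to make the notion of a secret-sharing building block concrete: any n-1 shares reveal nothing, while all n shares reconstruct the secret.

```python
# Additive secret sharing over a prime field (generic background, not the
# paper's multidimensional scheme).
import secrets

P = 2**61 - 1  # a Mersenne prime as the field modulus

def share(secret: int, n: int):
    shares = [secrets.randbelow(P) for _ in range(n - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

def reconstruct(shares):
    return sum(shares) % P

s = 123456789
shares = share(s, 5)
assert reconstruct(shares) == s
assert reconstruct(shares[:4]) != s  # one share short -> (almost surely) garbage
```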

Detecting virus exposure during the pre-symptomatic incubation period using physiological data

Summary

Early pathogen exposure detection allows better patient care and faster implementation of public health measures (patient isolation, contact tracing). Existing exposure detection most frequently relies on overt clinical symptoms, namely fever, during the infectious prodromal period. We have developed a robust machine learning method to better detect asymptomatic states during the incubation period using subtle, sub-clinical physiological markers. Using high-resolution physiological data from non-human primate studies of Ebola and Marburg viruses, we pre-processed the data to reduce short-term variability and normalize diurnal variations, then provided these to a supervised random forest classification algorithm. In most subjects detection is achieved well before the onset of fever; subject cross-validation led to 52±14h mean early detection (at >0.90 area under the receiver-operating characteristic curve). Cross-cohort tests across pathogens and exposure routes also led to successful early detection (28±16h and 43±22h, respectively). We discuss which physiological indicators are most informative for early detection and options for extending this capability to lower data resolution and wearable, non-invasive sensors.
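
The supervised step can be sketched generically as below: a random forest scores feature windows and is evaluated by the area under the ROC curve. The synthetic features, labels, and train/test split are placeholders; the paper's physiological preprocessing, diurnal normalization, and subject-wise cross-validation are not reproduced here.

```python
# Generic random-forest classification scored by ROC AUC, the evaluation
# metric used in the paper; the data below is synthetic stand-in material.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 12))   # stand-in physiological feature windows
y = (X[:, 0] + 0.5 * X[:, 1] + rng.standard_normal(1000) > 1).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```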

SIAM data mining "brings it" to annual meeting

Summary

The Data Mining Activity Group is one of SIAM's most vibrant and dynamic activity groups. To better share our enthusiasm for data mining with the broader SIAM community, our activity group organized six minisymposia at the 2016 Annual Meeting. These minisymposia included 48 talks organized by 11 SIAM members.

Learning by doing, High Performance Computing education in the MOOC era

Published in:
J. Parallel Distrib. Comput., Vol. 105, July 2017, pp. 105-115.

Summary

The High Performance Computing (HPC) community has spent decades developing tools that teach practitioners to harness the power of parallel and distributed computing. To create scalable and flexible educational experiences for practitioners in all phases of a career, we turn to Massive Open Online Courses (MOOCs). We detail the design of a unique self-paced online course that incorporates a focus on parallel solutions, personalization, and hands-on practice to familiarize student-users with their target system. Course material is presented through the lens of common HPC use cases and the strategies for parallelizing them. Using personalized paths, we teach researchers how to recognize the alignment between scientific applications and traditional HPC use cases, so they can focus on learning the parallelization strategies key to their workplace success. At the conclusion of their learning path, students should be capable of achieving performance gains on their HPC system.

Interactive synthesis of code-level security rules

Published in:
Thesis (M.S.)--Northeastern University, 2017.

Summary

Software engineers inadvertently introduce bugs into software during the development process, and these bugs can potentially be exploited once the software is deployed. As the size and complexity of software systems increase, it is important that we are able to verify and validate not only that the software behaves as expected, but also that it does not violate any security policies or properties. One approach to reducing software vulnerabilities is to use a bug detection tool during the development process. Many bug detection techniques are limited by the burdensome and error-prone process of manually writing a bug specification. Other techniques are able to learn specifications from examples but are limited in the types of bugs they can discover. This work presents a novel, general approach for synthesizing security rules for C code. The approach combines human knowledge with an interactive logic programming synthesis system to learn Datalog rules for various security properties. The approach has been successfully used to synthesize rules for three intraprocedural security properties: (1) out-of-bounds array accesses, (2) return value validation, and (3) double-freed pointers. These rules have been evaluated on randomly generated C code and yield a 0% false positive rate and false negative rates of 0%, 20%, and 0%, respectively.
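
To make the synthesized properties concrete, the sketch below hand-writes the double-free check as ordinary Python over a simplified (op, pointer) event trace. The paper's system learns such rules as Datalog over facts extracted from C code; the trace format and function here are invented for illustration.

```python
# Toy double-free detector over a simplified allocation/free event trace.
def double_frees(trace):
    freed, bugs = set(), []
    for i, (op, ptr) in enumerate(trace):
        if op == "free":
            if ptr in freed:
                bugs.append((i, ptr))   # freeing an already-freed pointer
            freed.add(ptr)
        elif op in ("malloc", "realloc"):
            freed.discard(ptr)          # pointer is live again
    return bugs

trace = [("malloc", "p"), ("free", "p"), ("free", "p")]
print(double_frees(trace))  # -> [(2, 'p')]
```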