Janani Balaji, Pierre Bhoorasingh, Ashlynn Daughton,
Xinli Geng, Saurabh Jha, Ryan Lee,
Raghava Mutharaju, Qi Song, Sowmya Sridhar,
Nancy (Xin Ru) Wang

Janani Balaji

Department of Computer Science
Georgia State University

Poster 63

X-team 6

Whitepaper:
Challenges in Massive Graph Databases

Graph databases offer an efficient way to store and access inter-connected data. However, as the graph grows, querying the entire graph requires multiple trips to the storage device to filter and gather data based on the query. These I/O accesses are expensive operations that immensely slow down query response time and prevent us from fully exploiting the graph-specific benefits that graph databases offer. There are a few shortcomings unique to graphs that prevent us from developing a high-performance graph database without compromising on scalability. This white paper reviews those challenges and suggests some solutions to overcome them.

Bio:

I am a PhD student at the department of Computer Science at Georgia State University. My area of research is graph databases. I am specifically interested in developing a fast and scalable solution for storing and retrieving large graphs. My current research focus is to develop a completely distributed graph storage structure that works both as an online graph database and also provides a low-latency platform for executing complex graph analysis operations on big graph data.

Interest areas:
Big Graph Databases, Big Graph Data Analysis, Graph Algorithms

Pierre Bhoorasingh

Chemical Engineering
Northeastern University

Poster 65

X-team 6

Whitepaper:
Species identification of detailed kinetic models for direct comparison

The typical publication format of chemical kinetic models does not contain species connectivity information, preventing efficient model comparison. A semi-automated tool has been developed to assist a user in determining the species connectivities in a kinetic model. The tool helps identify large kinetic and thermodynamic discrepancies between models, as well as duplicate species within the same kinetic model.

Bio:

Automated tools for improved mechanism generation

Interest areas:
Chemical Kinetics, Mechanism Generation

Ashlynn Daughton

Systems Analysis and Surveillance
Los Alamos National Lab

Poster 10

X-team 6

Whitepaper:
Use of Historic Disease Data to Facilitate Awareness and Inform Control Measures

Infectious diseases have recognizable patterns that have been documented for decades, but have not been fully exploited. We have developed an application that seeks to use similarities in historic infectious disease outbreak data to inform situational awareness of current outbreaks for a wide number of infectious diseases, even in contexts where minimal data is available.

Bio:

I am in a team at Los Alamos National Laboratory that works to develop tools and applications to facilitate situational awareness in the event of a disease outbreak. Our current tools include the SWAP, the BRD and the BARD. The SWAP is an analytic that uses historic outbreak data to better understand current outbreaks, while the BRD and the BARD are databases of public health data and epidemiological models. Current research focuses on military infectious disease outbreaks, and comparisons of that data to outbreaks in civilian populations.

Interest areas:
Infectious disease, Disease surveillance, Non-traditional data streams for disease surveillance

Xinli Geng

Department of Civil and Environmental Engineering
University of Nevada, Reno

Poster 99

X-team 6

Whitepaper:
Vehicle to Pedestrian Communication Based on Client-Server Architecture

In this white paper, I present an architecture for a Vehicle-to-Pedestrian system, which would improve the safety of pedestrians and make a significant contribution to the development of connected vehicles. However, several challenges remain before the full system can be realized.

Bio:

During my master's studies, I analyzed wireless communication data using Hadoop. Now, as a PhD student, I mainly analyze connected vehicle and transportation data.

Interest areas:
Big data, Data analysis, Connected vehicles, Autonomous vehicle control, Driver behavior modeling

Saurabh Jha

Computer Science
University of Illinois at Urbana Champaign

Poster 20

X-team 6

Whitepaper:
LASE: Log Analysis and Storage Engine for Resiliency Study

There is a need to build exascale computers to further the progress of scientific studies and meet the ever-growing demand for computing power. One of the critical problems – if not the most critical problem – in reaching the exascale computing goal by the end of the decade is “designing fault tolerant applications and systems that can reach sustained petaflops to exaflops of performance”. Due to the high number of errors and failures in such complex systems, it has become important to understand the reasons for errors and failures and to take proactive action to contain them in support of exascale computation. Logs serve as an important source of information for such studies. However, due to the nature and scale of these logs, it has become difficult, if not impossible, to process and extract meaningful information from them. LASE is a log analysis and storage engine that brings various techniques together in a unified framework that can handle petabytes of data and assist in building models for failure diagnosis, prediction, and anomaly detection.

Bio:

My research focuses on building large-scale HPC and cloud systems, with a special focus on building resilient next-generation exascale systems for solving complex computational problems. Today, peak performance is growing faster than resilience. We have been able to attain petaflops of performance, but we do not have methods to ensure that these systems can run applications at sustained petaflop performance for long periods. Because of the scale and complexity of these systems, the applications running on them encounter a huge number of errors and failures, and we do not have methods to handle these errors at scale. Errors and failures are expected to rise exponentially with increasing system scale. Although, as a designer, it is important to reduce errors and failures in the first place, given the scale of these systems errors are inevitable. Thus, there is a need for techniques that allow applications to compute through errors and failures. In order to build the next generation of large-scale systems, we need to learn from past experiences of deploying clouds and HPC systems at scale by analyzing logs from these systems. The challenge in analyzing these logs is twofold: (1) the large dataset size, on the order of petabytes, and (2) noise in the dataset. I am trying to build frameworks and scalable algorithms for analyzing these system logs.

Interest areas:
Design of Fault Tolerant and Resilient Systems, Data Science, Systems

Ryan Lee

Department of Civil and Environmental Engineering
Villanova University

Poster 62

X-team 6

Whitepaper:
Data Challenges in Stormwater Research: Extracting Event-Based Datasets from Hydrologic Monitoring Databases

Environmental monitoring involving large amounts of time series data is increasingly used in research to improve stormwater green infrastructure. The data easily exceeds the capacity of standard engineering desktop software, and many engineers and researchers are left unable to properly utilize the data being generated. Tools or methods are needed that combine time series analysis and/or functional processing languages with database queries to make the proper working dataset accessible to the engineering community.

Bio:

I am interested in integrating probabilistic uncertainties into hydrologic models, and learning about data science and its application to environmental data. I am a part-time data manager and analyst for the Water Resources Engineering group at Villanova University as well as a stay-at-home father.

Interest areas:
Hydrologic Data Science, Applied Statistics

Raghava Mutharaju

Computer Science
Wright State University

Poster 13

X-team 6

Whitepaper:
Distributed Reasoning over Ontology Streams and Large Knowledge Base

With the rapid increase in the velocity and the volume of data, it is becoming increasingly difficult to effectively analyze the data so as to extract knowledge from it. Use of background knowledge (domain knowledge captured in the form of ontologies) and reasoning (to correlate and infer facts) can prove to be useful in tackling the Big Data monster. But existing reasoning approaches are not scalable. In this paper, we present a distributed reasoning solution that can scale with the data.

Bio:

I work on distributed ontology reasoning and SPARQL query processing. I am also interested in ontology modeling and Semantic Web applications.

Interest areas:
Knowledge Representation, Reasoning, Distributed Systems

Qi Song

School of Electrical Engineering and Computer Science
Washington State University

Poster 28

X-team 6

Whitepaper:
Knowledge Search Made Easy: Effective Knowledge Graph Summarization and Applications

The rising Big Data tide requires powerful techniques to effectively search for useful knowledge in information systems such as knowledge bases and knowledge graphs. Accessing and searching complex knowledge graphs is difficult for end users due to query ambiguity, data heterogeneity, and large-scale data. We propose to develop effective knowledge summarization techniques to make the knowledge search process easy for end users. The knowledge graph summaries not only help users understand complex knowledge data and search results, but can also suggest reasonable queries and support fast knowledge search. Our research will benefit a number of knowledge discovery applications including web and scientific search, social network analysis, cyber security, and health informatics.

Bio:

I am a PhD student working with Yinghui Wu and Jana Doppa in the School of Electrical Engineering and Computer Science at Washington State University. I received my master's and bachelor's degrees from Beihang University (Beijing, China) in 2015 and 2012. My research interests are Big Data and graph database systems, especially using graph databases and querying techniques to handle knowledge graphs. With the development of knowledge bases, graph-related techniques can be used to find useful knowledge behind the data. But querying large, heterogeneous graphs is expensive, especially under the time constraints and cognitive limits of end users. We use graph summarization techniques to perform queries on small but information-preserving graph summaries, which reduces the query space and time. Further, we want to explore query optimizations that trade off query accuracy against execution time.

Interest areas:
Data Science, Big Data, Databases, Machine Learning

Sowmya Sridhar

Computer Science
New York University, School of Engineering

Poster 98

X-team 6

Whitepaper:
Weather Data Characterization Tool

The use of weather data in data analytics has become widely prevalent for a variety of applications, which aid in strategic business decision making for disciplines ranging from healthcare, transportation and planning, to economics, and in research investigations in the physical sciences. This has increased the demand for access to weather data in the form of daily, monthly, and annual summaries which contain values indicating the temperature (min, max, and average), precipitation, wind speed, snowfall and other parameters in record format. Consumers of the open weather dataset provided by the National Centers for Environmental Information [1] face a major challenge: they often lack the requisite domain knowledge to interpret the detailed meteorological data available. Hence, data scientists and other consumers of the data must write their own versions of tooling to translate the values into meaningful descriptions. The objective of this paper is to advocate for open data set publishers to supply data set interpretation tools (DSITs) to facilitate consumption of the released data. Specifically, we describe a DSIT for characterizing the weather on a particular day. The tool produces a data set, which is consumable by anyone without requiring them to possess expertise in the meteorological domain. DSITs are themselves analytics in that they are encoded representations of expert knowledge, which provide actionable insight.

Bio:

I'm an NYU student pursuing a Master's in Computer Science. My areas of interest lie in algorithm design and data science.

Interest areas:
Data Science, Algorithm Design

Nancy (Xin Ru) Wang

Computer Science and Engineering
University of Washington

Poster 43

X-team 6

Whitepaper:
Decoding neural signals with natural multimodal data

This paper outlines our project that will use deep and unsupervised techniques to analyze a large multimodal natural (non-experimental) dataset, including simultaneous video, audio and ECoG/EEG signals for computational neuroscience and brain-computer interface applications. This project combines techniques from multiple fields in order to fully leverage the multimodality of the dataset.

Bio:

I have been in my PhD program in Computer Science and Engineering at the University of Washington in Seattle, Washington since September 2014. I am a part of GRIDLab and Neural Systems Lab, primarily advised by Dr. Rajesh Rao. I am also advised by Dr. Jeff Ojemann and Dr. Ali Farhadi with numerous collaborations with many awesome scientists. I am fascinated by brains, computers, and any cognitive system. In my research, my goal is to learn algorithms from the brain to help machine learning and use state of the art machine learning techniques to learn more about the brain.

Interest areas:
Computational Neuroscience, Brain Machine Interfaces, Machine Learning