Cosmological experiments in condensed matter systems
We study topological defects in hexagonal manganites to help understand the evolution of our Universe through the Kibble-Zurek mechanism. This work requires counting defect density at large scale and recording the coordinates of each vortex core in optical images. This data analysis currently takes us several months, so we are seeking a more efficient way to do it; such a method would be of great benefit to our future research.
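As one illustration of how such counting might be automated, the sketch below applies off-the-shelf Laplacian-of-Gaussian blob detection to an optical image; the file name, sigma range, and detection threshold are hypothetical placeholders, not parameters from this study.

```python
# Hypothetical sketch: automated vortex-core counting on an optical image.
# Assumes cores appear as roughly Gaussian bright spots; file name and
# detector settings below are illustrative placeholders.
import numpy as np
from skimage import io, color
from skimage.feature import blob_log

image = io.imread("vortex_image.png")
if image.ndim == 3:                              # collapse RGB to grayscale
    image = color.rgb2gray(image)
image = (image - image.min()) / (np.ptp(image) + 1e-12)  # normalize to [0, 1]

# Laplacian-of-Gaussian blob detection: each returned row is (y, x, sigma).
blobs = blob_log(image, min_sigma=2, max_sigma=10, threshold=0.1)

coords = blobs[:, :2]                            # (y, x) of candidate cores
area_px = image.shape[0] * image.shape[1]
print(f"{len(coords)} candidate cores; "
      f"density = {len(coords) / area_px:.2e} per pixel")
```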
Julie van der Hoop
Joint Program in Oceanography (Biology)
Integrating animal sensing systems
The next breakthroughs in wearable technology, for humans or animals, require integrated sensing systems.
School of Computing, Informatics, and Decision Systems Engineering
Discovering Bias in Big Social Media Data
One fundamental problem with social media mining is getting access to representative, reliable data. While companies like Facebook have massive amounts of data, they do not share this data with the research community at large. The few sites that do share their data do so through APIs that give researchers access to a portion of the overall data generated on the site. Twitter, one example of a social media site that shares its data, allows researchers access to at most 1% of all the posts generated on the site each day through its API, and is perhaps the most lenient when it comes to sharing data with the research community. While Twitter’s APIs come as a welcome relief to those in the area of social media mining, their ability to represent the true activity on the social media site has become a concern to researchers in recent years. Finding representative samples of social media data is a widely recognized problem that researchers must address in order to ensure the veracity of their research results. Herein we define the problem and outline two state-of-the-art solutions.
Environmental Science / Ecosystem Restoration
Color Analysis of Crowdsourced Images for Ecological Monitoring
Remote sensing technology, such as satellite imagery, is a powerful tool for studying spatial ecology. However, understanding spatial ecology often requires finer scales than are afforded by satellite imagery, and the need for “ground-truthing” still exists. Leveraging “Big Data,” or more specifically, geo-tagged and time-stamped images provided through open-source online networks, may offer a solution to help better understand scale and pattern in ecological systems.
Climate change refuges in the oceans
Identify coral reef refugia in the Pacific, Indian, and Atlantic Oceans under differing climate change scenarios, using climate-envelope models in combination with high-resolution environmental data at a global scale.
Uncovering the Unknown: A New Approach in Analyzing Microbiome Data
In microbiome studies, the process of normalizing samples is still the subject of immense debate. We argue that the most straightforward approach to normalizing samples is to calculate the proportions of species in each sample. In this paper we introduce a novel statistic for estimating multinomial proportions when the total number of possible species is unknown. We show why observed species abundances are poor estimators of the true proportions and how coverage estimators can improve the accuracy of proportion estimates.
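The paper's novel statistic is not reproduced here, but a minimal sketch of a classical Good-Turing coverage adjustment illustrates the general idea: shrink the plug-in proportions and reserve the remaining probability mass for species never sampled.

```python
# Minimal sketch of a coverage-adjusted proportion estimate (classical
# Good-Turing), illustrating the general idea rather than the paper's
# specific statistic. Counts below are toy data.
import numpy as np

counts = np.array([120, 45, 30, 8, 3, 1, 1, 1])   # observed species counts
n = counts.sum()
f1 = np.sum(counts == 1)                           # number of singletons

coverage = 1.0 - f1 / n        # Good-Turing estimate of sample coverage
naive = counts / n             # plug-in proportions ignore unseen species
adjusted = coverage * naive    # shrink observed mass; 1 - coverage is
                               # reserved for species never sampled
print("naive:   ", naive.round(4))
print("adjusted:", adjusted.round(4))
print("estimated unseen mass:", round(1 - coverage, 4))
```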
Large Scale Adaptive Anonymity via Parallel Approximate b-Matching
Data privacy is a necessary feature for data science applications. We discuss the potential of k-Anonymity, a privacy algorithm, in the context of big data. We show some of the limitations of k-Anonymity and propose a heuristic solution to those problems. We also present the applicability of k-Anonymity to different domains of data science.
Re-evaluating the paradigmatic presuppositions of molecular biology in the context of big data
Every piece of information extracted in data analysis also assumes a model; without the model, the data would tell you nothing, since there would be no context through which to relate the variables, and the magnitude of the values would be meaningless. The molecular landscape is modeled in a DNA-centric manner that prioritizes certain types of information (singularities) over others (dynamic processes) and in turn constructs a system in which certain avenues of causality are not fully integrated into the model. This paper critiques the current model and points toward directions for alternative exploration. The motivation for this work is to model the complexity of cancer in a new way, in an effort to expand the search area for a solution to cancer.
Learning Efficient Representations for Sequence Retrieval
We explore the problem of matching sequences of high-dimensional vectors to entries in very large sequence databases. When utilizing dynamic time warping distance to compare sequences, the local distance calculations can be prohibitively expensive when the data's dimensionality and intrinsic sampling rate are high. We therefore motivate the need for methods that can learn efficient representations for sequence comparison and discuss potential applications of these techniques.
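A minimal dynamic time warping sketch makes the cost structure concrete; the sequences below are synthetic stand-ins.

```python
# Minimal dynamic time warping (DTW) between sequences of vectors,
# showing where the per-cell local distance cost arises; illustrative only.
import numpy as np

def dtw_distance(A, B):
    """DTW between A (n x d) and B (m x d) with Euclidean local distance."""
    n, m = len(A), len(B)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(A[i - 1] - B[j - 1])  # O(d) per cell
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

rng = np.random.default_rng(0)
query = rng.standard_normal((100, 64))    # high-dimensional query sequence
entry = rng.standard_normal((120, 64))    # one database entry
print(dtw_distance(query, entry))
```

The nested loop visits all n x m cells at O(d) each; learned compact representations aim to shrink d, and potentially the sequence lengths, before any database-scale comparison.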
Systems Analysis and Surveillance
Use of Historic Disease Data to Facilitate Awareness and Inform Control Measures
Infectious diseases have recognizable patterns that have been documented for decades but have not been fully exploited. We have developed an application that uses similarities in historic infectious disease outbreak data to inform situational awareness of current outbreaks for a wide range of infectious diseases, even in contexts where minimal data are available.
Ecology and Evolution
A pipeline for combining crowd-sourced images and computer vision to monitor plant flowering
Using images gathered from the Flickr photo-sharing site to collect data on the timing of flowering plants in Mt. Rainier National Park.
Integrative Biology, Center for Computational Biology and Bioinformatics
White Paper: Integrative Neuroscience
The brain is a fascinatingly complex organ that has been the subject of intense study for centuries, but many mysteries about the brain still remain. Current brain initiatives call for multi-scale integration of the activity and structure of the brain in order to elucidate and link neural circuit dynamics to brain function. As cutting-edge technologies are developed, neuroscience critically depends on developing data repositories, analysis tools, and theories for integrating real-time genomic, connectomic, optogenetic, electrophysiological, and behavioral data.
Distributed Reasoning over Ontology Streams and Large Knowledge Base
With the rapid increase in the velocity and the volume of data, it is becoming increasingly difficult to effectively analyze the data so as to extract knowledge from it. Use of background knowledge (domain knowledge captured in the form of ontologies) and reasoning (to correlate and infer facts) can prove to be useful in tackling the Big Data monster. But existing reasoning approaches are not scalable. In this paper, we present a distributed reasoning solution that can scale with the data.
Materials Science Engineering
Data Mining and Machine Learning to Guide Novel Thermoelectric Development
This white paper describes the possible uses of thermoelectric materials and addresses the problems associated with conducting high-risk studies to synthesize novel compounds from chemical white space. By data mining the ever-increasing number of materials science publications, a comprehensive database is being constructed. Newly developed machine-learning systems are being used to predict the thermoelectric properties of hypothetical materials and bridge the gap between computational tools and experimental needs.
The Smart City as a Platform for Collaboration on Climate Change
Cities are at the forefront of the fight against climate change, because their concentration of resources provides the most environmentally-friendly way of delivering a high quality of life. Sustainable cities combine this advantage with a society-wide commitment to a low-carbon lifestyle. Yet the traditional tools of public administration are poorly equipped to facilitate this collaborative approach. Fortunately, advancements in Information and Communication Technologies (ICT) hold tremendous potential to address these shortcomings. Most promisingly, innovative urban leaders have begun to reshape both government and governance around a vision of a “Smart City” that collects vast amounts of data on the state and performance of its communities and then translates this data into actionable insights. Yet the adoption of these “smart city” innovations remains best described as experimental, as blind aspirations continue to far exceed validated best practices or proven implementation strategies. To bridge this gap, the proposed research project will conduct holistic case studies of three pioneering "smart cities" to identify effective business models for using "smart" infrastructure, data science, and connected citizens to promote community-wide action on climate change.
Department of Computer Science
Multivariate Conditional Outlier Detection and Its Clinical Application
This paper summarizes our research that aims at developing automated methods of multivariate conditional outlier detection, and applying the methods to support clinical decision making. In particular, we are interested in identifying statistically unusual patient care patterns corresponding to medical errors based on data stored in electronic medical record (EMR) systems. We describe the problems and objectives of the research, and outline our model-based outlier detection approach. We also discuss the future directions and expected impacts of the research.
Electrical and Computer Engineering Department
Cloud K-SVD: A Dictionary Learning Algorithm for Big, Distributed Data
This paper studies the problem of data-adaptive representations for big, distributed data. It is assumed that a number of geographically-distributed, interconnected sites have massive local data and they are interested in collaboratively learning a low-dimensional geometric structure (dictionary) underlying these data.
Data Science and Analytics
Pixel-oriented visualization – an aid to analyzing large-scale text data
Classics scholars work with text data that is not just Big, but also Interesting and Complex. We are developing a new pixel-based text visualization technique that displays the hierarchical structure of primary texts with their rich apparatus metadata in an accessible and comparable fashion. As part of this, we are investigating new ways to support focus+context interactions across multiple scales of text. These visualization designs will help scholars engage effectively and efficiently with the long and deep provenance of knowledge that surrounds some of humanity's most important historical works. We anticipate that the successful application of new interactive text-visualization techniques to a complex domain like classics will provide a clear direction for applications to scholarship and learning on text, language, and communication in a wide variety of domains.
Earth and Environment
Deriving process knowledge from data in coastal ecohydrology
Scientists interested in developing robust predictive models should aim for a synthetic modeling approach which combines the predictive power of empirical models with the process-driven understanding of physical models. I examine how this synthetic approach can improve the representation of processes in empirical models of salt marsh hydrology.
LASE: Log Analysis and Storage Engine for Resiliency Study
There is a need to build exascale computers to further the progress of scientific studies and meet the ever-growing demands of computing power. One of the critical problems – if not the most critical problem – in reaching the exascale computing goal by the end of the decade is “designing fault tolerant applications and systems that can reach sustained petaflops to exaflops of performance”. Due to the high number of errors and failures in such complex systems, it has become important to understand the reasons for errors and failures and to take proactive action to contain them in support of exascale computation. Logs serve as an important source of information for such studies. However, due to the nature and scale of these logs, it has become difficult, if not impossible, to process and extract meaningful information from them. LASE is a log analysis and storage engine that brings various techniques together in a unified framework that can handle petabytes of data and assist in building models for failure diagnosis, prediction, and anomaly detection.
Astronomy and Astrophysics
Classification of Intermediate-Luminosity Astronomical Transients
Stars form, live, and die following a lifecycle that depends on both intrinsic properties and environmental factors. Their transient outbursts, interactions, and deaths all encode important information about stellar evolution. Future large surveys, such as LSST, will produce 30+ TB of data daily, which astronomers can use to study these transients. This paper describes possible classification techniques for analyzing the LSST dataset of intermediate-luminosity transients.
Seeing is believing with data visualization
This whitepaper outlines state of the art biodiversity identification practices and new visual methods using Kingdom Plantae and its children as models.
Improving data management and integration within resequencing pipelines
A powerful strategy for identifying the genetic basis of phenotypes is to perform genome-wide association (GWA) analysis. GWA studies that utilize massively parallel sequencing rely on population resequencing pipelines to identify genetic variants. Resequencing pipelines require precise data handling and the integration of data generated across a long series of steps by multiple programs in order to identify issues and confounding factors in analysis, or to surface interesting associations. However, integrating these data is challenging, as it requires extensive file parsing, manipulation, and merging. Here, I propose the development of a database schema resembling an entity-attribute-value (EAV) model for storing the summary data generated at different steps within resequencing pipelines, together with a set of tools enabling integration with this system. This system improves data handling within resequencing pipelines and facilitates the comparison of variables across tools, samples, and pipeline configurations.
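A small sketch shows what an EAV layout for pipeline summary statistics could look like; the table and column names are illustrative, not the proposed schema.

```python
# Hypothetical sketch of an entity-attribute-value (EAV) table for
# resequencing-pipeline summary statistics; names are illustrative.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE pipeline_summary (
        entity    TEXT NOT NULL,   -- e.g. sample or variant identifier
        step      TEXT NOT NULL,   -- pipeline stage that produced the value
        attribute TEXT NOT NULL,   -- e.g. 'mean_coverage', 'mapping_rate'
        value     TEXT NOT NULL,   -- stored as text, cast on retrieval
        PRIMARY KEY (entity, step, attribute)
    )
""")
rows = [
    ("sample_01", "alignment", "mapping_rate", "0.987"),
    ("sample_01", "alignment", "mean_coverage", "31.4"),
    ("sample_01", "variant_calling", "n_snps", "412503"),
]
con.executemany("INSERT INTO pipeline_summary VALUES (?, ?, ?, ?)", rows)

# Compare one attribute across samples without bespoke file parsing:
for r in con.execute("""SELECT entity, value FROM pipeline_summary
                        WHERE step = 'alignment'
                          AND attribute = 'mapping_rate'"""):
    print(r)
```

Because every value lands in one long table keyed by entity, step, and attribute, new tools and new metrics require no schema changes, which is the main appeal of EAV for heterogeneous pipeline output.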
Spatial Regularization for Multitask Learning and Application in fMRI Data Analysis
fMRI data have extremely complicated structure, so efficient and accurate models that incorporate spatial and spectral information are necessary for detecting neuronal activity. In this paper, we formulate the fMRI data using the General Linear Model, treating each voxel as a task, and propose a class of spatial Multi-task Learning models that incorporate the spatial information provided by each task's neighborhood. Results from simulation and a real application show satisfactory performance of the spatial Multi-task Learning algorithms.
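As a toy of the spatial-coupling idea (not the paper's exact formulation), the sketch below ties neighboring voxels' GLM coefficients together with a graph-Laplacian penalty on a one-dimensional chain of voxels.

```python
# Toy sketch of spatially coupled GLM estimation: voxels on a 1-D chain,
# with a graph-Laplacian penalty pulling neighboring coefficient vectors
# toward each other. Illustrative only.
import numpy as np

rng = np.random.default_rng(1)
T, p, V = 200, 3, 10                      # time points, regressors, voxels
X = rng.standard_normal((T, p))           # shared design matrix
true_beta = np.cumsum(rng.standard_normal((V, p)) * 0.1, axis=0) + 1.0
Y = X @ true_beta.T + rng.standard_normal((T, V))   # one column per voxel

# Chain-graph Laplacian: L[v, v] = degree, L[v, w] = -1 for neighbors.
L = np.zeros((V, V))
for v in range(V - 1):
    L[v, v] += 1; L[v + 1, v + 1] += 1
    L[v, v + 1] -= 1; L[v + 1, v] -= 1

lam = 5.0
# Joint normal equations over all voxels:
# (I_V kron X'X + lam * L kron I_p) vec(B) = stacked X'y_v blocks
A = np.kron(np.eye(V), X.T @ X) + lam * np.kron(L, np.eye(p))
b = (X.T @ Y).T.reshape(-1)
beta_hat = np.linalg.solve(A, b).reshape(V, p)
print("mean abs error:", np.abs(beta_hat - true_beta).mean())
```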
Improving Traffic Management Using Big Data
We study the routing problem for vehicle flows through a road network that includes both battery-powered Electric Vehicles (EVs) and Non-Electric Vehicles (NEVs). We seek to optimize a system-centric (as opposed to user-centric) objective aiming to minimize the total elapsed time for all vehicles to reach their destinations considering both traveling times and recharging times for EVs when the latter do not have adequate energy for the entire journey. We are validating the efficiency of our algorithm using real traffic data in terms of “average speed” on the road segments in Eastern Massachusetts provided by the City of Boston.
Computer Science Department
Named Data Networking for Large Scientific Data Management
This paper discusses how Named Data Networking (NDN) reduces the complexity of large-scale scientific data management. Scientific data collections require safe archiving and easy retrieval while maintaining data provenance and integrity. The large size and distributed nature of these datasets complicate an already challenging data management task. NDN, an NSF project investigating future Internet architectures, replaces IP endpoints with hierarchical content names and thereby implicitly overcomes many of the challenges associated with managing scientific data. We describe a framework developed with NDN to reduce these challenges.
Department of Biostatistics
Nonparametric Cluster Significance Testing
We describe a proposed method for testing the statistical significance of putative clusters. Cluster analysis is an unsupervised learning strategy that can be used to identify groups of observations in data sets of unknown structure. Few methods are available that can assess the strength of clusters identified in a data set, and those that are available often rely on distributional assumptions or are not optimized for high-dimensional settings. We propose a novel nonparametric method for testing the null hypothesis that no clusters are present in a given data set, which can be used in both high- and low-dimensional settings with optimal accuracy.
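One simple nonparametric recipe, shown below as a hedged illustration rather than the paper's proposed test, compares the k-means objective on the data against a null in which each feature is independently permuted, destroying joint cluster structure while preserving marginals.

```python
# Monte Carlo cluster-significance sketch: permute each feature column
# independently to build a "no clusters" null for the k-means objective.
# Illustrative only; not the paper's proposed statistic.
import numpy as np
from sklearn.cluster import KMeans

def kmeans_objective(X, k, seed=0):
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 5)),      # two well-separated groups
               rng.normal(3, 1, (50, 5))])

k, B = 2, 200
observed = kmeans_objective(X, k)
null = np.empty(B)
for b in range(B):
    Xp = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])
    null[b] = kmeans_objective(Xp, k)

# Small p-value: the observed clustering is tighter than expected under the null.
p_value = (1 + np.sum(null <= observed)) / (B + 1)
print(f"observed inertia {observed:.1f}, p = {p_value:.3f}")
```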
School of Electrical Engineering and Computer Science
Knowledge Search Made Easy: Effective Knowledge Graph Summarization and Applications
The rising Big Data tide requires powerful techniques for effectively searching out useful knowledge in information systems such as knowledge bases and knowledge graphs. Accessing and searching complex knowledge graphs is difficult for end users due to query ambiguity, data heterogeneity, and sheer scale. We propose to develop effective knowledge summarization techniques to make the knowledge search process easy for end users. Knowledge graph summaries not only help users understand complex knowledge data and search results, but can also suggest reasonable queries and support fast knowledge search. Our research will benefit a number of knowledge discovery applications, including web and scientific search, social network analysis, cyber security, and health informatics.
Department of Statistics and Biostatistics
A Sequential Split-Conquer-Combine Approach for Analysis of Big Spatial Data
The task of analyzing massive spatial data is extremely challenging. In this paper we propose a sequential split-conquer-combine (SSCC) approach for the analysis of dependent big data and illustrate it, with theoretical support, using a Gaussian process model. The SSCC approach can substantially reduce computing time and computer memory requirements. We also show that the SSCC approach is oracle in the sense that the result it produces is asymptotically equivalent to the one obtained from performing the analysis on the entire data set at once. The methodology is illustrated numerically using both simulation and a real data example of a computer experiment on modeling room temperatures.
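The sketch below illustrates the split-conquer-combine idea on the easy independent-data case of ordinary least squares, where combining per-block sufficient statistics recovers the full-data fit exactly; the paper's contribution is handling the much harder dependent Gaussian-process setting.

```python
# Hedged sketch of split-conquer-combine on a linear model: each block
# contributes only small sufficient statistics, which combine exactly.
import numpy as np

rng = np.random.default_rng(42)
n, p, n_blocks = 100_000, 5, 10
X = rng.standard_normal((n, p))
beta = np.arange(1, p + 1, dtype=float)
y = X @ beta + rng.standard_normal(n)

XtX = np.zeros((p, p))
Xty = np.zeros(p)
for Xb, yb in zip(np.array_split(X, n_blocks), np.array_split(y, n_blocks)):
    XtX += Xb.T @ Xb        # "conquer": per-block statistics, small memory
    Xty += Xb.T @ yb

beta_hat = np.linalg.solve(XtX, Xty)   # "combine": identical to full-data OLS
print(np.round(beta_hat, 3))
```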
Information Sciences and Technology
Clustering Distributions at Scale: A New Tool for Data Sciences
We introduce a fast and parallel tool for clustering large-scale discrete distributions under the optimal transport distance. The significant computational cost of optimal transport has left machine learning problems on such unstructured data almost untouched until now. Our proposed optimization method resolves the scalability bottleneck of previous methods and is thus readily applicable to analyzing large distributional datasets without first specifying a parametric form that the data distributions must follow.
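To make the per-pair cost concrete, here is a minimal entropy-regularized (Sinkhorn) optimal transport sketch between two discrete distributions; the paper's solver differs, but this is the distance whose expense motivates the work.

```python
# Minimal Sinkhorn sketch: entropy-regularized optimal transport between
# two discrete distributions over point supports; illustrative only.
import numpy as np

def sinkhorn(a, b, C, eps=0.1, n_iter=500):
    """a, b: probability vectors; C: pairwise ground-cost matrix."""
    K = np.exp(-C / eps)                 # eps too small underflows K
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]      # approximate transport plan
    return np.sum(P * C)

rng = np.random.default_rng(0)
xs, ys = rng.standard_normal((8, 2)), rng.standard_normal((12, 2)) + 1.0
C = np.linalg.norm(xs[:, None, :] - ys[None, :, :], axis=-1) ** 2
C = C / C.max()                          # normalize costs for stability
a = np.full(8, 1 / 8)
b = np.full(12, 1 / 12)
print(sinkhorn(a, b, C))
```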
Electrical Engineering and Computer Science
Learning Tailored Risk Scores from Large-Scale Datasets
Risk scores are simple models that let users assess risk by adding, subtracting and multiplying a few small numbers. These models are widely used in medicine and crime prediction but difficult to learn from data because they need to be accurate, sparse, and use integer coefficients. We formulate the risk score problem as a mixed integer non-linear programming problem, and present a cutting-plane algorithm to solve it for datasets with large sample sizes.
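For intuition about what a risk score is, the sketch below rounds penalized logistic regression coefficients onto small integer "points"; this crude rounding is only a stand-in for the paper's cutting-plane MINLP, which optimizes accuracy, sparsity, and integrality jointly.

```python
# Illustration of a risk score: small integer points per feature, summed
# to assess risk. Rounding an L1 logistic model is a crude baseline, not
# the paper's method.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=6, random_state=0)
model = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X, y)

# Map real-valued weights onto integer "points" in [-3, 3].
scale = 3.0 / np.abs(model.coef_).max()
points = np.round(model.coef_[0] * scale).astype(int)
print("points per feature:", points)

# Users assess risk by adding points; higher bins should carry higher risk.
score = X @ points
edges = np.quantile(score, [0.25, 0.5, 0.75])
bin_idx = np.digitize(score, edges)
for i in range(4):
    print(f"score quartile {i}: observed event rate {(y[bin_idx == i]).mean():.2f}")
```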
Mechanical and Energy Engineering
Big Data Mining Methods for Accurate Spatial Interpolation of Ozone Pollution
This paper explains the importance of using big data methods to extract accurate spatial interpolation functions for ozone pollution prediction.
Media Arts and Sciences
Large-scale analysis of novice programmer trajectories in an open-ended programming community
This white paper outlines some of the opportunities and challenges in analyzing trajectories of young novice programmers as they create, share, and remix media-rich programming projects, as well as participate socially in the Scratch online community (https://scratch.mit.edu). Scratch is open-ended by design, where anyone with a web-browser can create a wide variety of programming projects, ranging from games to science-simulations, from interactive stories to computational music programs. This open-ended context poses a number of challenges for the large-scale analysis and measurement of learning outcomes. Addressing these challenges hold promise not just for understanding the use of Scratch as a learning environment, but also, as the learn-to-code movement in the United States and elsewhere gathers momentum, methods and strategies formulated for Scratch data-research has the potential to be useful for research on other similar tools and environments that teach young people programming.
Towards Open World Recognition
With the advent of rich classification models and high computational power, visual recognition systems have found many operational applications. Recognition in the real world poses multiple challenges that are not apparent in controlled lab environments: the datasets are dynamic, novel categories must be continuously detected and then added, and at prediction time a trained system has to deal with myriad unseen categories. Operational systems also require minimal downtime, even to learn. To handle these operational issues, we present the problem of Open World Recognition and formally define it. We prove that thresholding sums of monotonically decreasing functions of distances in a linearly transformed feature space can balance “open space risk” and empirical risk. Our theory extends existing algorithms for open world recognition.
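A toy sketch of the core idea: a nearest-class-mean classifier that rejects inputs far from every known class (bounding open space risk) and can incorporate newly detected categories incrementally; the threshold and means here are illustrative, not the paper's learned functions.

```python
# Toy open-world classifier: predict the nearest known class, reject as
# "unknown" beyond a distance threshold, and add new classes on the fly.
import numpy as np

rng = np.random.default_rng(0)
means = {"cat": np.array([0.0, 0.0]), "dog": np.array([4.0, 4.0])}
threshold = 2.5                      # beyond this, claim "unknown"

def predict(x):
    dists = {c: np.linalg.norm(x - m) for c, m in means.items()}
    c = min(dists, key=dists.get)
    return c if dists[c] <= threshold else "unknown"

print(predict(np.array([0.3, -0.2])))   # near "cat"   -> cat
print(predict(np.array([9.0, -7.0])))   # far from all -> unknown

# Novel categories detected at deployment can be added incrementally:
means["fox"] = np.array([9.0, -7.0])
print(predict(np.array([8.5, -6.5])))   # now recognized as fox
```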
Computational Social Science
Replicating Cyber-attack Patterns of Behavior using Bipartite Network Analysis and Agent-Based Modeling
Introducing a method of evaluating cyber traffic behavior via bipartite graph analysis and implementing agent-based modeling to simulate and test network capability.
Parsimonious model selection in genome-wide association studies
This white paper sketches an issue with model selection in multiple regression analysis of genome-wide association studies. Based on our current research, we suggest a remedy to perform these large analyses on a desktop machine.
System and Information Engineering
Maintained Individual Data Distributed Likelihood Estimation (MIDDLE)
The Maintained Individual Data Distributed Likelihood Estimation (MIDDLE) paradigm will construct and validate a revolutionary model for accomplishing health-science human-subjects research with networked devices. Under the MIDDLE paradigm, data can be privately maintained by participants on their personal devices and never revealed to researchers, while statistical models are fit and scientific hypotheses are tested.
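A hedged sketch of the core mechanic: each device evaluates the likelihood of a proposed parameter on its own data and returns only that scalar, and the researcher optimizes the sum. A Gaussian mean model and simulated devices stand in for a real study.

```python
# Sketch of distributed likelihood estimation: raw data never leave the
# simulated "devices"; only per-device log-likelihood values are shared.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(7)
# 30 participants, each holding 20-60 private observations on-device.
devices = [rng.normal(5.0, 2.0, size=rng.integers(20, 60)) for _ in range(30)]

def local_negloglik(theta, data):
    # Runs on the device; only the scalar result ever leaves it.
    return -norm.logpdf(data, loc=theta, scale=2.0).sum()

def total_negloglik(theta):
    return sum(local_negloglik(theta, d) for d in devices)

fit = minimize_scalar(total_negloglik, bounds=(0, 10), method="bounded")
print(round(fit.x, 3))   # recovers the mean without pooling raw data
```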
The Extreme Value Machine
This paper describes a scalable, non-linear model called the Extreme Value Machine (EVM), an analog to the Support Vector Machine (SVM) derived from statistical Extreme Value Theory. The EVM is far more scalable than a kernelized SVM, exhibits comparable accuracy on closed set datasets (where all classes are known at test time), and avoids the need for a parameter grid search. This allows the EVM model to scale to large datasets that are computationally infeasible for non-linear SVMs. Moreover, unlike SVMs, our EVM model performs well in the open-set regime (when unknown classes are present at test time), achieving state-of-the-art results on open-set datasets.
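In the spirit of the EVM, the simplified sketch below fits a Weibull distribution to the smallest distances between a class's points and other-class points, then converts a new point's distance into an inclusion probability; the paper's per-point margin models and tail-size choices are not reproduced here.

```python
# Simplified EVM-style sketch: a per-class Weibull fit over nearest
# other-class distances yields a probability of class inclusion that
# decays in open space. Illustrative approximation only.
import numpy as np
from scipy.stats import weibull_min

rng = np.random.default_rng(0)
pos = rng.normal(0, 1, (100, 5))            # points of the class of interest
neg = rng.normal(4, 1, (200, 5))            # everything else seen in training

# Tail sample: each positive point's distance to its nearest negative.
margins = np.linalg.norm(pos[:, None] - neg[None, :], axis=-1).min(axis=1)
shape, loc, scale = weibull_min.fit(margins, floc=0)

def inclusion_prob(x):
    d = np.linalg.norm(pos - x, axis=1).min()      # distance to the class
    return 1.0 - weibull_min.cdf(d, shape, loc=loc, scale=scale)

print(inclusion_prob(np.zeros(5)))          # near the class  -> high
print(inclusion_prob(np.full(5, 10.0)))     # open space      -> near zero
```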
Department of Electrical Engineering & Computer Science
Collaborative data science without violating privacy: a case study from genome research
Data privacy is an important issue for many data science disciplines involving human subjects. Here we take genome research as a representative case study to illustrate the privacy concerns and countermeasures. We develop a novel cryptography-based method to enable collaborative studies via meta-analysis without violating privacy. We also show the relevance of this method to the wider research community of data science.
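As a hedged toy of one cryptographic building block that enables such collaboration, the sketch below uses additive secret sharing so that sites contribute summary statistics to an aggregate without revealing them individually; the paper's actual protocol may differ.

```python
# Additive secret sharing toy: each site splits its statistic into shares
# that individually look random; only the aggregate is reconstructed.
import secrets

def share(value, n_parties, modulus=2**61 - 1):
    shares = [secrets.randbelow(modulus) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % modulus)
    return shares   # any n-1 shares reveal nothing about the value

modulus = 2**61 - 1
site_stats = [1042, 2381, 977]           # e.g. hypothetical per-site counts
all_shares = [share(s, 3, modulus) for s in site_stats]

# Each party sums the shares it received; only the total is reconstructed.
partials = [sum(col) % modulus for col in zip(*all_shares)]
print(sum(partials) % modulus, "==", sum(site_stats))
```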
Why Data Science Needs to Attend to Contextual Behavior: The Case of Crisis Informatics
Crisis informatics is the study of how people converge, spread information, and cooperate around the tasks they deem important on social media during crises. The socio-behavioral focus of crisis informatics necessitates that research methodology account for the social context of users' activity. On the other hand, the volume of social media data requires the use of data science approaches, which in their current form often decontextualize social activity. I propose several methodological innovations that would propel big data methods toward attending to the highly situated and contextual nature of social activity in crises.
Division of the Humanities and Social Sciences
Detecting Habitual Behavior in Natural Consumer Choice Data
Habit is a process by which a stimulus automatically generates an impulse toward action, based on a learned association between stimulus and response. In this project we seek to identify habitual choices and shifts from habit to model-directed behavior using big and broad data sets of natural consumer decision making such as online shopping, online stock trading, and commuter route choice.
Data Mining to Predict Healthcare Utilization in Managed Care Patients
Systematic association mining of clinical attributes from the electronic health records of adult primary care patients to discover predictors of high healthcare utilization.
Nancy (Xin Ru) Wang
Computer Science and Engineering
Decoding neural signals with natural multimodal data
This paper outlines our project that will use deep and unsupervised techniques to analyze a large multimodal natural (non-experimental) dataset, including simultaneous video, audio and ECoG/EEG signals for computational neuroscience and brain-computer interface applications. This project combines techniques from multiple fields in order to fully leverage the multimodality of the dataset.
Kin Gwn Lore
Pattern Discovery from Large-scale Computational Fluid Dynamic Data using Deep Learning
This paper outlines our research in solving an inverse fluid dynamics design problem using large-scale simulation data. The forward problem of sculpting fluid flow by placing a set of pillars in a fluid channel has been simulated and experimentally validated. We now explore the applicability of machine learning models (specifically deep learning) in the inverse problem to serve as a map between user-defined flow shapes and the corresponding sequence of pillars in the design of microfluidic devices.
Computer Science and Engineering
Global Monitoring of Inland Water Dynamics: A Data-driven Approach
Freshwater, which is available only in inland water bodies such as lakes, reservoirs, and rivers, is becoming increasingly scarce across the world, and this scarcity poses a global threat to human sustainability. Global monitoring of inland water bodies is necessary for policy-makers and the scientific community to address this problem. The promise of data-driven approaches, coupled with the availability of remote sensing data, presents opportunities as well as challenges for global monitoring of inland water bodies. My research aims at developing predictive models that address the challenges in analyzing remote sensing data for creating the first global monitoring system of inland water dynamics.
Incorporation of Genomic Data in US Cattle Breeding and Production
We are analyzing appropriate practices for the routine incorporation of genetic information in both the selection and care of US livestock. The rapid decrease in the cost of genetic data from chips or short-read sequencing has led to the accumulation of large, profitable data sets that provide dense quantification of the genetic component of livestock production. Unlike human or model-organism genetics, the regulatory environment surrounding livestock genomics has enabled the immediate application of cutting-edge genomic technology in a commercial setting. Low-cost genomic data informs high-leverage decision making at farms of varying size and technical expertise, enabled by widely shared central genomic data repositories that inform genomic breeding models. We seek to develop tools that lower costs and allow genomic information to provide value across the whole livestock production cycle, from reproduction to immunity, ecological sustainability, and carcass quality.
Botany and Plant Pathology
Automated website generation for reproducible and shareable data science
This paper describes the potential benefits and challenges of using literate programming for embedded documentation in data science projects and introduces a new R package under development that generates website representations of project folders. The package uses the names of files and folders, together with options specified in configuration files, to infer a menu hierarchy and organize the content of files. Literate programming documents are executed and their output is integrated into the website along with PDF files, images, and other HTML files in the project.
Department of Computer Science
An Integrated Transport Solution to Big Data Movement in High-performance Networks
We propose and develop an integrated transport solution to big data movement in high-performance networks in support of data- and network-intensive scientific applications.
Determining Periodicity In Data
In the last few years, high-throughput technologies have enabled the efficient and inexpensive collection of massive amounts of data. In many cases the data are high dimensional and generated by some nonlinear system. In such a situation one is interested in both the geometry of the data and the action of the unknown nonlinear system. One of the most fundamental problems in analyzing nonlinear systems is determining whether the system is periodic. However, the past few decades of dynamical systems theory have shown that nonlinear systems can exhibit extremely complex behavior with respect to both system variables and parameters. Such complex behavior, proven in theoretical work, must be contrasted with the realities of applications: in modeling multiscale processes, for instance, measurements may be of limited precision, parameters are rarely known exactly, and nonlinearities are often not derived from first principles. This contrast suggests that extracting a robust characterization of the periodic behavior is of greater importance than a detailed understanding of its fine structure. For such a characterization, we propose an approach that incorporates Takens' embedding theorem, persistent homology, and diffusion maps.
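The first step of such a pipeline is a Takens delay embedding, sketched below with illustrative delay and dimension choices; a periodic signal traces a closed loop in the embedded point cloud, which persistent homology can then detect as a robust 1-dimensional cycle.

```python
# Minimal Takens delay-embedding sketch: unfold a scalar time series into
# a point cloud on which persistent homology and diffusion maps operate.
# The delay tau and dimension are illustrative choices.
import numpy as np

def delay_embed(x, dim=3, tau=8):
    """Return the (len(x) - (dim-1)*tau) x dim delay-embedded point cloud."""
    n = len(x) - (dim - 1) * tau
    return np.column_stack([x[i * tau : i * tau + n] for i in range(dim)])

t = np.linspace(0, 20 * np.pi, 2000)
x = np.sin(t) + 0.1 * np.random.default_rng(0).standard_normal(len(t))

cloud = delay_embed(x, dim=3, tau=8)
# For this noisy sinusoid the cloud lies near a closed loop; the loop's
# persistence is the robust signature of periodicity sought in the text.
print(cloud.shape)
```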
Department of Statistics and Biostatistics
Advanced Data Analytics of Railroad Infrastructure Degradation to Improve Transportation Safety
This white paper introduces some possible models to capture the track geometry degradation.