Zachary Foster
Zane Goodwin
Vanessa Gray
Mayank Kejriwal
James Morton
Alexandra Munoz
Shelly Trigg
Robert Tunney
Yisha Yao

Zachary Foster

Botany and Plant Pathology
Oregon State University

Poster 47

X-team 9

Whitepaper:
Automated website generation for reproducible and shareable data science

Describes the potential benefits and challenges of using literate programming for embedded documentation in data science projects and introduces a new R package under development that generates website representations of project folders. It uses the names of files/folders and options specified in configuration files to infer a menu hierarchy and organize the content of files. Literate programming documents are executed and their output is integrated into the website along with PDF files, images, and other HTML files in the project.

Bio:

I study rhododendron rhizosphere ecology using amplicon metagenomics. I am also interested in creating R packages to make it easier to maintain digital lab notebooks and conduct amplicon metagenomics research.

Interest areas:
Microbial soil ecology, Metabarcoding, R package development

Zane Goodwin

Department of Medicine
Washington University in St. Louis

Poster 84

X-team 9

Whitepaper:
Identifying Antibiotic Resistance Genes In Hospital-Acquired Bacterial Infections

Hospital-acquired bacteria can cause a wide range of infections that, when they cannot be treated with antibiotics, can be fatal, and they therefore pose a significant threat to the health and recovery of hospitalized individuals. Identifying antibiotic resistance genes (ARGs) in the genome sequences of the bacteria behind hospital-acquired infections can reveal the genes driving antibiotic resistance. Straightforward phylogenetic methods have been employed to solve this problem, but the computational complexity of phylogenetic methods scales exponentially with the number of genomes being compared. Compositional methods have been introduced to help reduce the search space for ARGs in bacterial genome sequences, but they cannot pinpoint the exact location of ARGs in a genome. Hence, a combination of phylogenetic and compositional methods is needed in order to identify ARGs with a high degree of specificity. Solving the big data problem of how best to identify ARGs in large bacterial genomics datasets will help to answer larger biological questions about how ARGs appear and evolve in bacterial populations and provide practical insights that can help prevent outbreaks of antibiotic-resistant bacteria in hospitals.

Bio:

I have completed my fourth year in the Computational and Systems Biology program at Washington University in Saint Louis. My thesis research involves the study of evolutionary selective pressure acting on genes that are required for skin barrier development, and I maintain an active interest in studying co-evolutionary trends between skin microbes and their human hosts. I address these issues using computational statistical models of genome, gene and protein evolution, data mining and next-generation sequencing.

Interest areas:
Bioinformatics, Machine Learning, Statistics

Vanessa Gray

Genome Sciences
University of Washington

Poster 81

X-team 9

Whitepaper:
Harnessing large-scale mutagenesis data to improve protein engineering

This proposal aims to couple machine learning with large-scale mutagenesis datasets to accurately predict quantitative mutational effects. The resulting computational tool will have the capability to predict function-enhancing mutations, and will facilitate protein engineering innovation.

Bio:

I am a third year Ph.D. student who is passionate about using machine learning to uncover the relationship between protein sequence and function.

Interest areas:
Genomics, Machine learning, Protein mutations

Mayank Kejriwal

Computer Science
University of Texas at Austin

Poster 77

X-team 9

Whitepaper:
Unsupervised Instance Matching on Schema-free Linked Data

Linked Data is a global effort that has resulted in the publication of knowledge bases such as Freebase and DBpedia. This white paper describes a longstanding Artificial Intelligence problem called instance matching, and its emergence as a Big Data problem in the Linked Data community. It also provides a high-level outline of an architectural solution being developed by the author as part of his research efforts.

Bio:

I'm currently a doctoral candidate at the University of Texas at Austin, advised by Daniel P. Miranker. My work concerns a longstanding Artificial Intelligence problem called instance matching, which has recently emerged as a Big Data problem in the Semantic Web community. For more details on my research, please visit my website at kejriwalresearch.azurewebsites.net

Interest areas:
Instance Matching, Linked Data, Semantic Web

James Morton

Computer Science
University of California San Diego

Poster 6

X-team 9

Whitepaper:
Uncovering the Unknown: A New Approach in Analyzing Microbiome Data

In microbiome studies, the process of normalizing samples is still the subject of intense debate. We argue that the most straightforward approach for normalizing samples is to calculate the proportions of species for each sample. In this paper we introduce a novel statistic for estimating multinomial proportions when the total number of possible species is unknown. We show why proportions estimated from observed species abundances are poor estimators of the true proportions and how coverage estimators can improve the accuracy of those estimates.

Bio:

I am interested in developing theory and applying machine learning techniques to analyze microbial environments.

Interest areas:
Computational Biology, Machine Learning, Microbial Ecology

Alexandra Munoz

Environmental Medicine
New York University

Poster 8

X-team 9

Whitepaper:
Re-evaluating the paradigmatic presuppositions of molecular biology in the context of big data

Every piece of information that is extracted in data analysis also assumes a model. Without the model the data would not tell you anything: there would be no context through which to relate the variables and the magnitude of the values would be meaningless. The molecular landscape is modeled in a DNA-centric manner that prioritizes certain types of information (singularities) over others (dynamic processes) and in turn constructs a system in which certain avenues of causality are not being fully integrated into the model. In turn, this paper critiques the current model and points to a direction for alternative exploration. The motivation for this work is to model the complexity of cancer in a new way, in an effort to expand the search area for the solution to cancer.

Bio:

I am deeply interested in evaluating the relevance of currently used biological models in the context of modern informatics. Information is only as relevant as the model within which it is positioned. Since the discovery of DNA, our capacity to see into the molecular landscape has changed greatly, and revision of the biological model may be necessary. That is, perhaps the way we position information in molecular biology is dated due to its attachment to a model whose foundation rests in a time when our view of the molecular world was highly limited relative to what is visible now. Perhaps overcoming such limitation is not only a matter of adding to the old model, as we have done, but instead of fundamentally redefining it in accordance with the massive amount of information we have acquired - information that yields insight into the more foundational causal mechanisms in the realms of chemistry and quantum physics.

Interest areas:
Genomics, Toxicology, Paradigmatic Assumptions

Shelly Trigg

Biological Sciences
University of California, San Diego

Poster 58

X-team 9

Whitepaper:
Finding biological relevance in large-scale protein network studies

Proteome-wide protein-protein interaction screening has recently been made possible by coupling pooling strategies with next-generation sequencing. The unprecedented quantity of data being generated makes it infeasible to efficiently retest every interaction found for biological relevance. This work discusses strategies for prioritizing interactions found in interactome screens based on their projected phenotypic influence.

Bio:

I'm interested in network connectivity and understanding how differences in protein-protein interaction networks contribute to phenotype. To study this, I joined the Ecker lab in November 2011 to develop a next-generation sequencing technology capable of identifying proteome-wide protein-protein interactions. I have since become a PhD candidate in the Biological Sciences program at UCSD and continue generating interaction data to build comprehensive interactome maps. The goal of my research is to illustrate the role of interactomics in the central dogma by identifying new drivers of biological processes and potential drug targets, and enabling a new assessment of protein essentiality. I hope to inspire a systems-level perspective by integrating the visual representations and software I generate into higher-education biology curricula.

Interest areas:
Network Biology, Genomics

Robert Tunney

Computational Biology
University of California, Berkeley

Poster 61

X-team 9

Whitepaper:
Accurate Site Assignment of Ribosome Footprint Data

This paper proposes a method for accurate A site assignment of complete ribosome footprint data. This task is performed in order to analyze codon level regulatory features of translation. Current methods use heuristic A site assignment rules, based on the canonical length of ribosome footprints and position of E/P/A sites in those canonical size footprints. This paper proposes an expectation maximization algorithm to learn the parameters that generate footprint data about an A site. It increases the amount of usable data by performing maximum likelihood site assignment for ambiguous reads, which were previously discarded.

Bio:

I'm a Ph.D. student in computational biology at UC Berkeley. My background is mostly in biology, CS, math and statistics. I work in genomics, particularly in computational analysis of RNA sequencing experiments.

Interest areas:
Genomics, RNA-Sequencing experiments, Translation biology

Yisha Yao

Biochemistry
Rutgers, the State University of New Jersey

Poster 89

X-team 9

Whitepaper:
Combine spectral learning with advanced force field for protein structure prediction

Describes a method that combines machine learning with an advanced force field to improve protein structure prediction.

Bio:

I am a second-year PhD student focusing on structural bioinformatics. Currently, I am trying to implement and extend state-of-the-art statistical methods to analyze massive data sets from NMR experiments. In 2014, our group is involved in organizing the 11th Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction (CASP11), particularly the Contacts-Assisted Category. I simulated NMR data based on the solved structures and generated the ambiguous residue-residue contacts. Such ambiguous contacts were assumed to provide valuable information for structure prediction, and they did: contacts-assisted modeling performed significantly better than regular modeling. After assessing the results from different groups, we will know the current progress in the field of structure prediction and in what direction efforts would be most productive. This work will result in a paper for the journal Proteins: Structure, Function and Bioinformatics. Among all the predictors, two groups did uniformly better for all the targets. I am learning from their methods and developing a proposal which integrates algorithmic tools (e.g. machine learning) and statistical models for tertiary structure prediction. Meanwhile, I am working on another project which compares different structure refinement methods: a physics-based method (AMBER refinement) and an informatics-based method (Rosetta refinement). Hopefully, I can get some inspiration from this work. I hope to get the chance to attend this workshop. I believe that meeting all those wonderful researchers and talking with other graduate students would help my research a lot.

Interest areas:
computational biology/bioinformatics, math/statistics