# Introduction to Computational Molecular Biology

Together with Genome 541, a two-quarter introduction to protein and DNA sequence analysis and molecular evolution, including probabilistic models of sequences and of sequence evolution, computational gene identification, pairwise sequence comparison and alignment (algorithms and statistical issues), multiple sequence alignment and evolutionary tree construction, comparative genomics, and protein sequence/structure relationships. These are the central computational methods required to determine the "periodic table of biology", i.e. the list of proteins and their evolutionary relationships, which can be regarded as the first stage in the growth of molecular biology into a quantitative science. Moreover, the statistical and algorithmic methods used (which include maximum likelihood estimation, hidden Markov models, dynamic programming) have wide applicability in other areas of computational & mathematical biology.

The entire course grade is based on the homework assignments, which are due weekly (more or less). No tests or exams.

- The homework assignments involve writing programs for data analysis, and running them on a computer you have access to (we cannot provide computers). We don't require a specific language, since it is not practical for us to grade your code, just the output from running your programs. However it is important to use a language that allows you to write programs that will run fast on large datasets; ideally a compiled language such as C or C++. You may be able to get by with an interpreted language such as Perl or Java (some people have), however there is a definite risk that programs in these languages will take very long to run -- which means that (i) you will need access to a very fast computer to run them on, and/or (ii) you will need to get the program written early enough that you can allow 2 or 3 days for the actual run and still be able to turn in the assignment on time.
- Homework is due by 11:59 pm on the indicated date. After that it will be accepted, but penalized. Specifically, each assignment is worth 10 points, from which 1 point will be deducted for each day (or fraction thereof) that you turn it in late. The maximum deduction for being late is 6 points (even if you are more than 6 days late). If you get less than 4 points on an assignment, you are allowed to redo it and take the new score (which will be 4, i.e. 10 - 6, if there are no mistakes).
- It is OK to run your program on someone else's input datafile, and compare outputs to see if you get the same results. However it is not OK to share programs, or to get someone else to debug your program. A key part of the course is being able to write and debug your own programs for data analysis.

**Course Web site: **http://bozeman.mbt.washington.edu/compbio/mbt599/

**Sample Schedule:**

Week |
Tues |
Thurs |

1 | Finding exact matches in sequences. Living organisms as imperfect replication machines. | Theory of evolution & tree of life; 'artificial life'. Mutations as molecular basis for evolutionary change. CpG mutations/CpG islands. Segmental changes. |

2 | Mutation fates. Neutral/nearly neutral theories. Mutation & substitution rates. Generalities on algorithms for biological data; directed graphs. | Depth structure of directed acyclic graphs (DAGs); trees and linked lists. Dynamic programming on weighted DAGs. |

3 | Algorithmic complexity. Maximal-scoring sequence segments. | Edit graphs & sequence alignment. Smith-Waterman algorithm. |

4 | Linear space algorithms. General & affine gap penalties. | Profiles. Probability models on sequences; review of basic probability theory: probability spaces, conditional probabilities, independence. Comparing alternative models. Failure of equal frequency assumption for DNA. Site models. Examples: 3' splice sites, 5' splice sites, protein motifs. Site probability models. Comparing alternative models. |

5 | Weight matrices for site models. Weight matrices for splice sites in C. elegans. Score distributions. Limitations of site models (variable spacing, non-independence). | Hidden Markov Models: introduction; formal definition. HMM examples: -- splice sites; 2-state models; 7-state prokaryote genome model. |

6 | Probabilities of sequences; computing HMM probabilities via associated WDAG. HMM Parameter estimation: Viterbi training, Baum-Welch (EM) algorithm; specialized techniques. | Detection of evolutionarily conserved regions using Phylo-HMMs. |

7 | Detection of evolutionarily conserved regions using Phylo-HMMs (cont'd) | Detection of evolutionarily conserved regions using Phylo-HMMs (cont'd). Information theory: entropy. Information inequality. |

8 | Distributions with maximum entropy. Boltzmann distribution. Coding theory/data compression, uniquely decodable codes. Kraft inequality, entropy & expected code length. | Information. MDL principle and overfitting. Relative entropy. Relative entropies of site models. Sequence logos. |

9 | Maximal scoring segments. D-segments, relationship to 2-state HMMs. Random variables. | Exact & approximate probability distributions for weight matrix scores. Maximal scoring segments. |

10 | Karlin-Altschul theory. |