Biostat 551

Lab 3

Winter 2002

QTL Mapping

Introduction: In this lab you will use QTL cartographer to perform various QTL mapping analyses (Lander & Botstein, Composite Interval Mapping, and Multiple Interval Mapping). Then, you will use Solar to run a variance-components-type analysis on one of the GAW data sets.

Problem 1: (uses files lab3.map and lab3.cro—you can get these from the class web page at http://depts.washington.edu/statgen/Data/Class_551_2002/)

Researchers were interested in mapping genes that control the taproot thickness of the fictional Zuzububu plant. The Zuzububu plant has four chromosomes, each of which is about 170 cM long. They began their experiment with two inbred lines—one line has extremely thin taproots and the other has extremely thick taproots. To create a marker map, they chose 16 markers per chromosome (roughly evenly spaced) for which the two original lines differed in allelic type. The lab3.map file contains information on the markers. A table in lab3.map shows that the inter-marker distances are about 10 cM, with the end markers being about 5 cM from the ends of the chromosomes.

To form the experimental population the researchers crossed the two inbred lines, forming an entirely heterozygous F1 generation. They randomly mated the F1’s to produce 200 F2 individuals. The F2 generation is the experimental population. Data on the F2 generation is in the file lab3.cro. The information in the .cro file looks like a series of blocks of numbers. A typical block looks like:

3 1

0 1 1 1 1 1 1 2 2 2 2 1 1 1 1 1

2 2 2 2 2 2 1 1 1 0 0 0 0 0 0 0

2 2 1 1 0 0 0 0 0 0 0 0 1 1 0 0

1 1 1 1 1 1 1 1 1 0 0 0 0 1 1 2

1.64563479534347

Here’s how to read it: The first line contains the numbers 3 and 1. So, this is the data on trait 1 in individual 3 (we have only one trait, so each block will have a 1 in the trait column). The next four rows each have 16 entries—these are the 16 markers on the 4 chromosomes. Each entry is a number between 0 and 2 and is the count of the number of alleles the individual has at that marker that came from the thick-rooted parental line. The final number in this entry (the 1.645…) is a measure of the taproot thickness of that individual.

Make a directory for your qtlcartographer files. I called mine qtlcart. To make the directory, type %mkdir filename. Save the .cro and .map files listed on the class web page into this directory.

a. Look at the summary statistics for the data set.

First, create a file that contains some summary statistics of the data. Type Qstats.

You should now see a menu of options. Look at options 1-4. You want their values to be “lab3.cro”, “lab3.qst”, “lab3.log” and “lab3.map”. If you need to change them (i.e. if they aren’t already “lab3.*“, then choose option 7 (Change Filename Stem) and change the stem to lab3. Once you change the filename stem, you should be back at the original menu. To run Qstats, choose 0. The results are in a file named “lab3.qst”. Take a look (e.g. by typing more lab3.qst or less lab3.qst).

The first thing in the lab3.qst file is a table of information. M(1) through M(4) are the first four moments of the trait distribution from the F2 plants. If you continue scrolling down, you will come to a histogram of the trait values. After that is a little table with information about maximum and minimum values and such. The next two tables deal with missing trait and marker information. The first of these summarizes the data by marker (how many individuals have information at each marker). The second of these tables summarizes the data by individual (how many markers has each individual been scored at). Since our researchers were extremely diligent, we have no missing data.

The last item in the .qst file is a table that examines whether segregation appears to be Mendelian at each marker. There are two ways to test this: by a chi-squared goodness-of-fit test or by a likelihood ratio test. This table contains the test statistics for both tests. Both statistics are (asymptotically) chi-squared with one degree of freedom.

Your output for this section should be a copy of the histogram of trait values.

b. Run a simple one-marker at a time linear analysis.

The command LRmapqtl will bring up a menu for fitting a simple linear regression model to the data. This is done one by one for each marker with the number of thick taproot line alleles as the independent variable and the trait value as the dependent variable. The error term is assumed to have a normal distribution with mean zero. At the LRmapqtl menu, type 0 to run this analysis. The results are in a file called lab3.lr. Take a look at this file.

You will turn in a brief statement about where it looks like the QTL might be.

c. Run a stepwise regression on the data.

We are running a stepwise regression for two reasons: first, it’s fun, and second, the stepwise regression will tell us which markers to include when we attack this problem using composite interval mapping.

To run a stepwise regression, type SRmapqtl. You might need to change some of the options here. At option 6 in the menu, the choices are FS (forward stepwise), BS (backward stepwise), and FB (forward-backward). You want to choose FB. On options 8 and 9, choose 0.05. When these options have been set, run the regression by typing 0. The output is in the file lab3.sr.

Your output for this section should include the list of markers from the stepwise regression.

d. Use interval mapping techniques to search for the QTL’s.

Both Lander & Botstein and Composite interval map techniques are options in Zmapqtl. To get started, type Zmapqtl. Option 8 lets you choose the type of analysis. Here they are:

Use all the markers (not just the ones from stepwise regression) to control for genetic background. This is model 1 from Zeng (1994).
Use all unlinked markers to control for the genetic background. This is model 2 from Zeng (1994).
Don’t use any markers to control for the genetic background. This is Lander and Botstein’s interval mapping method.
One marker from each chromosome (except the one being tested) is included to control for the genetic background. The marker used will be the most significant marker (for that chromosome) in the LRqtl analysis.
Two markers from each unlinked chromosome are included (the two most significant markers from the LRqtl analysis). Also, all linked markers that are at least 10cM from the interval being tested are included (you can change the 10cM to another value using option 13). This method is ad-hoc, but according to the QTL cartographer manual, it works quite well.
This is the composite interval mapping method that uses the top n significant markers from SRmapqtl, where n can be set by option 12. If the stepwise regression listed less than n markers, then of course it uses only the number of markers available from SRmapqtl. In addition, no markers are used that are within a certain distance (set the distance with option 13) of the interval of interest.

d1. Interval Mapping. First, run option 3—the regular Lander and Botstein Interval Mapping approach. The results are in lab3.z, but they aren’t easy to read. Instead, you should make a graph. Here’s how:

First, type Preplot. You shouldn’t need to change any options. This will create a file called lab3.plt that contains all the information for plotting. To make the plots, type gnuplot lab3.plt. This will immediately show you the results from chromosome 1. To see the other chromosomes, hit the return key.

One last thing…Zmapqtl will do a permutation test to assess significance. This is option 14. Permutation tests take a while, so only do 1000 (if this were a real QTL mapping exercise and not just a lab, you would do at least 10,000, but that would take a long time). Don’t do bootstrapping. On my sample data set, it took about six hours to run the 1000 permutations.

d2. Composite Interval Mapping. Type Zmapqtl and change option 8 (model type) to 6. This will give you a composite interval map. Option 13 determines the window size. This should be set to 10. Option 12 (Number of Background Parameters) allows you to set the number of markers you want to include as parameters in your model. You should set this to 5. Type Preplot and then gnuplot to see the results. This will add the composite interval likelihood ratio curve to your plots.

I recommend that you don’t do permutations for the Composite Interval Mapping. I tried it and it took 19 hours to do 1000 permutations.

e. Multiple Interval Mapping

To use multiple interval mapping to search for QTL’s, the command is MImapqtl. Option 9 allows you to choose the number of QTL’s you want to fit. You should run this twice—once with three QTL’s and once with four.

To view your results, run Preplot and then gnuplot.

To print your results, you need type Preplot again and have it print to a postscript file. To do this, choose option 10(Terminal), and write postscript. Execute Preplot by typing 0. Now gnuplot will print to a file called lab3.ps when you type gnuplot lab3.plt.

Your output from this section should be the four graphs (one for each chromosome). These graphs should contain the results of your Interval Mapping, Composite Interval Mapping, and Multiple Interval Mapping analyses. The significance line will have been set by your permutation tests on your interval mapping analysis. Clearly mark on your graph which analysis is which.

f. Summary of What You Should Turn In:

a. A histogram showing the distribution of trait values (from Qstats).

b. A brief summary of the results of a simple linear regression analysis of the data. Specifically, based on the results of LRmapqtl, where do you think the QTL might be? Can you rule out any regions?

c. A list of the “best” markers from SRmapqtl. How do these match up with the results of LRmapqtl?

d. A sentence or two answering the following question: On the multiple interval mapping analysis, what, if anything, is different between the analyses with three and four QTL’s?

e. A plot that shows the results of the interval mapping, composite interval mapping, and multiple interval mapping analyses. You should use the multiple interval mapping analysis with 4 markers.

f. A paragraph comparing the different analyses (in particular interval mapping and composite interval mapping).

g. (Mostly for fun) How many QTL’s do you think control the root thickness in the fictional Zuzububu plant? Where do you think those QTL’s are located?

2. Variance Component Mapping using Solar

For this problem, you will need to use Solar. There are two things you should know about Solar. First, you get into Solar by typing solar and you get out of it by typing exit. Second, if you ever need help, just type doc and a web page containing a list of commands (and other information) will appear. In this lab, we will be going through most of the tutorial listed on that web page. One more thing: many Unix commands will work just fine within Solar. For example, you can use the cd command to move in and out of directories and you can use less to view files.

Setting up your directories & getting your data.

Before starting with Solar, you should make some directories. First, in your statgen directory, create a directory for your solar files. I called mine solardata. Move into this directory. Within your solardata directory, create two more directories called ibddir and mibddir.

From the class web page, you should copy the following files into your solardata directory:

gaw10.ped .......... pedigree file

phen ............... phenotypes file

mrk9 ............... marker file for chromosome 9

map9 ............... map file for chromosome 9

With theses data sets, we will look on chromosome 9 for locus/loci that affect trait q4.

A Simple Polygenic Model

The first model we will look at will be a simple polygenic model. In this model, there is no major gene, just background polygenes. The free parameters are the mean, additive variance, and environmental variance.

To get into solar, type solar. Then produce the following commands:

solar> load pedigree gaw10.ped (this loads the pedigree)

solar> load phenotypes phen (phen contains the phenotypes for our data set)

solar> trait q4 (q4 is the trait of interest in our data set)

solar> polymod (this tells solar to look at phen and get starting parameter values for a polygenic model).

solar> maximize (this tells solar to find the maximum likelihood estimators for the parameter values).

In addition to printing the results to screen, Solar has created a directory called q4 that contains two output files. You can use the Unix less command to look at these, but they don’t contain any information that hasn’t been printed to your screen.

From the results of this analysis, you should report the values of the MLE’s for the mean and additive standard deviation as well as the value of the loglikelihood at the point of maximization.

Polygenic Model with Covariates

One slightly more sophisticated approach is to include covariates in the polygenic model. to do this, type the following:

solar> covariate age sex age*sex

solar> polygenic –screen

At this point, solar has considered three models: the sporadic model (the null model—no genetic component to the trait); the polygenic model with whichever of age, sex and age*sex were significant; and the polygenic model without covariates. The information on these models is in the q4 directory in files spor.out, poly.out, and nocovar.out, respectively.

For this analysis, do the following:

Report which of age, sex, and age*sex should be included as covariates in the polygenic model.
Report the values of the loglikelihood and the polygenic heritability (h2r = the proportion of the variance that comes from polygenic sources) for all three models discussed so far.

Two-Point Analysis

Before you can begin a two-point analysis, you must create ibd matrices for the markers. The following series of commands will do this:

solar> ibddir ibddir (this tells Solar that the ibd directory will be the directory names ibddir)

solar>load marker mrk9

solar> ibd (this creates the ibd matrices for each marker on chrom. 9)

To run the two-point analysis, type:

solar> twopoint (this takes a while to run, maybe 10-15 minutes)

The final output is written to file twopoint.out in the q4 subdirectory.

For this analysis, do the following:

1. Report the locus that produces the highest LOD score.

Multipoint Analysis

For this analysis, you will need multipoint ibd matrices.

solar> mibddir mibddir

solar>load marker mrk9

solar>load map map9

solar>mibd 1 (this takes a long while, maybe half an hour)

The multipoint analysis works as follows: It does an initial scan, computing LOD scores at test positions along the chromosome (you set the distance between test positions in the command interval). Then, if any test positions have LOD scores above a certain threshold (that you set with the finemap command), multipoint will do a finer mapping (1cM scan) around those positions. That initial scan and the fine map constitute the first pass through the chromosome..

When the first pass is completed, if any test position had a LOD score higher than some threshold (specified in the multipoint) command, another pass will be made. The second pass will assume a QTL exists at the site that produced the highest LOD score in the first pass and, conditioned on the presence of that QTL, will search for a second QTL on the chromosome.

Here are the commands:

solar> chromosome 9

solar> interval 5 (the initial scan will have test positions spaced at 5cM)

solar> finemap 1 0.5 (finemap around positions with LOD’s greater than 1 in the first pass and greater than 0.5 in the second pass)

solar> multipoint 2 1 (search for a second QTL if the first pass yields a LOD score of at least 2.0 and search for a third QTL if the second pass yields a LOD score of at least 1.0)

This takes a while to run. The results from the first pass will be in multipoint1out and the results from the second pass will be in multipoint2.out.

Plotting your multipoint graph.

This is simple. Just type:

solar> plot

Your output from this section should include:

A list of any sites that produced a LOD score of over 3.0 in the first pass (if any such site exists).
The site that produced the highest LOD score in the first pass.
The site that produced the highest LOD score in the second pass (if a second pass was made).
The site that produced the highest LOD score in the third pass (if a third pass was made).
The log-likelihood and “variance” scores for the model you think is best. By “variance” scores, I mean the value for h2r (proportion of variance explained by polygenic background), and, if applicable, h2q1 (proportion of variance explained by the first QTL), h2q2 (proportion explained by second QTL), and h2q3 (proportion explained by third QTL). Also, what covariate(s) are included in your model?