Biostat
551
Lab
3
Winter
2002
Introduction: In this lab
you will use QTL cartographer to perform various QTL mapping analyses (Lander
& Botstein, Composite Interval Mapping, and Multiple Interval Mapping).
Then, you will use Solar to run a variance-components-type analysis on one of
the GAW data sets.
Problem
1:
(uses files lab3.map and lab3.cro—you can get these from the class web page at
http://depts.washington.edu/statgen/Data/Class_551_2002/)
Researchers
were interested in mapping genes that control the taproot thickness of the
fictional Zuzububu plant. The Zuzububu plant has four chromosomes, each of
which is about 170 cM long. They began their experiment with two inbred
lines—one line has extremely thin taproots and the other has extremely thick
taproots. To create a marker map, they chose 16 markers per chromosome (roughly
evenly spaced) for which the two original lines differed in allelic type. The
lab3.map file contains information on the markers. A table in lab3.map shows that the inter-marker distances are
about 10 cM, with the end markers being about 5 cM from the ends of the
chromosomes.
To
form the experimental population the researchers crossed the two inbred lines,
forming an entirely heterozygous F1 generation. They randomly mated the F1’s to
produce 200 F2 individuals. The F2 generation is the experimental population.
Data on the F2 generation is in the file lab3.cro. The information in the .cro
file looks like a series of blocks of numbers. A typical block looks like:
3
1
0 1 1 1 1 1 1 2 2 2 2 1 1 1 1 1
2 2 2 2 2 2 1 1 1 0 0 0 0 0 0 0
2 2 1 1 0 0 0 0 0 0 0 0 1 1 0 0
1 1 1 1 1 1 1 1 1 0 0 0 0 1 1 2
1.64563479534347
Here’s
how to read it: The first line contains the numbers 3 and 1. So, this is the data
on trait 1 in individual 3 (we have only one trait, so each block will have a 1
in the trait column). The next four rows each have 16 entries—these are the 16
markers on the 4 chromosomes. Each entry is a number between 0 and 2 and is the
count of the number of alleles the individual has at that marker that came from
the thick-rooted parental line. The final number in this entry (the 1.645…) is
a measure of the taproot thickness of that individual.
Make
a directory for your qtlcartographer files. I called mine qtlcart. To
make the directory, type %mkdir filename. Save the .cro and .map files
listed on the class web page into this directory.
a.
Look at the summary statistics for the data set.
First,
create a file that contains some summary statistics of the data. Type
Qstats.
You
should now see a menu of options. Look at options 1-4. You want their values to
be “lab3.cro”, “lab3.qst”, “lab3.log” and “lab3.map”. If you need to change
them (i.e. if they aren’t already “lab3.*“, then choose option 7 (Change
Filename Stem) and change the stem to lab3. Once you change the filename stem, you should be back at the
original menu. To run Qstats, choose 0. The results are in a file named
“lab3.qst”. Take a look (e.g. by typing more lab3.qst or less lab3.qst).
The
first thing in the lab3.qst file is a table of information. M(1) through M(4)
are the first four moments of the trait distribution from the F2 plants. If you
continue scrolling down, you will come to a histogram of the trait values.
After that is a little table with information about maximum and minimum values
and such. The next two tables deal with missing trait and marker information.
The first of these summarizes the data by marker (how many individuals have
information at each marker). The second of these tables summarizes the data by
individual (how many markers has each individual been scored at). Since our
researchers were extremely diligent, we have no missing data.
The
last item in the .qst file is a table that examines whether segregation appears
to be Mendelian at each marker. There are two ways to test this: by a
chi-squared goodness-of-fit test or by a likelihood ratio test. This table
contains the test statistics for both tests. Both statistics are
(asymptotically) chi-squared with one degree of freedom.
Your
output for this section should be a copy of the histogram of trait values.
b.
Run a simple one-marker at a time linear analysis.
The
command LRmapqtl will bring up a menu for fitting a simple linear
regression model to the data. This is done one by one for each marker with the
number of thick taproot line alleles as the independent variable and the trait
value as the dependent variable. The error term is assumed to have a normal
distribution with mean zero. At the LRmapqtl menu, type 0 to run
this analysis. The results are in a file called lab3.lr. Take a look at this
file.
You
will turn in a brief statement about where it looks like the QTL might be.
c.
Run a stepwise regression on the data.
We
are running a stepwise regression for two reasons: first, it’s fun, and second,
the stepwise regression will tell us which markers to include when we attack
this problem using composite interval mapping.
To
run a stepwise regression, type SRmapqtl. You might need to change some
of the options here. At option 6 in the menu, the choices are FS (forward
stepwise), BS (backward stepwise), and FB (forward-backward). You want to
choose FB. On options 8 and 9, choose 0.05. When these options have been set,
run the regression by typing 0. The output is in the file lab3.sr.
Your
output for this section should include the list of markers from the stepwise
regression.
d.
Use interval mapping techniques to search for the QTL’s.
Both
Lander & Botstein and Composite interval map techniques are options in
Zmapqtl. To get started, type Zmapqtl. Option 8 lets you choose the type
of analysis. Here they are:
d1. Interval Mapping. First, run
option 3—the regular Lander and Botstein Interval Mapping approach. The results
are in lab3.z, but they aren’t easy to read. Instead, you should make a graph.
Here’s how:
First, type Preplot. You
shouldn’t need to change any options. This will create a file called lab3.plt
that contains all the information for plotting. To make the plots, type gnuplot
lab3.plt. This will immediately show you the results from chromosome 1. To
see the other chromosomes, hit the return key.
One last thing…Zmapqtl will do a
permutation test to assess significance. This is option 14. Permutation tests
take a while, so only do 1000 (if this were a real QTL mapping exercise and not
just a lab, you would do at least 10,000, but that would take a long time).
Don’t do bootstrapping. On my sample data set, it took about six hours to run
the 1000 permutations.
d2. Composite Interval Mapping. Type Zmapqtl
and change option 8 (model type) to 6. This will give you a composite interval
map. Option 13 determines the window size. This should be set to 10. Option 12
(Number of Background Parameters) allows you to set the number of markers you
want to include as parameters in your model. You should set this to 5. Type Preplot
and then gnuplot to see the results. This will add the composite
interval likelihood ratio curve to your plots.
I
recommend that you don’t do permutations for the Composite Interval Mapping. I
tried it and it took 19 hours to do 1000 permutations.
e.
Multiple Interval Mapping
To use multiple interval mapping to
search for QTL’s, the command is MImapqtl. Option 9 allows you to choose
the number of QTL’s you want to fit. You should run this twice—once with three
QTL’s and once with four.
To view your results, run Preplot
and then gnuplot.
To print your results, you need type Preplot
again and have it print to a postscript file. To do this, choose option
10(Terminal), and write postscript. Execute Preplot by typing 0.
Now gnuplot will print to a file called lab3.ps when you type gnuplot
lab3.plt.
Your output from this section should be
the four graphs (one for each chromosome). These graphs should contain the
results of your Interval Mapping, Composite Interval Mapping, and Multiple
Interval Mapping analyses. The significance line will have been set by your
permutation tests on your interval mapping analysis. Clearly mark on your graph
which analysis is which.
f.
Summary of What You Should Turn In:
a.
A histogram showing the distribution of trait values (from
Qstats).
b.
A brief summary of the results of a simple linear regression
analysis of the data. Specifically, based on the results of LRmapqtl, where do
you think the QTL might be? Can you rule out any regions?
c.
A list of the “best” markers from SRmapqtl. How do these
match up with the results of LRmapqtl?
d.
A sentence or two answering the following question: On the
multiple interval mapping analysis, what, if anything, is different between the
analyses with three and four QTL’s?
e.
A plot that shows the results of the interval mapping,
composite interval mapping, and multiple interval mapping analyses. You should
use the multiple interval mapping analysis with 4 markers.
f.
A paragraph comparing the different analyses (in particular
interval mapping and composite interval mapping).
g.
(Mostly for fun) How many QTL’s do you think control the
root thickness in the fictional Zuzububu plant? Where do you think those QTL’s
are located?
2.
Variance Component Mapping using Solar
For
this problem, you will need to use Solar. There are two things you should know
about Solar. First, you get into Solar by typing solar and you get out
of it by typing exit. Second, if you ever need help, just type doc
and a web page containing a list of commands (and other information) will
appear. In this lab, we will be going through most of the tutorial listed on
that web page. One more thing: many Unix commands will work just fine within
Solar. For example, you can use the cd command to move in and out of
directories and you can use less to view files.
Before starting with Solar, you should
make some directories. First, in your statgen directory, create a directory for
your solar files. I called mine solardata. Move into this directory.
Within your solardata directory, create two more directories called ibddir and
mibddir.
From the class web page, you should
copy the following files into your solardata directory:
gaw10.ped ..........
pedigree file
phen ...............
phenotypes file
mrk9 ............... marker
file for chromosome 9
map9 ............... map file
for chromosome 9
With theses data sets, we will look on
chromosome 9 for locus/loci that affect trait q4.
The first model we will look at will be a simple polygenic model. In this model, there is no major gene, just background polygenes. The free parameters are the mean, additive variance, and environmental variance.
To get into solar, type solar. Then produce the following commands:
solar> load pedigree gaw10.ped (this loads the pedigree)
solar> load phenotypes phen (phen contains the phenotypes for our data set)
solar> trait q4 (q4 is the trait of interest in our data set)
solar> polymod (this tells solar to look at phen and get starting parameter values for a polygenic model).
solar> maximize (this tells solar to find the maximum likelihood estimators for the parameter values).
In addition to printing the results to screen, Solar has created a directory called q4 that contains two output files. You can use the Unix less command to look at these, but they don’t contain any information that hasn’t been printed to your screen.
From the results of this analysis, you should report the values of the MLE’s for the mean and additive standard deviation as well as the value of the loglikelihood at the point of maximization.
One slightly more sophisticated approach is to include covariates in the polygenic model. to do this, type the following:
solar> covariate age sex age*sex
solar> polygenic –screen
At this point, solar has considered three models: the sporadic model (the null model—no genetic component to the trait); the polygenic model with whichever of age, sex and age*sex were significant; and the polygenic model without covariates. The information on these models is in the q4 directory in files spor.out, poly.out, and nocovar.out, respectively.
For this analysis, do the following:
Before you can begin a two-point analysis, you must create
ibd matrices for the markers. The following series of commands will do this:
solar> ibddir ibddir (this tells Solar that the ibd directory
will be the directory names ibddir)
solar>load marker mrk9
solar> ibd (this creates the ibd matrices for each marker on chrom. 9)
To run the two-point analysis, type:
solar> twopoint (this takes a while to run, maybe 10-15 minutes)
The final output is written to file twopoint.out in the q4 subdirectory.
For this analysis, do the following:
1. Report the
locus that produces the highest LOD score.
For this analysis, you will need multipoint ibd matrices.
solar> mibddir mibddir
solar>load marker mrk9
solar>load map map9
solar>mibd 1 (this takes a long while, maybe half an hour)
The multipoint analysis works as follows: It does an initial scan, computing LOD scores at test positions along the chromosome (you set the distance between test positions in the command interval). Then, if any test positions have LOD scores above a certain threshold (that you set with the finemap command), multipoint will do a finer mapping (1cM scan) around those positions. That initial scan and the fine map constitute the first pass through the chromosome..
When the first pass is completed, if any test position had a LOD score higher than some threshold (specified in the multipoint) command, another pass will be made. The second pass will assume a QTL exists at the site that produced the highest LOD score in the first pass and, conditioned on the presence of that QTL, will search for a second QTL on the chromosome.
Here are the commands:
solar> chromosome 9
solar> interval 5 (the initial scan will have test positions spaced at 5cM)
solar> finemap 1 0.5 (finemap around positions with LOD’s greater than 1 in the first pass and greater than 0.5 in the second pass)
solar> multipoint 2 1 (search for a second QTL if the first pass yields a LOD score of at least 2.0 and search for a third QTL if the second pass yields a LOD score of at least 1.0)
This takes a while to run. The results from the first pass will be in multipoint1out and the results from the second pass will be in multipoint2.out.
This is simple. Just type:
solar> plot
Your output from this section should include: