
Agrobacterium tumefaciens C58 genome project: Methods and supplemental data

The information in this supplement complements our recent analysis of the genome sequence published in the 14 December 2001 issue of Science: Wood, D. W., J. C. Setubal, R. Kaul, D. E. Monks, J. P. Kitajima, V. K. Okura, Y. Zhou, et al. The genome of the natural genetic engineer Agrobacterium tumefaciens C58. Science 294 (5550): 2317-2323. The data at this site will be periodically updated to reflect continuing analyses. Please refer any questions to agro@u.washington.edu.

Original supplemental data is available online at the Science website. NOTE: Supplemental data at the Science site utilizes our original gene numbering system which has been subsequently updated. Cross matches between old and new numbering systems are available.

We will host access to supplemental data for the accompanying sequence paper by Goodner et al. (Science 294 (5550): 2323-2328). This data will be available shortly in a version that will be periodically updated by the group at Cereon Genomics. This data is also available at Science Online. This data was provided by Dr. Steven Slater of Cereon Genomics; please address any questions to steven.c.slater@cereon.com.

Genbank accession numbers
Sequencing
Southern analysis of linear chromosome
Origin of replication
Annotation
Paralogous families
GC content
Codon usage
Insertion sequences analysis
Phylogenetic analyses
    COG
    Phylogenetic trees
        16S DNA
        Protein (manuscript fig. 2B/C)
Whole genome comparisons and unique genes
    nucleotide alignments
    protein comparisons
        Top blast hit analysis (manuscript fig. 2A)
        Bi-directional Best Hit (BBH) analysis (manuscript fig. 3)
        Ortholog analysis (manuscript fig. 1, Table 2)
        Unique genes
Transporter analysis
Regulatory families
Metabolic pathway analysis
References

Most supplementary data are provided in .pdf format. You will need to install the Acrobat Reader program from Adobe Systems to view these files.

GENBANK ACCESSION NUMBERS

The sequence has been deposited at Genbank under the following accession numbers: Circular chromosome (AE008688), linear chromosome (AE008689), pAtC58 (AE008687), pTiC58 (AE008690). These sequences will be available publicly following the publication of the manuscript.

SEQUENCING

Agrobacterium tumefaciens strain C58 was sequenced by the shotgun approach (1) using standard DNA sequencing protocols and data collection tools. Two small-insert libraries were constructed in pUC18, with average insert lengths of 2.0 kb and 5.0 kb, respectively. DNA was physically sheared by nebulization and size-fractionated by agarose gel electrophoresis. DNA was extracted from 96-well plate cultures using the Qiagen (Chatsworth, CA) R.E.A.L. method. Cosmid and fosmid libraries, with inserts ranging between 40 and 50 kb, were also generated. All clones were sequenced from both ends using M13 universal primers. Sequencing reactions were performed with dRhodamine dye terminator or BigDye terminator chemistry (Applied Biosystems Prism Dye Terminator Reaction Ready) and run on ABI 377-XL DNA sequencers or ABI 3700 CE sequencers.

A total of 135,703 reads (132,226 small-insert shotgun reads, 1,359 cosmid and 2,118 fosmid end-sequencing reads) were generated. Sequences were assembled and viewed using the phred/phrap/consed software (http://www.phrap.org). The assembly was carried out on a 667-MHz dual-processor Compaq Alpha computer with 8 GB of memory. The initial sequence assembly took about nine hours. To facilitate opening and viewing the genome assemblies in consed, a phd.ball file was created from all phd files. This reduced the time to open the genome assembly in consed to less than 15 min. The initial assembly provided 8.68X Q20 sequence coverage (Q20: phred error rate of 1% or less) (2), with 99.4% genome coverage. The genome was finished using the autofinish tool of consed (16). The sequences were periodically reassembled, with each assembly requiring about five and a half hours of processing time. The final assembly contained 131,299 reads, including 3,730 reads from autofinish and advanced finishing experiments. Five gross misassemblies were identified by analysis of the paired fosmid end-sequences. They were all due to the presence of nearly identical repeat sequences in the genome. These repeat sequences included four copies of the 6.7-kb ribosomal DNA (two copies each in the circular and linear replicons) and three copies of a 1,598-base sequence (one copy in each of the circular, linear and pAtC58 replicons). Unique fosmid clones that spanned misassembled regions were selected and sequenced to 8X Q20 coverage. The final validation of the sequence assembly was achieved by comparing the restriction fragment digests of 269 fosmid clones (providing nearly continuous 2X genome coverage) with the computational restriction digests of the final assembly.

An independent validation of the sequence quality was provided by comparison of the final C58 sequence based on the whole genome assembly with the independently finished sequence of two randomly chosen fosmid clones. The two quality-control fosmid clones spanned a combined length of 81,801 bases, accounting for 1.44% of the genome. There were no discrepancies between the whole genome assembly and the sequence of the two quality-control fosmid clones.
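
The Q20 threshold quoted above follows the phred convention, in which a quality score Q corresponds to an estimated per-base error probability of 10^(-Q/10). The following is a minimal Python sketch of that conversion; the function names are illustrative and are not part of the phred/phrap/consed software.

```python
import math


def phred_to_error_prob(q):
    """Estimated per-base error probability for a phred quality score q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred quality score corresponding to an estimated error probability p."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    # Q20 corresponds to an estimated error rate of 1 in 100 bases.
    print(phred_to_error_prob(20))
    print(error_prob_to_phred(0.001))
```

Under this convention, a base counts towards Q20 coverage when its estimated error probability is 1% or lower.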

SOUTHERN ANALYSIS OF LINEAR CHROMOSOME

Genomic DNA was prepared from A. tumefaciens strain C58, including protease treatment, as previously described (3). Restriction sites were chosen and digests performed such that small fragments of defined size were generated from each end. Three restriction sites were chosen at each end, for a total of six digests. Southern blot analysis was performed on the digested DNA using radiolabelled probes specific for each end. The resulting band sizes were consistent with those predicted from the linear chromosome sequence data, indicating that complete sequence had been obtained to within a 40-bp agarose gel resolution.

ORIGINS OF REPLICATION

The putative origin of replication for the circular chromosome was determined based on gene cluster conservation (4) and confirmed by GC-skew [(G-C)/(G+C)] analysis (10 kb sliding window). Base 1 of this sequence was chosen to be an arbitrary base midway between genes AtC58-1970 (ortholog to RP001) and AtC58-1971 (hemE). The genes around this putative origin are the following:

-parB -parA -gidB -gidA -thdF +hypo(*) +hypo(*) -TFrho -RP883 -hemH(*) -hemE ORIGIN +RP001 +maf +aroE +hypo/coaE +dnaQ

The signs +/- denote strand. The order and content depicted are those found in Caulobacter crescentus, as described in (4). Those marked with (*) are not found in Agrobacterium tumefaciens C58 at this location. Sinorhizobium meliloti has exactly the same genes as A. tumefaciens in the region depicted, and its origin has been chosen at the point indicated above.

The origin of replication of the AT plasmid is thought to be just upstream of the repABC operon (5). We chose base 1 of this sequence as an arbitrary base midway between genes AtC58-436 (repA) and AtC58-435 (hypothetical). The origin of replication of the Ti plasmid is also thought to be near its repC gene, but we chose base 1 as the first base of the left T-border sequence, consistent with previously published Ti plasmid sequences from other A. tumefaciens strains.

The origin of replication of the linear chromosome is thought to be close to its center (which is near position 1,037,780). There is a repABC operon close to the center (repA start is at 1,026,567), and at position 1,076,640 there is a major signal inversion of the GC-skew curve (10 kb sliding window).
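
The GC-skew analysis mentioned in this section can be sketched as follows. This is an illustrative Python example, not the program used for the analysis. The 10 kb window comes from the text; the step size, the function names, and the treatment of the sequence as linear (no wrap-around for circular replicons) are assumptions.

```python
def gc_skew(seq, window=10_000, step=1_000):
    """Return (window_start, skew) pairs, where skew = (G - C) / (G + C)."""
    seq = seq.upper()
    points = []
    # max(..., 1) ensures a single (short) window for sequences below window size.
    for start in range(0, max(len(seq) - window + 1, 1), step):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        skew = (g - c) / (g + c) if (g + c) else 0.0
        points.append((start, skew))
    return points


def sign_changes(points):
    """Window starts where the skew has the opposite sign to the previous window."""
    changes = []
    for (_, prev), (pos, cur) in zip(points, points[1:]):
        if prev * cur < 0:
            changes.append(pos)
    return changes


if __name__ == "__main__":
    toy = "G" * 30 + "C" * 30
    print(sign_changes(gc_skew(toy, window=10, step=10)))
```

A sign inversion of the skew curve is one signal that may mark an origin or terminus of replication; in this sketch, transitions that pass through a window with skew exactly zero are not reported.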

ANNOTATION

Annotation was carried out on an annotation system and interface developed at the Bioinformatics Laboratory of the University of Campinas, Brazil.

Putative genes were identified using BLASTX (6) and Glimmer (7). Protein function was predicted based on comparisons to sequences in the public databases Genbank (using BLAST (6)), PFAM (8), and COG (9). These comparisons were used as auxiliary tools by human curators, who annotated each gene. The curators assigned predicted proteins to one of 18 functional categories modified from those described by Riley (10). RNA species were identified using BLASTN (6) and tRNAscan-SE (11). We thank K. Williams for assistance in identifying the tmRNA borders.

A text file containing the complete list of A. tumefaciens protein-coding genes separated in categories is available. This file is 586 kilobytes long. Gene maps for each replicon are available from the following links:

PARALOGOUS FAMILIES

A gene from A. tumefaciens (the query) was said to be paralogous to another A. tumefaciens gene (the subject) if a BLASTP (6) of the query against the subject resulted in a hit with e-value less than or equal to 10^-5 and query and subject coverage were both at least 60%, with some additional clustering processing. The BLAST subject database in this case was the entire A. tumefaciens proteome. A complete list of the paralogous families and their membership is available.
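
The e-value and coverage criteria can be illustrated with a short Python sketch. The `Hit` record and its field names are hypothetical, coverage is assumed to mean aligned length divided by full protein length, the exclusion of self-hits is an assumption, and the additional clustering step is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One BLASTP hit (hypothetical record layout)."""
    query: str
    subject: str
    evalue: float
    q_start: int  # 1-based, inclusive alignment coordinates on the query
    q_end: int
    s_start: int  # 1-based, inclusive alignment coordinates on the subject
    s_end: int
    q_len: int    # full length of the query protein
    s_len: int    # full length of the subject protein


def passes_criteria(hit, max_evalue=1e-5, min_coverage=0.60):
    """True if the hit meets the e-value cutoff and both coverage cutoffs."""
    q_cov = (hit.q_end - hit.q_start + 1) / hit.q_len
    s_cov = (hit.s_end - hit.s_start + 1) / hit.s_len
    return (
        hit.query != hit.subject          # ignore self-hits (assumption)
        and hit.evalue <= max_evalue
        and q_cov >= min_coverage
        and s_cov >= min_coverage
    )


if __name__ == "__main__":
    example = Hit("geneA", "geneB", 1e-30, 1, 80, 5, 74, 100, 100)
    print(passes_criteria(example))
```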

GC CONTENT

Manuscript figure 1 contains information on GC content on a gene-by-gene basis. Here we present GC content graphs using a 5 kb sliding window. Datapoints on each graph reflect the GC content of the window centered at that point.
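
A centered sliding-window GC calculation of this kind can be sketched in Python as follows. The 5 kb window is from the text; the step size, the function name, and the choice to skip positions whose window would run past either end of the sequence are assumptions made for illustration.

```python
def gc_content_windows(seq, window=5_000, step=500):
    """Return (center, gc_fraction) for each window lying fully inside seq.

    Assumes window >= 2; an odd window is truncated to window - 1 bases.
    """
    seq = seq.upper()
    half = window // 2
    points = []
    for center in range(half, len(seq) - half + 1, step):
        chunk = seq[center - half:center + half]
        gc = chunk.count("G") + chunk.count("C")
        points.append((center, gc / len(chunk)))
    return points


if __name__ == "__main__":
    for center, gc in gc_content_windows("GGCCAATT", window=4, step=2):
        print(center, gc)
```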

CODON USAGE

Codon frequencies were computed using the codonw program written by John Peden: http://molbiol.ox.ac.uk/cu. A comparison of codon usage frequencies for the entire genome, each of the replicons and the T-DNA, vir genes and AT island is available. (Note: Please use the magnify function of Acrobat reader to view the file details since the file size is quite large and it opens in a small format window).

This table is divided into 8 sections: the whole genome, one section for each of the replicons, the AT island, T-DNA, and the genes of the vir region. For each section, the table lists the frequencies at which codons occur within groups of codons for the same amino acid. For example, if an amino acid has 6 codons, the relative frequencies for its codons add up to 6. This means that codons with relative frequencies greater than 1 are overrepresented with respect to uniform usage. Similarly, codons with relative frequencies less than 1 are underrepresented. Therefore, relative frequencies reflect biases in codon usage. Two simple comparisons were made to compare codon usage in subsets of genes with codon usage in the genome as a whole. First, codon usage frequencies for individual codons were expressed as a percentage of the frequency in the genome as a whole. Second, the absolute difference between these same two frequencies was determined.
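
The relative frequencies described above, which sum to the number of synonymous codons for each amino acid, can be computed as in the following Python sketch. It is an illustration only and is not the codonw implementation; it assumes the standard genetic code, in-frame coding sequences, and that stop codons are excluded.

```python
from collections import Counter, defaultdict

# Standard genetic code, codons enumerated in T, C, A, G order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = dict(zip((a + b + c for a in _BASES for b in _BASES for c in _BASES), _AMINO))


def relative_codon_usage(cds_sequences):
    """Relative frequency of each codon within its synonymous family.

    For an amino acid with n codons the returned values sum to n, so a
    value above 1 means the codon is used more often than under uniform usage.
    """
    counts = Counter()
    for cds in cds_sequences:
        cds = cds.upper()
        for i in range(0, len(cds) - len(cds) % 3, 3):
            codon = cds[i:i + 3]
            if codon in CODE:  # skips codons with ambiguous bases
                counts[codon] += 1

    families = defaultdict(list)
    for codon, amino in CODE.items():
        if amino != "*":  # stop codons excluded (assumption)
            families[amino].append(codon)

    values = {}
    for family in families.values():
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid not observed in the input
        for c in family:
            values[c] = counts[c] * len(family) / total
    return values


if __name__ == "__main__":
    usage = relative_codon_usage(["CTGCTGCTGCTA"])
    print(usage["CTG"], usage["CTA"])
```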

Graphs showing codon bias on all 4 replicons are also available. They were done following the methodology proposed by Karlin (21) and using a sliding window of 20 kb. In this methodology codon bias is found by comparing the codon usage in the genes in each window to the codon usage of the genome as a whole.

circular chromosome. The spike around position 1,920,000 contains a major cluster of ribosomal protein genes.
linear chromosome. The spike around position 930,000 contains phage-related genes. The spike around position 1,750,000 contains transposases. Note also the spikes at both ends.
pAtC58. The very first spike corresponds to the AT-island.
pTiC58. The first spike corresponds to the T-DNA. The last two spikes correspond to the vir region.

INSERTION SEQUENCE ANALYSIS

Insertion Sequences (IS) were identified manually using tools available at the Insertion Sequence database (http://www-IS.biotoul.fr/is.html).

PHYLOGENETIC ANALYSES

COG analysis

Predicted proteins of A. tumefaciens and S. meliloti were run through the COGnitor program (9). The results were combined with the COG database available from NCBI, which contained 44 complete genomes at the time of analysis (August, 2001). A comparison of the major COG groups for all sequenced organisms sorted by genome size is available.

Phylogenetic trees

16S rDNA trees
For analysis of rDNA sequences, an alignment of small subunit rDNA sequences for alpha-proteobacteria was downloaded from the European Large Subunit Ribosomal RNA Database (14). The 16S rDNA sequences for the three other alpha-proteobacteria (Sinorhizobium meliloti, Mesorhizobium loti, and Caulobacter crescentus) for which complete genomes are available were added to this alignment manually. Phylogenetic trees were generated from this alignment (after redundant and incomplete sequences were removed and poorly aligned columns were excluded) and from selected subsets of the alignment using parsimony, distance, and likelihood methods available in the PAUP (http://paup.csit.fsu.edu/about.html) program. The 16S rDNA tree containing A. tumefaciens, S. meliloti, M. loti and related species is available.

Protein sequence trees
For analysis of protein sequences, a set of proteins was chosen (RecA, EF-Tu, EF-G, HSP70, HSP60, RpoA, RpoB, RpoC, some ribosomal proteins, DnaJ) for which molecular systematics has been shown to be reasonably reliable and for which homologs are available in all complete genomes of free-living bacteria. For each protein, a multiple sequence alignment was generated including all homologs. A tree was generated from these alignments, after ambiguously aligned positions were excluded, using PAUP distance methods and a distance calculation based on PAM matrices. Trees in the manuscript figure 2B and 2C were made using these same methods. Trees for each of the other proteins listed are available by request from Jonathan Eisen at The Institute for Genomic Research.

WHOLE GENOME COMPARISONS AND UNIQUE GENES

Nucleotide alignments

The A. tumefaciens genome was compared at the nucleotide level to other genomes using MUMmer (15) with default parameter values. The following nucleotide comparisons are available:

A. tumefaciens circular chromosome X S. meliloti chromosome

A. tumefaciens circular chromosome X M. loti chromosome

S. meliloti chromosome X M. loti chromosome

Note: The M. loti chromosome sequence was circularly shifted in these comparisons. Base 1 here corresponds to base 3572980 in the original sequence.

Protein comparisons

Top BLAST hit analysis

Manuscript figure 2A was generated by comparing predicted proteins of A. tumefaciens with proteins from all published complete genomes using fasta3 (13). Top BLAST hits for all genes were cataloged and the percent of top BLAST hits from each organism was calculated. Top hits for each replicon are also available.

Bi-directional Best Hit (BBH) analysis

Bi-directional best hits were determined using the following approach. A bi-directional best hit (BBH) is a pair (p1, p2) of proteins, p1 in genome A and p2 in genome B, such that when the proteome of A is BLASTed against the proteome of B, p2 comes out as the best hit for p1, and when B is BLASTed against A, p1 comes out as the best hit for p2. The cutoff e-value used was 10^-4.
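
The BBH definition can be expressed as a short Python sketch. It assumes that BLAST results have already been parsed into (query, subject, e-value) tuples, that the best hit is the one with the lowest e-value, that ties go to the first hit encountered, and that the cutoff is inclusive. None of these details are specified in the text, and the function names are illustrative.

```python
def best_hits(hits, max_evalue=1e-4):
    """Map each query to its lowest-e-value subject among hits passing the cutoff."""
    best = {}
    for query, subject, evalue in hits:
        if evalue > max_evalue:
            continue
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a, max_evalue=1e-4):
    """Return sorted (p1, p2) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b, max_evalue)
    best_ba = best_hits(b_vs_a, max_evalue)
    return sorted(
        (p1, p2) for p1, p2 in best_ab.items() if best_ba.get(p2) == p1
    )


if __name__ == "__main__":
    a_vs_b = [("a1", "b1", 1e-50), ("a1", "b2", 1e-10), ("a2", "b2", 1e-20)]
    b_vs_a = [("b1", "a1", 1e-48), ("b2", "a1", 1e-12), ("b2", "a2", 1e-8)]
    print(bidirectional_best_hits(a_vs_b, b_vs_a))
```

In the toy data, a2 has b2 as its best hit, but b2 matches a1 more strongly, so the pair (a2, b2) is not a BBH.
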
BBHs were used to generate the proteome alignments in manuscript figure 3 and those shown below. The graphs in the following list show whole proteome comparisons of the A. tumefaciens linear replicon with the chromosomes of S. meliloti and M. loti. As in manuscript figure 3, each data point is a BBH. In these graphs, the regions of gene order conservation can be seen as strings of consecutive or nearly consecutive points. The graphs are enlarged to show two extensive regions of gene order conservation. The first graph shows the conservation with respect to the S. meliloti chromosome. In S. meliloti, these genes are located in a section that is not colinear with the A. tumefaciens circular chromosome. The second graph shows that roughly the same two regions are conserved with respect to the M. loti chromosome.

A. tumefaciens linear chromosome X S. meliloti chromosome

A. tumefaciens linear chromosome X M. loti chromosome

A detailed view of one of the regions with extensive gene order conservation mentioned above can be seen here. This table was generated by a program described in (12).

Ortholog analysis

The numbers of orthologs shown in manuscript table 2 were determined using the following methods. These same data were used to color code the individual open reading frames in manuscript figure 1. This definition of orthologs was used instead of BBHs to allow the orthologs of paralogs to be found. Two proteins were considered orthologs if their BLASTP alignment covered at least 60% of each protein at an expect value of less than or equal to 10^-5. Proteins that did not match these criteria were considered non-orthologous. Manuscript figure 1 was generated by a program adapted from the genome_plot program. Genome_plot was written and kindly sent to us by Rene Gibson, from Genome Therapeutics Corporation.

Unique genes

Unique genes were defined in the following manner. A protein p from genome A was considered unique with respect to genome B if a BLASTP using p as query against the proteome of B yielded no hits at a threshold expect value of 10^-3. Here is a list of unique genes in A. tumefaciens with respect to S. meliloti and M. loti.
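
The uniqueness test can be sketched in Python as follows, assuming parsed BLASTP results as (query, subject, e-value) tuples and an inclusive threshold; the function name and input layout are illustrative and not taken from the original analysis.

```python
def unique_proteins(proteome_a_ids, hits_a_vs_b, max_evalue=1e-3):
    """Proteins of genome A with no hit in proteome B at or below max_evalue."""
    with_hit = {
        query for query, _subject, evalue in hits_a_vs_b if evalue <= max_evalue
    }
    return [p for p in proteome_a_ids if p not in with_hit]


if __name__ == "__main__":
    ids = ["p1", "p2", "p3"]
    vs_first = [("p1", "x1", 1e-20), ("p2", "x2", 0.5)]
    vs_second = [("p2", "y7", 1e-9)]
    # Unique with respect to both genomes: the intersection of the two lists.
    both = set(unique_proteins(ids, vs_first)) & set(unique_proteins(ids, vs_second))
    print(sorted(both))
```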

TRANSPORTER ANALYSIS

Transporters were identified using BLASTP (6) and HMM-based searches against a database of known and putative membrane transport proteins and classified into families based on the TC system as previously described (17). Here is a complete description of A. tumefaciens transporter predictions. See http://www-biology.ucsd.edu/~ipaulsen/transport for additional details on the methods.

REGULATORY FAMILIES

Methods similar to those employed in the Pseudomonas aeruginosa genome project (18) were used to define the regulatory motifs present in the A. tumefaciens, S. meliloti, M. loti, C. crescentus, and P. aeruginosa genomes. Regulatory family models were extracted from the PFAM 6.6 database (8), and used to generate a local database. This database was then searched using HMMER 2.2g (HMMER User's guide: http://hmmer.wustl.edu) using as queries the predicted proteins in the genomes of each of these organisms. A motif was assigned if the search resulted in a match with an expect value less than or equal to 10^-4. Sensor/response regulator hybrids were defined as proteins which contained both response regulator and sensor kinase motifs. Regulatory proteins may have more than one motif. The numbers of regulatory motifs found in the A. tumefaciens, S. meliloti, M. loti, C. crescentus and P. aeruginosa genomes are shown. The distribution among each of the A. tumefaciens replicons is also available.
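
The assignment rules in this section can be illustrated with a brief Python sketch. It assumes the HMMER output has already been parsed into (protein, family, e-value) tuples; the family labels `sensor_kinase` and `response_regulator` are placeholders and not actual PFAM model names.

```python
from collections import Counter, defaultdict


def assign_motifs(hmm_hits, max_evalue=1e-4):
    """Map each protein to the set of families matched at or below max_evalue."""
    motifs = defaultdict(set)
    for protein, family, evalue in hmm_hits:
        if evalue <= max_evalue:
            motifs[protein].add(family)
    return dict(motifs)


def hybrids(motifs, sensor="sensor_kinase", regulator="response_regulator"):
    """Proteins carrying both a sensor kinase and a response regulator motif."""
    return sorted(
        p for p, fams in motifs.items() if sensor in fams and regulator in fams
    )


def motif_counts(motifs):
    """Number of proteins carrying each motif; a protein may count more than once."""
    return Counter(family for fams in motifs.values() for family in fams)


if __name__ == "__main__":
    hits = [
        ("prot1", "sensor_kinase", 1e-30),
        ("prot1", "response_regulator", 1e-12),
        ("prot2", "response_regulator", 1e-8),
        ("prot3", "sensor_kinase", 0.01),
    ]
    assigned = assign_motifs(hits)
    print(hybrids(assigned))
    print(motif_counts(assigned))
```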

METABOLIC PATHWAY ANALYSIS

We analyzed the metabolic pathways of A. tumefaciens with the PathoLogic program (19) to assess the evidence that pathways in the MetaCyc pathway database (20) are present in this organism. The analysis detected 178 metabolic pathways containing 755 reaction steps, of which 467 steps had enzymes assigned and 288 lacked enzyme assignments. To assess the presence or absence of a pathway, the analysis emphasized the presence of enzymes that are unique to a pathway, to decrease the likelihood of being misled by the many enzymes that are shared among multiple pathways.

The complete results of this analysis are available at: http://ecocyc.org:1555/AGRO/organism-summary?object=AGRO. The data for metabolic pathways was assembled by our collaborators at SRI International, Drs. Peter Karp and Pedro Romero.

REFERENCES

1. R. D. Fleischmann et al., Science 269, 496-512. (1995).
2. B. Ewing, L. Hillier, M. C. Wendl, P. Green, Genome Res 8, 175-85. (1998).
3. T. Maniatis, E. F. Fritsch, J. Sambrook, Molecular cloning: A laboratory manual (Cold Spring Harbor Laboratory Press, Plainview, NY, 1989).
4. A. K. Brassinga, R. Siam, G. T. Marczynski, J Bacteriol 183, 1824-9. (2001).
5. M. A. Ramirez-Romero, N. Soberon, A. Perez-Oseguera, J. Tellez-Sosa, M. A. Cevallos, J Bacteriol 182, 3117-24. (2000).
6. S. F. Altschul et al., Nucleic Acids Res 25, 3389-402. (1997).
7. A. L. Delcher, D. Harmon, S. Kasif, O. White, S. L. Salzberg, Nucleic Acids Res 27, 4636-41. (1999).
8. A. Bateman et al., Nucleic Acids Res 28, 263-6. (2000).
9. R. L. Tatusov et al., Nucleic Acids Res 29, 22-28. (2001).
10. M. Riley, Microbiol Rev 57, 862-952. (1993).
11. T. M. Lowe, S. R. Eddy, Nucleic Acids Res 25, 955-64. (1997).
12. J.C. Setubal, N. F. Almeida Jr., DIMACS Workshop on whole genome comparison, Rutgers University. (2001).
13. W. R. Pearson, Methods Mol Biol 132, 185-219. (2000).
14. J. Wuyts, P. De Rijk, Y. Van de Peer, T. Winkelmans, R. De Wachter, Nucleic Acids Res 29, 175-7. (2001).
15. A. L. Delcher et al., Nucleic Acids Res 27, 2369-76. (1999).
16. D. Gordon, C. Desmarais, P. Green, Genome Res 11, 614-25. (2001).
17. I. T. Paulsen, L. Nguyen, M. K. Sliwinski, R. Rabus, M. H. Saier, Jr., J Mol Biol 301, 75-100. (2000).
18. C. K. Stover et al., Nature 406, 959-64. (2000).
19. P. D. Karp, M. Krummenacker, S. Paley, J. Wagg, Trends Biotechnol 17, 275-81. (1999).
20. P. D. Karp et al., Nucleic Acids Res 28, 56-9. (2000).
21. S. Karlin, Trends Microbiol 9, 335-343. (2001).

Last update: 141523 Dec 01 PST

Please refer any questions or comments to agro@u.washington.edu.

Web design and maintenance: Derek Wood