Harwood Lab • School of Medicine • University of Washington • Box 357735 • 1705 NE Pacific St • Seattle WA 98195
Example Dataset — Xpression 1.0rc1 documentation

Example Dataset

We provide an example dataset that includes everything needed to run Xpression: a reference genome fasta file, a reference GenBank annotation file, and a sequencing fastq file.

Reference genome fasta file

This is the genome of Rhodopseudomonas palustris CGA009, a finished chromosomal sequence of 5,459,213 bp. In fasta format, each entry begins with a single header line that starts with ‘>’ and gives the reference name and other information. The lines that follow hold that entry's nucleotide sequence in rows of uniform width, continuing until the next header line or the end of the file.

The first three lines of this file are as follows:

>gi|39748133|emb|BX571963.1| Rhodopseudomonas palustris CGA009 complete genome
ATCGGTCGAGGCGAAATCTTCACCCTGCCCTCGGAATCATATCCATTGCAGCGGAGGGGCCGTCGTGGTT
TTCATAGTCCACCCGCGACGCCCACGGCTCTTCAGATCAGCGCGGTTTGAGAACCAAGGGCGGACATGCA
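
The layout described above can be read with a few lines of Python. This is a minimal sketch rather than the reader Xpression itself uses; the function name `read_fasta` is hypothetical, and the input is the three-line excerpt shown above, not the full genome.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {header: sequence} for fasta text supplied as lines."""
    records: Dict[str, List[str]] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A '>' line opens a new entry; drop the marker itself.
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Sequence rows belong to the most recent header.
            records[header].append(line)
    return {name: "".join(rows) for name, rows in records.items()}


excerpt = """>gi|39748133|emb|BX571963.1| Rhodopseudomonas palustris CGA009 complete genome
ATCGGTCGAGGCGAAATCTTCACCCTGCCCTCGGAATCATATCCATTGCAGCGGAGGGGCCGTCGTGGTT
TTCATAGTCCACCCGCGACGCCCACGGCTCTTCAGATCAGCGCGGTTTGAGAACCAAGGGCGGACATGCA
""".splitlines()

genome = read_fasta(excerpt)
for name, seq in genome.items():
    # Print the identifier part of the header and the sequence length.
    print(name.split()[0], len(seq))
```

For a real file, the same function can be given an open file handle instead of the excerpt, since it only iterates over lines.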

Reference Genbank annotation file

This file contains annotations for each feature in the genome. In step 3, Xpression uses every feature of type CDS, tRNA, rRNA, misc_RNA, tmRNA, or ncRNA to generate expression profiles.

Note

This is the ‘full’ GenBank file, as opposed to the version that contains only the entry information.

This file has the following general format:

LOCUS       BX571963             5459213 bp    DNA     circular CON 20-AUG-2004
DEFINITION  Rhodopseudomonas palustris CGA009 complete genome.
ACCESSION   BX571963 AAAF01000000 AAAF01000001 AAAF01000002 AAAF01000003
        AAAF01000004 AAAF01000005 AAAF01000006 AAAF01000007 AAAF01000008
        AAAF01000009 AAAF01000010 AAAF01000011 AAAF01000012 AAAF01000013
        AAAF01000014 AAAF01000015
VERSION     BX571963.1  GI:39748133
KEYWORDS    complete genomes.
SOURCE      Rhodopseudomonas palustris CGA009

...

FEATURES             Location/Qualifiers
 source          1..5459213
                 /organism="Rhodopseudomonas palustris CGA009"
                 /mol_type="genomic DNA"
                 /strain="CGA009"
                 /db_xref="taxon:258594"
 gene            679..2097
                 /gene="dnaA"
                 /locus_tag="RPA0001"
 CDS             679..2097
                 /gene="dnaA"
                 /locus_tag="RPA0001"
                 /function="InterPro IPR001957:IPR003593 COGs COG0593"
                 /inference="non-experimental evidence, no additional
                 details recorded"
                 /codon_start=1
                 /transl_table=11
                 /product="chromosomal replication initiator protein DnaA"
                 /protein_id="CAE25445.1"
                 /db_xref="GI:39652706"
                 /db_xref="GOA:Q6NDV3"
                 /db_xref="InterPro:IPR001957"
                 /db_xref="InterPro:IPR003593"
                 /db_xref="InterPro:IPR010921"
                 /db_xref="InterPro:IPR013159"
                 /db_xref="InterPro:IPR013317"
                 /db_xref="UniProtKB/Swiss-Prot:Q6NDV3"

...
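
To illustrate which entries of the feature table are relevant, the sketch below picks out features whose type is in the list given earlier. It is a simplified, hypothetical scanner (`used_features` is not part of Xpression) that assumes feature keys are indented less than their qualifiers, as in the excerpt above; a dedicated GenBank parser such as the one in Biopython would be more robust for real files.

```python
from typing import Iterable, List, Tuple

# Feature types used to generate expression profiles, as listed in this section.
USED_TYPES = {"CDS", "tRNA", "rRNA", "misc_RNA", "tmRNA", "ncRNA"}


def used_features(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (feature type, location) for each feature of a used type."""
    found = []
    in_features = False
    for line in lines:
        if line.startswith("FEATURES"):
            in_features = True
            continue
        if not in_features or not line.strip():
            continue
        indent = len(line) - len(line.lstrip())
        if indent == 0:
            # An unindented line (for example ORIGIN) ends the feature table.
            in_features = False
            continue
        parts = line.split()
        # Assumption: key lines are shallowly indented and hold "key location".
        if indent < 10 and len(parts) == 2 and parts[0] in USED_TYPES:
            found.append((parts[0], parts[1]))
    return found


sample = [
    "FEATURES             Location/Qualifiers",
    " source          1..5459213",
    '                 /organism="Rhodopseudomonas palustris CGA009"',
    " gene            679..2097",
    '                 /gene="dnaA"',
    " CDS             679..2097",
    '                 /gene="dnaA"',
    '                 /locus_tag="RPA0001"',
]

print(used_features(sample))
```

In this excerpt only the CDS entry is kept; the source and gene entries are skipped because their types are not in the list.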

Sequence fastq file

The fastq file comes from the Illumina GAII platform and uses the Illumina 1.3+ format. It contains 8 samples, multiplexed with 8 unique 4-mer barcodes that were added during cDNA library construction. The barcodes used were ACCC, CGTA, GAGT, TTAG, AGGG, CCAT, GTCA, and TATC. The file contains a total of 2,979,809 reads and has a compressed size of 56 MB.
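
As a rough illustration of how multiplexed reads can be assigned to samples, the sketch below matches the start of a read against the 8 barcodes listed above. It assumes the barcode occupies the first 4 bases of the read and requires an exact match; both are assumptions made for this example, and `split_barcode` is a hypothetical helper, not Xpression's own routine.

```python
from typing import Optional, Tuple

# The 8 barcodes used in this dataset.
BARCODES = ("ACCC", "CGTA", "GAGT", "TTAG", "AGGG", "CCAT", "GTCA", "TATC")


def split_barcode(read: str) -> Tuple[Optional[str], str]:
    """Return (barcode, remaining sequence), or (None, read) if nothing matches."""
    prefix = read[:4]  # assumption: the barcode is the first 4 bases
    if prefix in BARCODES:
        return prefix, read[4:]
    return None, read


barcode, insert = split_barcode("CGTAGCTGTGTGTACAAGGCCCGGGAACGTATTCACCGTG")
print(barcode, insert, len(insert))
```

Reads whose first 4 bases match none of the barcodes are returned unchanged with `None`, so a caller can count or set them aside.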

Reads from the fastq file have the following format:

@HWUSI-EAS300R_0005_FC62TL2AAXX:8:30:18447:12115#0/1
CGTAGCTGTGTGTACAAGGCCCGGGAACGTATTCACCGTG
+HWUSI-EAS300R_0005_FC62TL2AAXX:8:30:18447:12115#0/1
acdd^aa_Z^d^ddc`^_Q_aaa`_ddc\dfdffff\fff

Each read consists of 4 lines. The first and third are title lines that carry sequencer-specific information and can generally be disregarded.

The second line is the nucleotide read, including ligated barcodes if used. The fourth line is the sequencing quality, encoded as one character per base. This quality encoding is specific to the sequencer version and type.
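
The four-line layout can be sketched in Python as follows. The example decodes the quality line with an offset of 64, which is the convention for Illumina 1.3+ files; other encodings use a different offset (Sanger-style files use 33), so the constant would need changing for them. The function name `read_fastq` is hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

PHRED_OFFSET = 64  # Illumina 1.3+ stores each score as chr(score + 64)


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (title, sequence, quality scores) for each 4-line record."""
    cleaned = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(cleaned) - 3, 4):
        title, seq, _plus, qual = cleaned[i:i + 4]
        # Convert each quality character to its numeric score.
        scores = [ord(ch) - PHRED_OFFSET for ch in qual]
        yield title[1:], seq, scores  # drop the leading '@' from the title


# Raw strings keep the backslashes in the quality line intact.
record = [
    r"@HWUSI-EAS300R_0005_FC62TL2AAXX:8:30:18447:12115#0/1",
    r"CGTAGCTGTGTGTACAAGGCCCGGGAACGTATTCACCGTG",
    r"+HWUSI-EAS300R_0005_FC62TL2AAXX:8:30:18447:12115#0/1",
    r"acdd^aa_Z^d^ddc`^_Q_aaa`_ddc\dfdffff\fff",
]

for title, seq, scores in read_fastq(record):
    print(title, len(seq), min(scores), max(scores))
```

The sequence and quality lines have the same length, so each score in `scores` lines up with the base at the same position in `seq`.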

You can read more about the fastq format on Wikipedia.