School of Medicine • University of Washington • Box 357735 • 1705 NE Pacific St • Seattle WA 98195
Example Dataset

We have provided an example dataset that includes everything needed to run Xpression: a reference genome file, a reference genbank file, and a sequencing fastq file.

Reference genome fasta file

This is the genome of Rhodopseudomonas palustris CGA009, a finished chromosomal sequence of 5,459,213 bp. In a fasta file, each entry begins with a single header line that starts with ‘>’ and gives the reference name and other information. The lines that follow hold the nucleotide sequence for that entry, listed in uniform rows until the next entry or the end of the file. The first three lines of this file are as follows:

>gi|39748133|emb|BX571963.1| Rhodopseudomonas palustris CGA009 complete genome
ATCGGTCGAGGCGAAATCTTCACCCTGCCCTCGGAATCATATCCATTGCAGCGGAGGGGCCGTCGTGGTT
TTCATAGTCCACCCGCGACGCCCACGGCTCTTCAGATCAGCGCGGTTTGAGAACCAAGGGCGGACATGCA
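
The header-then-sequence layout described above can be read with a few lines of Python. This is only a minimal sketch for illustration, not part of Xpression; the function name `read_fasta` is hypothetical, and the input is just the three example lines shown above rather than the full genome file.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of fasta lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new '>' line closes the previous entry, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence row belonging to the current entry
    if header is not None:
        yield header, "".join(chunks)


# The first three lines of the example genome file.
example = """\
>gi|39748133|emb|BX571963.1| Rhodopseudomonas palustris CGA009 complete genome
ATCGGTCGAGGCGAAATCTTCACCCTGCCCTCGGAATCATATCCATTGCAGCGGAGGGGCCGTCGTGGTT
TTCATAGTCCACCCGCGACGCCCACGGCTCTTCAGATCAGCGCGGTTTGAGAACCAAGGGCGGACATGCA
"""

records = list(read_fasta(example.splitlines()))
for header, sequence in records:
    # Print the reference name (first word of the header) and the bases read.
    print(header.split()[0], len(sequence))
```

Run against the complete file, the single entry should report the full chromosome length rather than the two rows used here.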

Reference Genbank annotation file

This file contains annotations for each feature in the genome. In step 3, Xpression uses every entry of one of the following types to generate expression profiles: CDS, tRNA, rRNA, misc_RNA, tmRNA, ncRNA.

Note: The file is the ‘full’ file, as opposed to the version which contains only the entry information.

This file has the following general format:

LOCUS       BX571963             5459213 bp    DNA     circular CON 20-AUG-2004
DEFINITION  Rhodopseudomonas palustris CGA009 complete genome.
ACCESSION   BX571963 AAAF01000000 AAAF01000001 AAAF01000002 AAAF01000003
            AAAF01000004 AAAF01000005 AAAF01000006 AAAF01000007 AAAF01000008
            AAAF01000009 AAAF01000010 AAAF01000011 AAAF01000012 AAAF01000013
            AAAF01000014 AAAF01000015
VERSION     BX571963.1  GI:39748133
KEYWORDS    complete genomes.
SOURCE      Rhodopseudomonas palustris CGA009
...
FEATURES             Location/Qualifiers
     source          1..5459213
                     /organism="Rhodopseudomonas palustris CGA009"
                     /mol_type="genomic DNA"
                     /strain="CGA009"
                     /db_xref="taxon:258594"
     gene            679..2097
                     /gene="dnaA"
                     /locus_tag="RPA0001"
     CDS             679..2097
                     /gene="dnaA"
                     /locus_tag="RPA0001"
                     /function="InterPro IPR001957:IPR003593 COGs COG0593"
                     /inference="non-experimental evidence, no additional
                     details recorded"
                     /codon_start=1
                     /transl_table=11
                     /product="chromosomal replication initiator protein DnaA"
                     /protein_id="CAE25445.1"
                     /db_xref="GI:39652706"
                     /db_xref="GOA:Q6NDV3"
                     /db_xref="InterPro:IPR001957"
                     /db_xref="InterPro:IPR003593"
                     /db_xref="InterPro:IPR010921"
                     /db_xref="InterPro:IPR013159"
                     /db_xref="InterPro:IPR013317"
                     /db_xref="UniProtKB/Swiss-Prot:Q6NDV3"
...
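
To make the type filter concrete, the sketch below picks out the feature entries whose type is in the list above. It is an illustration only and not Xpression's own parser. It assumes the standard GenBank column layout (feature keys indented five spaces, qualifiers indented further), reads only the first line of each location, and uses the hypothetical names `USED_TYPES` and `feature_entries`.

```python
# Feature types that, per the text above, are used to build expression profiles.
USED_TYPES = {"CDS", "tRNA", "rRNA", "misc_RNA", "tmRNA", "ncRNA"}


def feature_entries(lines):
    """Yield (feature_type, location) for each entry in the FEATURES table.

    Assumes standard GenBank columns: the feature key starts in column 6.
    Locations that wrap onto a second line are not joined in this sketch.
    """
    in_features = False
    for line in lines:
        if line.startswith("FEATURES"):
            in_features = True
            continue
        if not in_features:
            continue
        if line and not line[0].isspace():
            break  # a new top-level keyword ends the feature table
        # Key lines have exactly five leading spaces; qualifier lines have more.
        if len(line) > 5 and line[:5] == "     " and not line[5].isspace():
            parts = line.split()
            if len(parts) >= 2:
                yield parts[0], parts[1]


# A shortened copy of the feature table from the example above.
example = """\
FEATURES             Location/Qualifiers
     source          1..5459213
                     /organism="Rhodopseudomonas palustris CGA009"
     gene            679..2097
                     /gene="dnaA"
                     /locus_tag="RPA0001"
     CDS             679..2097
                     /gene="dnaA"
                     /locus_tag="RPA0001"
"""

all_entries = list(feature_entries(example.splitlines()))
used = [(kind, loc) for kind, loc in all_entries if kind in USED_TYPES]
print(used)
```

Here the `source` and `gene` entries are skipped and only the `CDS` entry for dnaA is kept. For real work, an established GenBank parser would handle wrapped locations and multi-record files more robustly than this sketch.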

Sequence fastq file

The fastq file is from the GAII platform in Illumina 1.3+ format. It contains 8 samples, multiplexed by means of 8 unique 4-mer barcodes used in the cDNA library construction. The barcodes used were ACCC, CGTA, GAGT, TTAG, AGGG, CCAT, GTCA, and TATC. The file contains a total of 2,979,809 reads and has a compressed size of 56 MB. Reads from the fastq file have the following format:

@HWUSI-EAS300R_0005_FC62TL2AAXX:8:30:18447:12115#0/1
CGTAGCTGTGTGTACAAGGCCCGGGAACGTATTCACCGTG
+HWUSI-EAS300R_0005_FC62TL2AAXX:8:30:18447:12115#0/1
acdd^aa_Z^d^ddc`^_Q_aaa`_ddc\dfdffff\fff

Each read is composed of 4 lines. The first and third lines are titles that carry sequencer-specific data and can generally be disregarded. The second line is the nucleotide read, including ligated barcodes if used. The fourth line is the sequencing quality, encoded as one ASCII character per base. This quality encoding is specific to the sequencer version and type. You can read more about the fastq format on Wikipedia.
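
The four-line record structure, the barcodes, and the quality encoding can be tied together in a short Python sketch. It is not how Xpression itself processes reads. It assumes that the barcode occupies the first four bases of the read (the example read above begins with CGTA, one of the listed barcodes) and that qualities use the Illumina 1.3+ convention of Phred score plus 64. The names `read_fastq`, `split_barcode`, and `decode_quality` are hypothetical.

```python
BARCODES = {"ACCC", "CGTA", "GAGT", "TTAG", "AGGG", "CCAT", "GTCA", "TATC"}
PHRED_OFFSET = 64  # Illumina 1.3+ stores each score as chr(score + 64)


def read_fastq(lines):
    """Yield (title, sequence, quality) for each 4-line fastq record.

    An incomplete record at the end of the input is silently dropped.
    """
    stripped = (line.rstrip("\n") for line in lines)
    # zip over the same iterator four times groups the lines in fours.
    for title, sequence, plus, quality in zip(*[stripped] * 4):
        if not title.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed fastq record: " + title)
        yield title[1:], sequence, quality


def split_barcode(sequence, quality, length=4):
    """Return (barcode or None, trimmed sequence, trimmed quality).

    Assumption: the barcode is the first `length` bases of the read.
    """
    prefix = sequence[:length]
    if prefix in BARCODES:
        return prefix, sequence[length:], quality[length:]
    return None, sequence, quality


def decode_quality(quality):
    """Convert a quality string into a list of integer Phred scores."""
    return [ord(char) - PHRED_OFFSET for char in quality]


# The example read from above; the raw string keeps the backslashes literal.
example = [
    "@HWUSI-EAS300R_0005_FC62TL2AAXX:8:30:18447:12115#0/1",
    "CGTAGCTGTGTGTACAAGGCCCGGGAACGTATTCACCGTG",
    "+HWUSI-EAS300R_0005_FC62TL2AAXX:8:30:18447:12115#0/1",
    r"acdd^aa_Z^d^ddc`^_Q_aaa`_ddc\dfdffff\fff",
]

for title, sequence, quality in read_fastq(example):
    barcode, insert, insert_quality = split_barcode(sequence, quality)
    scores = decode_quality(insert_quality)
    print(barcode, len(insert), min(scores), max(scores))
```

Trimming the quality string together with the sequence keeps the two the same length, which downstream tools generally expect.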