.. _ex_data:

Example Dataset
================

We have provided an example dataset that includes everything needed to run Xpression: a reference genome file, a reference genbank file, and a sequencing fastq file.

* `Xpression example dataset `_

Reference genome fasta file
-----------------------------

This is the genome of `Rhodopseudomonas palustris` CGA009, which contains 5,459,213 bp in a finished chromosomal sequence. Each entry in a fasta file begins with a single header line: a '>' followed by the reference name and other information. The subsequent lines hold the nucleotide sequence of that entry in rows of uniform width, until the next entry or the end of the file. The first three lines of this file are as follows::

    >gi|39748133|emb|BX571963.1| Rhodopseudomonas palustris CGA009 complete genome
    ATCGGTCGAGGCGAAATCTTCACCCTGCCCTCGGAATCATATCCATTGCAGCGGAGGGGCCGTCGTGGTT
    TTCATAGTCCACCCGCGACGCCCACGGCTCTTCAGATCAGCGCGGTTTGAGAACCAAGGGCGGACATGCA

Reference Genbank annotation file
---------------------------------

This file contains annotations for each feature in the genome. In step 3, Xpression uses every entry of the following types to generate expression profiles: ``CDS``, ``tRNA``, ``rRNA``, ``misc_RNA``, ``tmRNA``, and ``ncRNA``.

.. note::
   The file is the 'full' file, as opposed to the version that contains only the entry information.

This file has the following general format::

    LOCUS       BX571963             5459213 bp    DNA     circular CON 20-AUG-2004
    DEFINITION  Rhodopseudomonas palustris CGA009 complete genome.
    ACCESSION   BX571963 AAAF01000000 AAAF01000001 AAAF01000002 AAAF01000003
                AAAF01000004 AAAF01000005 AAAF01000006 AAAF01000007 AAAF01000008
                AAAF01000009 AAAF01000010 AAAF01000011 AAAF01000012 AAAF01000013
                AAAF01000014 AAAF01000015
    VERSION     BX571963.1  GI:39748133
    KEYWORDS    complete genomes.
    SOURCE      Rhodopseudomonas palustris CGA009
    ...
    FEATURES             Location/Qualifiers
         source          1..5459213
                         /organism="Rhodopseudomonas palustris CGA009"
                         /mol_type="genomic DNA"
                         /strain="CGA009"
                         /db_xref="taxon:258594"
         gene            679..2097
                         /gene="dnaA"
                         /locus_tag="RPA0001"
         CDS             679..2097
                         /gene="dnaA"
                         /locus_tag="RPA0001"
                         /function="InterPro IPR001957:IPR003593 COGs COG0593"
                         /inference="non-experimental evidence, no additional
                         details recorded"
                         /codon_start=1
                         /transl_table=11
                         /product="chromosomal replication initiator protein DnaA"
                         /protein_id="CAE25445.1"
                         /db_xref="GI:39652706"
                         /db_xref="GOA:Q6NDV3"
                         /db_xref="InterPro:IPR001957"
                         /db_xref="InterPro:IPR003593"
                         /db_xref="InterPro:IPR010921"
                         /db_xref="InterPro:IPR013159"
                         /db_xref="InterPro:IPR013317"
                         /db_xref="UniProtKB/Swiss-Prot:Q6NDV3"
    ...

Sequence fastq file
--------------------

The fastq file is from the GAII platform in Illumina 1.3+ format. It contains 8 samples, multiplexed by means of 8 unique 4-mer barcodes used in the cDNA library construction. The barcodes used were ``ACCC``, ``CGTA``, ``GAGT``, ``TTAG``, ``AGGG``, ``CCAT``, ``GTCA``, and ``TATC``. The file contains a total of 2,979,809 reads and has a compressed size of 56 MB. Reads from the fastq file have the following format::

    @HWUSI-EAS300R_0005_FC62TL2AAXX:8:30:18447:12115#0/1
    CGTAGCTGTGTGTACAAGGCCCGGGAACGTATTCACCGTG
    +HWUSI-EAS300R_0005_FC62TL2AAXX:8:30:18447:12115#0/1
    acdd^aa_Z^d^ddc`^_Q_aaa`_ddc\dfdffff\fff

Each read is composed of 4 lines. The first and third are title lines carrying sequencer-specific data and can generally be disregarded. The second line is the nucleotide read, including ligated barcodes if used. The fourth line is the sequencing quality, encoded as one character per base. This quality encoding is specific to the sequencer version and type. You can read more about the `fastq format `_ on Wikipedia.
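To make the four-line read layout concrete, the following is a minimal Python sketch that parses the example read, separates the barcode, and decodes the quality line. It is an illustration only and is not taken from Xpression itself. The function names (``read_fastq``, ``split_barcode``, ``decode_quality``) are hypothetical, and the sketch assumes that the barcode occupies the first four bases of each read (as in the example read, which begins with ``CGTA``) and that qualities use the Phred+64 offset of the Illumina 1.3+ format.

.. code-block:: python

    from itertools import islice

    # The eight 4-mer barcodes listed for the example dataset.
    BARCODES = {"ACCC", "CGTA", "GAGT", "TTAG", "AGGG", "CCAT", "GTCA", "TATC"}

    # Illumina 1.3+ stores each quality score Q as the character chr(Q + 64).
    PHRED_OFFSET = 64


    def read_fastq(lines):
        """Yield (title, sequence, quality) for each 4-line fastq record."""
        it = iter(lines)
        while True:
            record = [line.rstrip("\n") for line in islice(it, 4)]
            if len(record) < 4:
                # End of input; an incomplete trailing record is dropped.
                return
            title, seq, plus, qual = record
            if not title.startswith("@") or not plus.startswith("+"):
                raise ValueError("malformed fastq record: %r" % title)
            yield title[1:], seq, qual


    def split_barcode(seq, barcodes=BARCODES, length=4):
        """Return (barcode, remainder); barcode is None if the prefix is unknown.

        Assumption: the barcode is the first `length` bases of the read.
        """
        prefix = seq[:length]
        if prefix in barcodes:
            return prefix, seq[length:]
        return None, seq


    def decode_quality(qual, offset=PHRED_OFFSET):
        """Convert a quality string into a list of integer scores."""
        return [ord(char) - offset for char in qual]


    example = [
        "@HWUSI-EAS300R_0005_FC62TL2AAXX:8:30:18447:12115#0/1",
        "CGTAGCTGTGTGTACAAGGCCCGGGAACGTATTCACCGTG",
        "+HWUSI-EAS300R_0005_FC62TL2AAXX:8:30:18447:12115#0/1",
        r"acdd^aa_Z^d^ddc`^_Q_aaa`_ddc\dfdffff\fff",
    ]

    for title, seq, qual in read_fastq(example):
        barcode, insert = split_barcode(seq)
        scores = decode_quality(qual)
        print(barcode, len(insert), min(scores), max(scores))

For the compressed example file, the same generator could be fed from ``gzip.open(path, "rt")``, assuming the file is gzip-compressed; the compression format is not stated above, so check it before relying on this.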