School of Medicine, University of Washington • Box 357735 • 1705 NE Pacific St • Seattle WA 98195

Documentation — Xpression 1.0rc1 documentation

# Documentation

Xpression can be run from the command line or from the GUI. This section documents the command-line scripts themselves, as opposed to the GUI Xpression user guide. The material overlaps, since the GUI simply wraps the scripts, but more detail can be found here because it is generated from the source itself.

To see command-line options for the main script Xpression.py, simply run Xpression without any arguments.

## Pipeline overview

Xpression.xpression(input_file='', reference_fasta='', reference_genbank='', sample_name='RNAseq_xpression', sample_barcode=None, input_type='fastq', step='1234', sample_strand_specific='no', sample_read_seq_direction='reverse_complement', compress='yes', allowed_mismatches='2', number_of_cpu='2', output_dir='/home/colin/Xpression_results', barcode_multiplex='', sample_bioseq_start='1', verbose='')[source]

Process and analyse FASTQ file(s) input_file and write mapping statistics, expression profile and visualization.

These analyses are processed over the following steps:

1. Reads are selected based on quality score and optional barcode.
2. Filtered reads are mapped to the reference genome, and mapping statistics are output.
3. Expression data is calculated from the mapped reads and the GenBank file, which supplies gene boundaries and metadata.
4. Expression data is converted to ‘.wig’ format for visualization in IGV, Artemis, and others.

Note

All arguments are supplied as text and correspond to the command-line long arguments. The pipeline’s design is close to a shell pipeline, where subprocesses are started from a parent and supplied with text positional or keyword arguments. Since most parameters are passed on directly, keeping the input primarily as text avoids conversions and keeps the code somewhat cleaner.
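The text-in, text-out design described above can be sketched with a hypothetical helper (not Xpression’s actual code) that forwards string keyword arguments to a subprocess command line unchanged:

```python
def build_command(script, **kwargs):
    """Forward text keyword arguments to a subprocess command line
    unchanged -- no type conversion, matching the all-text design."""
    cmd = [script]
    for key, value in kwargs.items():
        cmd.extend(['--' + key, value])   # values are already strings
    return cmd

cmd = build_command('extract_reads_from_pools.py',
                    in_fastq_type='fastq', barcode='AAAA')
# cmd == ['extract_reads_from_pools.py',
#         '--in_fastq_type', 'fastq', '--barcode', 'AAAA']
```

Because every value is already a string, the same values can be passed to a child script or used as Python keyword arguments without any round-tripping through typed objects.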

Folders and files specific to input_fastq will be stored in output_dir/input_fastq/

0_splitted_FASTQ/
This folder contains the original FASTQ file, split for easy parallel processing.
1_merged_FASTA/
This folder holds a folder for each barcode given in barcode_multiplex and sample_barcode. Each of these is used to store a file of all quality reads starting with the given barcode in bwa archive format.
sample_name/

One folder for each sample run with Xpression will hold sample-specific files.

These include:
2_sorted_BAM/
The results of mapping this sample’s quality reads to reference_fasta.
3_expression_profile/
One expression profile for each reference mapped to, with loci defined by reference_genbank.
4_visualization/
One visualization file for each reference will be stored here, in .wig format.
mapping_statistic.txt
This file is a table of mapping statistics for the sample, including sequencing quality, mapping rate, and mapping qualities.

First, input_file will be filtered for sequencing quality and separated by barcode in extract_reads_from_pools.

Next, BWA maps the quality reads for each sample to the reference_fasta given. reference_fasta is a file of nucleic-acid genome data in FASTA format. This file may contain multiple references, such as the contigs of a draft genome. All reference names must match those in the reference_genbank file if an expression profile is to be generated. BWA will use number_of_cpu threads and allow up to allowed_mismatches mismatches to the reference.

Note

Unlike some other short-read aligners, BWA accepts indels and mismatches between read and reference. This flexibility allows high mapping rates despite the relatively low quality of current next-gen sequencing, and also allows for point mutations. Its very small memory footprint recommends it as well.

Then the SAM file produced by BWA is used with reference_genbank to count reads over genic and intergenic loci. For a read to be used in further processing, it must have a MAPQ score (Phred-scaled) of 37 or greater. This quality threshold corresponds to the read being uniquely mapped; reads with duplicate mapping locations are ignored.
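The uniqueness rule can be illustrated with a minimal SAM-line check. Xpression itself reads alignments via pysam; this stdlib sketch shows only the threshold logic, and the example alignment lines are invented:

```python
MIN_MAPQ = 37   # BWA reports MAPQ 37 for uniquely mapped reads

def is_unique(sam_line):
    """Return True when the alignment's MAPQ (5th SAM column) meets
    the uniqueness threshold used when counting reads per locus."""
    fields = sam_line.rstrip('\n').split('\t')
    return int(fields[4]) >= MIN_MAPQ

# Invented example alignments: MAPQ 37 (unique) and MAPQ 0 (repeat).
unique = is_unique('r1\t0\tref\t100\t37\t36M\t*\t0\t0\tACGT\tIIII')
repeat = is_unique('r2\t0\tref\t200\t0\t36M\t*\t0\t0\tACGT\tIIII')
# unique is True, repeat is False
```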

Last, a visualization is created in wiggle format from previously stored data corresponding to the expression profile.

## Step 1 - Extract quality reads by barcode

Briefly, reads in in_fastq_file are filtered for quality and separated by barcode by extract_reads(), the results are written to disk by write_results(), and read counts are printed to stdout by print_read_counts() to be collected by the calling function.

Reads are filtered by sequencing quality and barcode (if any), then written to disk.

in_fastq_type can be fastq or fastq-illumina, as defined by Biopython’s SeqIO module; the type specifies the quality-score encoding used in the last line of a FASTQ read:

fastq           Sanger, Illumina 1.1-1.3, 1.8+ | Phred +33 ASCII offset
fastq-illumina  Solexa, Illumina 1.3-1.7       | Phred +64 ASCII offset

FASTQ specification changes over time, with new releases of next-gen sequencing platforms. A FastQTypeError() will be raised if the in_fastq_type is incorrect.

The biologically relevant segment of the read is determined by the position given by bioseq_start. This is a 1-based position, so for a barcode length of 4, using sample_bioseq_start of 5 means the read will be stored starting at the 5th nucleotide in the read.

For example, the following read with barcodes of AAAA,CCCC,GGGG,TTTT and bioseq_start 5:

1234567 ...                        36
AAAACGTACGTACGTACGTACGTACGTACGTACGTA

barcode match: AAAA
relevant read: CGTACGTACGTACGTACGTACGTACGTACGTA

The total count for AAAA will be incremented, and if the quality is high (i.e., the minimum base-wise quality score is >= 20), the read will be added to the AAAA record and the good count for AAAA will be incremented.
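A minimal sketch of this selection rule, assuming the quality test applies to the trimmed segment; select_read is a hypothetical helper, not the actual extract_reads() implementation:

```python
def select_read(seq, quals, barcodes, bioseq_start, min_qual=20):
    """Match a leading barcode, trim to the biologically relevant
    segment (bioseq_start is 1-based), and keep the read only if every
    base quality in that segment is >= min_qual.

    Returns (barcode, trimmed_seq), (barcode, None) for a low-quality
    read, or (None, None) when no barcode matches."""
    for bc in barcodes:
        if seq.startswith(bc):
            trimmed = seq[bioseq_start - 1:]
            if min(quals[bioseq_start - 1:]) >= min_qual:
                return bc, trimmed
            return bc, None
    return None, None

bc, read = select_read('AAAACGTACGTA', [30] * 12,
                       ['AAAA', 'CCCC', 'GGGG', 'TTTT'], 5)
# bc == 'AAAA'; read == 'CGTACGTA'
```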

## Step 2 - Map quality reads to reference

Reads with good quality are mapped to the reference genome by BWA. Options are specified in Xpression.
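As a sketch of this stage, a typical two-step BWA run builds a bwa aln command followed by bwa samse. The exact flags Xpression passes live in its source, so the flags below (-t threads, -n mismatches, -f output file) illustrate common BWA usage rather than transcribe the pipeline:

```python
def bwa_commands(reference, reads_fastq, sai_out, sam_out,
                 threads='2', mismatches='2'):
    """Build the two-stage command lines: bwa aln writes a .sai
    alignment file, then bwa samse converts it to SAM."""
    aln = ['bwa', 'aln', '-t', threads, '-n', mismatches,
           '-f', sai_out, reference, reads_fastq]
    samse = ['bwa', 'samse', '-f', sam_out, reference, sai_out, reads_fastq]
    return aln, samse
```

Each returned list is suitable for subprocess.run(), matching the text-argument design of the pipeline (threads and mismatches stay strings end to end).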

## Step 3 - Count reads per locus

Count reads over reference_annotation_file, print expression profile to sys.stdout, and store a visualisation of this data.

pysam is a Python wrapper for samtools; it is used to count reads mapped to the loci defined by the reference_genbank file.

Locus types depend on whether the RNA-seq method is_strand_specific. For strand-preserving methods (= yes):

sense, antis: genic loci located on top or bottom strand, respectively

igpos, igneg: intergenic loci on top or bottom strand, respectively

For these methods, the result of bench-work steps and sequencing can cause the read_seq_direction to be reported as native or reverse_complement relative to the top-strand of the reference genome.

If strandedness is not maintained – is_strand_specific = no – loci are reported as genic or inter.

Genic regions are defined as being between the coordinates of a feature extracted from the genbank_file by create_gene_annotation_pickle(). Intergenic regions are the spaces between these, and are assigned to a locus as being adjacent to the 5’ end of a gene.
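The genic/intergenic split can be sketched as a position lookup. This is a hypothetical helper that assumes sorted, non-overlapping, top-strand genes, so each intergenic gap abuts the 5’ end of the following gene and is labelled after it:

```python
def classify_position(pos, genes):
    """genes: sorted, non-overlapping (start, end, name) tuples,
    1-based inclusive, assumed to lie on the top strand."""
    for start, end, name in genes:
        if pos < start:
            return ('intergenic', name)   # gap before this gene's 5' end
        if pos <= end:
            return ('genic', name)
    return ('intergenic', None)           # past the last annotated gene

genes = [(100, 500, 'geneA'), (800, 1200, 'geneB')]
# classify_position(300, genes) -> ('genic', 'geneA')
# classify_position(600, genes) -> ('intergenic', 'geneB')
```

Handling bottom-strand genes and the strand-specific locus types (sense/antis, igpos/igneg) would need the feature strand as well, which the real pipeline takes from the GenBank annotation.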

Raw counts are generated, defined by the location of the 5’ end of each read.

Read counts are reported as raw, pM, and pKM, the latter two being normalised by the total ‘unique’ reads reported by count_mapping() during Step 2.

create_gene_annotation_pickle.create_gene_annotation_pickle(in_genbank_file_name, in_pickle_file_name)[source]

Create list of dictionaries from genbank loci entries and store as pickle.

The reference name included as part of in_pickle_file_name is used to search the records in in_genbank_file_name. Any record whose name appears in the pickle file name will be included in the pickle.

Entry types included are: ‘CDS’, ‘tRNA’, ‘rRNA’, ‘misc_RNA’, ‘tmRNA’, and ‘ncRNA’.

This script is intended to create a single pickle for each reference that may be contained in the Genbank file.

For instance, if the genome is made up of multiple contigs, a pickle file will be created for each contig, named after its FASTA entry.

The filename format could be: genome.contigN.gb.annotation.pickle.
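A stdlib-only sketch of the pickling step; the real script parses the GenBank file with Biopython, so the feature dicts and the store_annotation helper here are invented for illustration:

```python
import os
import pickle
import tempfile

# Entry types listed above as counted by Xpression.
WANTED_TYPES = {'CDS', 'tRNA', 'rRNA', 'misc_RNA', 'tmRNA', 'ncRNA'}

def store_annotation(features, pickle_path):
    """Keep only the wanted feature types, then pickle the surviving
    list of dicts to pickle_path."""
    kept = [f for f in features if f['type'] in WANTED_TYPES]
    with open(pickle_path, 'wb') as fh:
        pickle.dump(kept, fh)
    return kept

feats = [{'type': 'CDS',  'locus_tag': 'g1', 'start': 100, 'end': 500},
         {'type': 'gene', 'locus_tag': 'g1'}]   # 'gene' entries are skipped
path = os.path.join(tempfile.mkdtemp(),
                    'genome.contig1.gb.annotation.pickle')
kept = store_annotation(feats, path)
```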

create_gene_annotation_pickle.store_feature_info(feature, all_feature_list)[source]

### Glossary

raw
Number of reads counted within a locus.
pM
reads per million: raw reads normalised by the factor (total unique reads / 1,000,000). The number used is the unique-reads count reported from step 2 in each sample’s mapping_statistic.txt table.
pKM
reads per kilobase-million: pM reads additionally normalised by the factor (locus length / 1000 bp).
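A worked example of these definitions, as a hypothetical helper:

```python
def normalise(raw, total_unique_reads, locus_length_bp):
    """Apply the glossary definitions:
    pM  = raw / (total unique reads / 1,000,000)
    pKM = pM  / (locus length / 1,000 bp)"""
    pM = raw / (total_unique_reads / 1_000_000)
    pKM = pM / (locus_length_bp / 1000)
    return pM, pKM

# 500 raw reads on a 2 kb locus, 2 million unique reads in the sample:
pM, pKM = normalise(500, 2_000_000, 2000)
# pM == 250.0, pKM == 125.0
```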