School of Medicine, University of Washington • Box 357735 • 1705 NE Pacific St • Seattle WA 98195

Documentation — Xpression 1.0rc1 documentation

# Documentation

Xpression can be run from the command line or from the GUI. This section documents the command-line scripts themselves, as opposed to the GUI Xpression user guide. The material overlaps, since the GUI simply wraps the scripts, but more detail can be found here because it is generated from the source itself.

To see command-line options for the main script Xpression.py, simply run Xpression without any arguments.

## Pipeline overview

Xpression.xpression(input_file='', reference_fasta='', reference_genbank='', sample_name='RNAseq_xpression', sample_barcode=None, input_type='fastq', step='1234', sample_strand_specific='no', sample_read_seq_direction='reverse_complement', compress='yes', allowed_mismatches='2', number_of_cpu='2', output_dir='/home/colin/Xpression_results', barcode_multiplex='', sample_bioseq_start='1', verbose='')[source]

Process and analyse FASTQ file(s) input_file and write mapping statistics, expression profile and visualization.

These analyses are processed over the following steps:

1. Reads are selected based on quality score and optional barcode.
2. Filtered reads are mapped to the reference genome, and mapping statistics are output.
3. Expression data is calculated from the mapped reads and the GenBank file, which supplies gene boundaries and metadata.
4. Expression data is converted to ‘.wig’ format for visualization in IGV, Artemis, and others.

Note

All arguments are supplied as text and correspond to the command-line long arguments. The pipeline’s design is close to a shell pipeline, where subprocesses are started from a parent and supplied with text positional or keyword arguments. Since most parameters are passed on directly, keeping the input primarily as text avoids conversions and keeps the code somewhat cleaner.
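The text-in, text-out design described above can be sketched with a hypothetical helper (not Xpression’s actual code) that forwards string keyword arguments to a subprocess command line unchanged:

```python
def build_command(script, **kwargs):
    """Forward text keyword arguments to a subprocess command line
    unchanged -- no type conversion, matching the all-text design."""
    cmd = [script]
    for key, value in kwargs.items():
        cmd.extend(['--' + key, value])   # values are already strings
    return cmd

cmd = build_command('extract_reads_from_pools.py',
                    in_fastq_type='fastq', barcode='AAAA')
# cmd == ['extract_reads_from_pools.py',
#         '--in_fastq_type', 'fastq', '--barcode', 'AAAA']
```

Because every value is already a string, the same values can be passed to a child script or used as Python keyword arguments without any round-tripping through typed objects.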

Folders and files specific to input_fastq will be stored in output_dir/input_fastq/

0_splitted_FASTQ/
This folder contains the original FASTQ file, split for easy parallel processing.
1_merged_FASTA/
This folder holds a folder for each barcode given in barcode_multiplex and sample_barcode. Each of these is used to store a file of all quality reads starting with the given barcode in bwa archive format.
sample_name/

One folder for each sample run with Xpression will hold sample-specific files.

These include:
2_sorted_BAM/
The results of mapping this sample’s quality reads to reference_fasta.
3_expression_profile/
One expression profile for each reference mapped to, with loci defined by reference_genbank.
4_visualization/
One visualization file for each reference will be stored here, in .wig format.
mapping_statistic.txt
This file is a table of mapping statistics for the sample, including sequencing quality, mapping rate, and mapping qualities.

First, input_file will be filtered for sequencing quality and separated by barcode in extract_reads_from_pools.

Next, BWA maps the quality reads for each sample to the reference_fasta given. reference_fasta is a file of nucleic-acid genome data in FASTA format. This file may contain multiple references, such as the contigs of a draft genome. All reference names must match those in the reference_genbank file if an expression profile is to be generated. BWA will use number_of_cpu threads and allow up to allowed_mismatches mismatches to the reference.

Note

Unlike some other short-read aligners, BWA accepts indels and mismatches between read and reference. This flexibility allows high mapping rates despite the relatively low quality of current next-gen sequencing, and also allows for point mutations. Its very small memory footprint recommends it as well.

Then the SAM file produced by BWA is used with reference_genbank to count reads over genic and intergenic loci. For a read to be used in further processing, it must have a MAPQ score (Phred-scaled) of 37 or greater. This quality threshold corresponds to the read being uniquely mapped; reads with duplicate mapping locations are ignored.
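The uniqueness rule can be illustrated with a minimal SAM-line check. Xpression itself reads alignments via pysam; this stdlib sketch shows only the threshold logic, and the example alignment lines are invented:

```python
MIN_MAPQ = 37   # BWA reports MAPQ 37 for uniquely mapped reads

def is_unique(sam_line):
    """Return True when the alignment's MAPQ (5th SAM column) meets
    the uniqueness threshold used when counting reads per locus."""
    fields = sam_line.rstrip('\n').split('\t')
    return int(fields[4]) >= MIN_MAPQ

# Invented example alignments: MAPQ 37 (unique) and MAPQ 0 (repeat).
unique = is_unique('r1\t0\tref\t100\t37\t36M\t*\t0\t0\tACGT\tIIII')
repeat = is_unique('r2\t0\tref\t200\t0\t36M\t*\t0\t0\tACGT\tIIII')
# unique is True, repeat is False
```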

Last, a visualization is created in wiggle format from previously stored data corresponding to the expression profile.

## Step 1 - Extract quality reads by barcode

Briefly, reads in in_fastq_file are filtered for quality and separated by barcode by extract_reads(), the results are written to disk by write_results(), and read counts are printed to stdout by print_read_counts() to be collected by the calling function.

Reads are filtered by sequencing quality and barcode (if any), then written to disk.

in_fastq_type can be fastq or fastq-illumina, as defined by Biopython’s SeqIO module; the type specifies the quality-score encoding used in the last line of a FASTQ read:

fastq           Sanger, Illumina 1.1-1.3, 1.8+ | Phred +33 ASCII offset
fastq-illumina  Solexa, Illumina 1.3-1.7       | Phred +64 ASCII offset

FASTQ specification changes over time, with new releases of next-gen sequencing platforms. A FastQTypeError() will be raised if the in_fastq_type is incorrect.

The biologically relevant segment of the read is determined by the position given by bioseq_start. This is a 1-based position, so for a barcode length of 4, using sample_bioseq_start of 5 means the read will be stored starting at the 5th nucleotide in the read.

For example, the following read with barcodes of AAAA,CCCC,GGGG,TTTT and bioseq_start 5:

1234567 ...                        36
AAAACGTACGTACGTACGTACGTACGTACGTACGTA

barcode match: AAAA
relevant read: CGTACGTACGTACGTACGTACGTACGTACGTA

The total count for AAAA will be incremented, and if the quality is high (i.e., the minimum base-wise quality score is >= 20), the read will be added to the AAAA record and the good count for AAAA will be incremented.
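A minimal sketch of this selection rule, assuming the quality test applies to the trimmed segment; select_read is a hypothetical helper, not the actual extract_reads() implementation:

```python
def select_read(seq, quals, barcodes, bioseq_start, min_qual=20):
    """Match a leading barcode, trim to the biologically relevant
    segment (bioseq_start is 1-based), and keep the read only if every
    base quality in that segment is >= min_qual.

    Returns (barcode, trimmed_seq), (barcode, None) for a low-quality
    read, or (None, None) when no barcode matches."""
    for bc in barcodes:
        if seq.startswith(bc):
            trimmed = seq[bioseq_start - 1:]
            if min(quals[bioseq_start - 1:]) >= min_qual:
                return bc, trimmed
            return bc, None
    return None, None

bc, read = select_read('AAAACGTACGTA', [30] * 12,
                       ['AAAA', 'CCCC', 'GGGG', 'TTTT'], 5)
# bc == 'AAAA'; read == 'CGTACGTA'
```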

## Step 2 - Map quality reads to reference

Reads with good quality are mapped to the reference genome by BWA. Options are specified in Xpression.
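As a sketch of this stage, a typical two-step BWA run builds a bwa aln command followed by bwa samse. The exact flags Xpression passes live in its source, so the flags below (-t threads, -n mismatches, -f output file) illustrate common BWA usage rather than transcribe the pipeline:

```python
def bwa_commands(reference, reads_fastq, sai_out, sam_out,
                 threads='2', mismatches='2'):
    """Build the two-stage command lines: bwa aln writes a .sai
    alignment file, then bwa samse converts it to SAM."""
    aln = ['bwa', 'aln', '-t', threads, '-n', mismatches,
           '-f', sai_out, reference, reads_fastq]
    samse = ['bwa', 'samse', '-f', sam_out, reference, sai_out, reads_fastq]
    return aln, samse
```

Each returned list is suitable for subprocess.run(), matching the text-argument design of the pipeline (threads and mismatches stay strings end to end).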

## Step 3 - Count reads per locus

Count reads over reference_annotation_file, print expression profile to sys.stdout, and store a visualisation of this data.

pysam is a Python wrapper for samtools; it is used to count reads mapped to the loci defined by the reference_genbank file.

Locus types depend on whether the RNA-seq method is_strand_specific. For strand-preserving methods (= yes):

sense, antis: genic loci located on top or bottom strand, respectively

igpos, igneg: intergenic loci on top or bottom strand, respectively

For these methods, the result of bench-work steps and sequencing can cause the read_seq_direction to be reported as native or reverse_complement relative to the top-strand of the reference genome.

If strandedness is not maintained – is_strand_specific = no – loci are reported as genic or inter.

Genic regions are defined as being between the coordinates of a feature extracted from the genbank_file by create_gene_annotation_pickle(). Intergenic regions are the spaces between these, and are assigned to a locus as being adjacent to the 5’ end of a gene.
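The genic/intergenic split can be sketched as a position lookup. This is a hypothetical helper that assumes sorted, non-overlapping, top-strand genes, so each intergenic gap abuts the 5’ end of the following gene and is labelled after it:

```python
def classify_position(pos, genes):
    """genes: sorted, non-overlapping (start, end, name) tuples,
    1-based inclusive, assumed to lie on the top strand."""
    for start, end, name in genes:
        if pos < start:
            return ('intergenic', name)   # gap before this gene's 5' end
        if pos <= end:
            return ('genic', name)
    return ('intergenic', None)           # past the last annotated gene

genes = [(100, 500, 'geneA'), (800, 1200, 'geneB')]
# classify_position(300, genes) -> ('genic', 'geneA')
# classify_position(600, genes) -> ('intergenic', 'geneB')
```

Handling bottom-strand genes and the strand-specific locus types (sense/antis, igpos/igneg) would need the feature strand as well, which the real pipeline takes from the GenBank annotation.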

Raw counts are generated, defined by the location of the 5’ end of each read.

Read counts are reported as raw, pM, and pKM, the latter two being normalised by the total ‘unique’ reads reported by count_mapping() during Step 2.

create_gene_annotation_pickle.create_gene_annotation_pickle(in_genbank_file_name, in_pickle_file_name)[source]

Create list of dictionaries from genbank loci entries and store as pickle.

The reference name included as part of in_pickle_file_name is used to search the records in in_genbank_file_name. Any record whose name appears in the pickle file name will be included in the pickle.

Entry types included are: ‘CDS’, ‘tRNA’, ‘rRNA’, ‘misc_RNA’, ‘tmRNA’, and ‘ncRNA’.

This script is intended to create a single pickle for each reference that may be contained in the Genbank file.

For instance, if the genome is made up of multiple contigs, a pickle file will be created for each contig, named after its FASTA entry.

The filename format could be: genome.contigN.gb.annotation.pickle.
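A stdlib-only sketch of the pickling step; the real script parses the GenBank file with Biopython, so the feature dicts and the store_annotation helper here are invented for illustration:

```python
import os
import pickle
import tempfile

# Entry types listed above as counted by Xpression.
WANTED_TYPES = {'CDS', 'tRNA', 'rRNA', 'misc_RNA', 'tmRNA', 'ncRNA'}

def store_annotation(features, pickle_path):
    """Keep only the wanted feature types, then pickle the surviving
    list of dicts to pickle_path."""
    kept = [f for f in features if f['type'] in WANTED_TYPES]
    with open(pickle_path, 'wb') as fh:
        pickle.dump(kept, fh)
    return kept

feats = [{'type': 'CDS',  'locus_tag': 'g1', 'start': 100, 'end': 500},
         {'type': 'gene', 'locus_tag': 'g1'}]   # 'gene' entries are skipped
path = os.path.join(tempfile.mkdtemp(),
                    'genome.contig1.gb.annotation.pickle')
kept = store_annotation(feats, path)
```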

create_gene_annotation_pickle.store_feature_info(feature, all_feature_list)[source]

### Glossary

raw
Number of reads counted within a locus.
pM
reads per million: raw reads normalised by the factor (total unique reads / 1,000,000). The number used is the unique-reads count reported from step 2 in each sample’s mapping_statistic.txt table.
pKM
reads per kilobase-million: pM reads additionally normalised by the factor (locus length / 1000 bp).
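A worked example of these definitions, as a hypothetical helper:

```python
def normalise(raw, total_unique_reads, locus_length_bp):
    """Apply the glossary definitions:
    pM  = raw / (total unique reads / 1,000,000)
    pKM = pM  / (locus length / 1,000 bp)"""
    pM = raw / (total_unique_reads / 1_000_000)
    pKM = pM / (locus_length_bp / 1000)
    return pM, pKM

# 500 raw reads on a 2 kb locus, 2 million unique reads in the sample:
pM, pKM = normalise(500, 2_000_000, 2000)
# pM == 250.0, pKM == 125.0
```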