Xpression can be run from the command-line and the GUI. This section provides documentation for the command-line scripts themselves, as opposed to the GUI Xpression user guide. The material overlaps since the GUI simply wraps the scripts, although more detail can be found here since it is generated from the source itself.
To see command-line options for the main script Xpression.py, simply run Xpression without any arguments.
Process and analyse FASTQ file(s) input_file and write mapping statistics, expression profile and visualization.
These analyses are processed over the following steps:
Reads are selected based on quality score and optional barcode.
Filtered reads are mapped to reference genome. Mapping statistics output.
Expression data is calculated from mapped reads and genbank file supplying gene boundaries and metadata.
Expression data is converted to ‘.wig’ format for visualization in IGV, Artemis, and others.
Note
All arguments are supplied as text, and correspond to command-line long-arguments. This pipeline’s design is very close to a shell pipeline, where subprocesses are started from a parent and supplied with text positional or keyword arguments. Since most parameters will be passed on directly, keeping the input primarily as text allows less conversion to occur, allowing the code to be somewhat more clean.
Folders and files specfic to input_fastq with be stored in output_dir/input_fastq/
0_splitted_FASTQ/
This folder contains the original FASTQ file, split for easy parallel processing.
1_merged_FASTA/
This folder holds a folder for each barcode given in barcode_multiplex and sample_barcode. Each of these is used to store a file of all quality reads starting with the given barcode in bwa archive format.
sample_name/
One folder for each sample run with Xpression will hold sample-specific files.
These include:
2_sorted_BAM/
The results of mapping this sample’s quality reads to reference_fasta.
3_expression_profile/
One expression profile for each reference mapped to, with loci defined by reference_genbank.
4_visualization/
One visualization file for each reference will be stored here, in .wig format.
mapping_statistic.txt
This file is a table of mapping statistics of the sample, with sequencing quality, mapping rate, and mapping qualities.
First, input_file will be filtered for sequencing quality and separated by barcode in extract_reads_from_pools.
Next, BWA maps the quality reads for each sample to the reference_file given. reference_fasta is a file of nucleic genome data in fasta format. This file may contain multiple references, such as the contigs in a draft genome. All reference names must match those located in the reference_genbank file if an expression profile is to be generated. BWA will use threads number_of_cpu and mismatches to the reference given by allowed_mismatches.
Note
Unlike some other short-read aligners, BWA accepts in/dels and mismatches between read and reference. This flexibility allows high mapping to occur despite the relatively low quality of current next-gen sequencing, as well as allowing for point mutations. Its very small memory footprint recomends it as well.
Then the SAM file produced by BWA is used with reference_genbank to count reads over genic and intergenic loci. For a read to be used in further processing, it must have a MAPQ score (Phred-scale) of 37 or greater. This quality threshold corresponds to read being unique, where reads with duplicate mapping locations are ignored.
Last, a visualization is created in wiggle format from previously stored data corresponding to the expression profile.
Briefly, reads in in_fastq_file are filtered for quality and separated by barcode by
extract_reads(), these results written to disk by write_results(), and read
counts are printed to stdout by print_read_counts() to be collected by the calling
function.
Reads are filtered by sequencing quality and barcode, if any and written to disk.
in_fastq_type can be fastq or fastq-illumina as defined by Biopython’s
SeqIO module which specifies the quality score encoding
in the last line of a fastq read:
FASTQ specification changes over
time, with new releases of next-gen sequencing platforms. A FastQTypeError() will be
raised if the in_fastq_type is incorrect.
The biologically relevant segment of the read is determined by the position given by
bioseq_start. This is a 1-based position, so for a barcode length of 4, using
sample_bioseq_start of 5 means the read will be stored starting at the 5th nucleotide in
the read.
For example, the following read with barcodes of AAAA,CCCC,GGGG,TTTT and
bioseq_start 5:
the total count for AAAA will be incremented, and if the quality is high,
(ie, minimum base-wise quality score >= 20) the read will be added to the AAAA record and
the good count for AAAA will be incremented.
Count reads over reference_annotation_file, print expression profile to sys.stdout, and store a visualisation of this data.
pysam is a python wrapper of
samtools which is used to count reads mapped to the loci
defined by reference_genbank file.
Loci types depend on whether the RNA-seq method is_strand_specific. For strand-preserving
methods = yes:
sense, antis: genic loci located on top or bottom strand, respectively
igpos, igneg: intergenic loci on top or bottom strand, respectively
For these methods, the result of bench-work steps and sequencing can cause the
read_seq_direction to be reported as native or reverse_complement relative to the
top-strand of the reference genome.
If strandedness is not maintained – is_strand_specific = no – loci are reported as
genic or inter.
Genic regions are defined as being between the coordinates of a feature extracted from the
genbank_file by create_gene_annotation_pickle(). Intergenic regions are the spaces between
these, and are assigned to a locus as being adjacent to the 5’ end of a gene.
Raw counts are generated, defined by the location of the 5’-end of each primer.
Read counts are reported as raw, pM, and pKM, the latter two being
normalised by the total ‘unique’ reads reported by count_mapping() during Step 2.
Create list of dictionaries from genbank loci entries and store as pickle.
The reference name included as part of in_pickle_file_name is used to search
records in in_genbank_file_name. Any record included in the pickle file name
will be included in the pickle.
Entry types included are: ‘CDS’, ‘tRNA’, ‘rRNA’, ‘misc_RNA’, ‘tmRNA’, and ‘ncRNA’.
This script is intended to create a single pickle for each reference that may be
contained in the Genbank file.
For instance, if the genome is made up of multiple contigs, for each contig,
corresponding to its fasta entry name, a pickle file will be created.
The filename format could be: genome.contigN.gb.annotation.pickle.
reads per million is raw reads normalised to the factor of (total unique reads / 1 million reads). The number used is reported from step 2 as uniquereads in the mapping_statistic.txt table for each sample.
pKM
reads per kilobase-million is pM reads additionally normalised to the factor of (loci length / 1000 bp).