Here we demonstrate automating a traditional next generation sequencing (NGS) analysis pipeline using Python. The steps for NGS analysis are shown below.
Step | Description | Tools |
---|---|---|
1. Download Data | Fetch raw sequencing data | SRA Toolkit, Python |
2. Quality Control | Assess data quality | FastQC |
3. Preprocessing | Trim adapters and filter low-quality reads | Trimmomatic, Cutadapt |
4. Alignment | Align reads to a reference genome | BWA, Bowtie2 |
5. BAM Processing | Convert, sort, and index BAM files | Samtools |
6. Variant Calling | Identify SNPs and indels | FreeBayes, GATK |
7. Annotation | Add biological context to variants | ANNOVAR, SnpEff |
8. Visualization | Visualize and generate reports | Pandas, Matplotlib |
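Since each step wraps a command-line tool, the Python automation largely consists of invoking those tools as subprocesses. A minimal sketch of such a step runner is shown below; the `echo` call is only a stand-in for a real tool such as `prefetch`:

```python
import shlex
import subprocess

def run_step(cmd):
    """Run one pipeline step, echoing the command and failing fast on errors."""
    print(f"[pipeline] {shlex.join(cmd)}")
    return subprocess.run(cmd, check=True, capture_output=True, text=True)

# Stand-in invocation; in the real pipeline this would be e.g.
#   run_step(["prefetch", "SRR31041149"])
result = run_step(["echo", "step complete"])
print(result.stdout.strip())
```

With `check=True`, a non-zero exit code from any tool raises `CalledProcessError`, which stops the notebook instead of letting a failed step silently corrupt downstream results.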
The Sequence Read Archive (SRA) is the largest publicly available repository of high-throughput sequencing data.
SRA Database: Here.
We'll use E. coli as the genome of choice: its small genome keeps the memory footprint low, making the demonstration efficient.
E. coli Sequencing Run ID: SRR31041149
The NGS analysis Python code is presented in a Jupyter Notebook. This article explains the general principles, and the notebook demonstrates implementing the code at each step.
We'll be using some binaries for processing and filtering sequencing reads in FASTQ files.
SRA Toolkit Download: Here.
In NGS, the original genome or transcript is randomly fragmented into small pieces, and these pieces are sequenced. A sequence and its accompanying base quality scores constitute a single read, and a single FASTQ file contains many reads.
Note: Reads do not represent contiguous segments of the original sequence, since they are randomly fragmented.
To mitigate gaps in sequence coverage, NGS workflows utilize high coverage and overlapping reads to recover missing information.
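Expected coverage can be roughly estimated with the Lander-Waterman relation C = L x N / G (read length times number of reads, divided by genome length). A quick sanity check with illustrative numbers (not measured from SRR31041149):

```python
read_length = 100          # bp per read (typical Illumina short read)
num_reads = 2_000_000      # hypothetical read count, for illustration only
genome_length = 4_600_000  # the E. coli K-12 genome is roughly 4.6 Mb

# Lander-Waterman expected coverage: C = L * N / G
coverage = read_length * num_reads / genome_length
print(f"expected coverage ~ {coverage:.1f}x")
```

High coverage like this is what lets overlapping reads fill the gaps left by random fragmentation.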
There are two types of accession numbers that one might see on NCBI's SRA database: SRR and SRX.
SRR (SRA Run): identifies the actual sequencing data files produced by one run.
SRX (SRA Experiment): describes how the sequencing was performed (library, platform, sample), and may group multiple runs.
It is common to see file extensions such as FASTQ or FASTA. Both contain nucleotide sequences, but FASTQ also stores a quality score for each base in the same file.
Quality scores are represented by ASCII characters, ordered here from lowest to highest quality:

```
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
```
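These characters encode Phred quality scores. In the Phred+33 scheme used by modern FASTQ files, a character's score is its ASCII code minus 33, so `!` is the worst (Q0) and `~` the best (Q93). A quick decoder:

```python
def phred33_scores(qual_string):
    """Decode a FASTQ quality string under Phred+33: quality = ASCII code - 33."""
    return [ord(c) - 33 for c in qual_string]

print(phred33_scores("!5I~"))  # [0, 20, 40, 93]
```

Q20 means a 1-in-100 chance the base call is wrong; Q40 means 1 in 10,000.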
More on bioinformatics file formats:
https://bioinformaticsworkbook.org/introduction/fileFormats.html#gsc.tab=0
Traditionally, long sequences are split into multiple reads within the same file, with each read spanning 4 lines: a header line beginning with "@", the sequence, a "+" separator line, and the quality string. This gets confusing to account for in FASTQ, since "@" and "+" can also legitimately appear as quality-score ASCII characters.
Today, Illumina's platforms produce high-quality short reads (~100 base pairs), so every 4 lines in a FASTQ file form one read covering a ~100 bp fragment of the larger sequence.
Example of 1 read in a FASTQ file:

```
@HISEQ:402:H147CADXX:1:1101:1250:2208 1:N:0:CGATGT
TGATGCTGCNAATTTTATTCAGTCAGCGGAGGGGGCTTACGTGTATTTTCTGCAACCTTT
+
CCCFFFFFH#4AFIJJJJJJJJIJJJJJJJJJJJJJJJJJJHHHHHHFFFFFFFEEEEED
```
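Parsing such a record in Python is straightforward. Note how the lone uncalled `N` base lines up with the low-quality `#` (Q2) character in the quality string:

```python
# The example read from above; a FASTQ record is exactly 4 lines.
record = """@HISEQ:402:H147CADXX:1:1101:1250:2208 1:N:0:CGATGT
TGATGCTGCNAATTTTATTCAGTCAGCGGAGGGGGCTTACGTGTATTTTCTGCAACCTTT
+
CCCFFFFFH#4AFIJJJJJJJJIJJJJJJJJJJJJJJJJJJHHHHHHFFFFFFFEEEEED"""

header, seq, separator, qual = record.splitlines()
scores = [ord(c) - 33 for c in qual]   # Phred+33 decoding
assert len(seq) == len(qual)           # one quality score per base
worst = scores.index(min(scores))
print(f"{len(seq)} bp; worst base: {seq[worst]} at Q{scores[worst]}")
```

Libraries such as Biopython offer robust FASTQ parsers, but the 4-line structure makes a hand-rolled parser viable for simple tasks.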
Adapters: Short DNA sequences ligated to ends of DNA/RNA fragments during library preparation.
Adapters allow for fragments to bind to sequencing flow cell and serve as priming sites for sequencing.
Adapter contamination occurs when adapter sequences are not removed from the sequencing reads. This causes adapter sequences to appear in the raw data and affects downstream analyses including alignment, assembly, and variant calling.
Causes:
Over-amplification in PCR: If PCR runs for too many cycles, the adapter-ligated sequences are replicated too many times in sequence, which can leave multiple adapter copies in a single read.
Before the Quality Control step with FastQC, we need to convert the downloaded .sra file to the .fastq format. Use fastq-dump from the SRA Toolkit for this task.
For paired-end reads, the output above will yield 2 .fastq files. One file corresponds to the forward read, and the other to the reverse read.
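A sketch of the conversion step, with the command built in Python; `--split-files` is what produces the two per-mate files:

```python
import shlex

# Accession from above; --split-files writes one FASTQ per mate for paired-end runs.
cmd = ["fastq-dump", "--split-files", "SRR31041149"]
print(shlex.join(cmd))
# To execute (requires the SRA Toolkit on PATH):
#   subprocess.run(cmd, check=True)
# which writes SRR31041149_1.fastq and SRR31041149_2.fastq
```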
A report of quality statistics is presented within the output HTML file for each paired-end read.
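The FastQC invocation itself is simple; the input filenames below assume the fastq-dump split-files output, and the output directory name is an arbitrary choice:

```python
import shlex

# -o selects the output directory (it must already exist).
cmd = ["fastqc", "SRR31041149_1.fastq", "SRR31041149_2.fastq", "-o", "qc_reports"]
print(shlex.join(cmd))
# subprocess.run(cmd, check=True) writes one .html report (plus a .zip) per input file.
```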
Interpreting the HTML Report
After generating the HTML report with FastQC, if adapter contamination is present we can apply Trimmomatic (a standalone Java tool, not part of the SRA Toolkit) to remove adapters and low-quality bases at the read ends.
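A sketch of a paired-end Trimmomatic invocation, following the canonical example from its documentation; the jar filename, adapter file (TruSeq3-PE.fa), and output names are assumptions you would adjust for your install:

```python
import shlex

cmd = [
    "java", "-jar", "trimmomatic-0.39.jar", "PE",
    "SRR31041149_1.fastq", "SRR31041149_2.fastq",   # forward / reverse inputs
    "out_1_paired.fastq", "out_1_unpaired.fastq",   # forward outputs
    "out_2_paired.fastq", "out_2_unpaired.fastq",   # reverse outputs
    "ILLUMINACLIP:TruSeq3-PE.fa:2:30:10",           # clip adapter sequences
    "LEADING:3", "TRAILING:3",                      # trim low-quality read ends
    "SLIDINGWINDOW:4:15",                           # cut when 4-base window mean Q < 15
    "MINLEN:36",                                    # drop reads shorter than 36 bp
]
print(shlex.join(cmd))
```

The "unpaired" outputs hold reads whose mate was dropped during trimming; only the paired outputs feed into alignment.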
For the alignment step, we'll use the provided E. coli sequence as the reference genome.
Then we need to index the reference genome so that the alignment software can align the trimmed, paired reads quickly; otherwise, it would have to linearly scan the entire reference genome for every read.
Indexing Steps
Index files allow alignment software to perform fast lookups and avoid linear searching for every read.
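With BWA, indexing is a single command; a samtools FASTA index is also commonly built for downstream tools. The reference filename here is an assumption:

```python
import shlex

ref = "ecoli_reference.fasta"     # assumed reference filename
cmds = [
    ["bwa", "index", ref],        # builds BWA's index files alongside the FASTA
    ["samtools", "faidx", ref],   # .fai index, used by variant callers and viewers
]
for cmd in cmds:
    print(shlex.join(cmd))
```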
After alignment, convert the sequence alignment map (SAM) to a binary alignment map (BAM) for compact storage and faster processing.
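The alignment and BAM-processing steps chain together as below (filenames assumed from the earlier steps); `bwa mem` emits SAM, which samtools converts, sorts, and indexes:

```python
# Shell command strings as they would run in sequence; shown here rather than executed.
cmds = [
    "bwa mem ecoli_reference.fasta out_1_paired.fastq out_2_paired.fastq > aln.sam",
    "samtools view -b aln.sam -o aln.bam",       # SAM -> BAM conversion
    "samtools sort aln.bam -o aln.sorted.bam",   # coordinate-sort for indexing
    "samtools index aln.sorted.bam",             # .bai index enables fast region lookups
]
for c in cmds:
    print(c)
```

Sorting before indexing is required: `samtools index` only works on coordinate-sorted BAM files.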
Then, filter variants by quality score: we don't want false positives when calling variants.
Summary statistics for the filtered variants can be viewed using bcftools.
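Using FreeBayes from the tool list above, the calling, filtering, and statistics steps can be sketched as follows; the QUAL>20 cutoff is a common but arbitrary choice, and filenames are assumed:

```python
# Shell command strings for the variant-calling stage; shown here rather than executed.
cmds = [
    "freebayes -f ecoli_reference.fasta aln.sorted.bam > variants.vcf",
    "bcftools filter -i 'QUAL>20' variants.vcf -o variants.filtered.vcf",  # keep confident calls
    "bcftools stats variants.filtered.vcf > variants.stats.txt",           # summary statistics
]
for c in cmds:
    print(c)
```

QUAL is Phred-scaled, so QUAL>20 keeps variants with less than a 1% estimated chance of being a false call.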
The Integrative Genomics Viewer (IGV) tool allows us to visualize aligned reads and called variants. Simply load the aligned BAM file and the reference genome, then overlay the VCF file to identify variant locations.
The discussion above, together with the associated Jupyter Notebook code, demonstrated a traditional statistical approach to identifying variants in a genomic sequence relative to a reference genome.
Today there are machine learning based variant callers, such as Google's DeepVariant, which uses convolutional neural networks (CNNs) to call SNPs and indels (insertions/deletions). Another ML-based variant caller is Clair3.
In addition to identifying variants, machine learning models today can also: