Sequence alignment One

1.1 Introduction

Bowtie 2

Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to 100s or 1,000s of characters to relatively long (e.g. mammalian) genomes. Bowtie 2 indexes the genome with an FM Index (based on the Burrows-Wheeler Transform or BWT) to keep its memory footprint small: for the human genome, its memory footprint is typically around 3.2 gigabytes of RAM. Bowtie 2 supports gapped, local, and paired-end alignment modes. Multiple processors can be used simultaneously to achieve greater alignment speed. Bowtie 2 outputs alignments in SAM format, enabling interoperation with a large number of other tools (e.g. SAMtools, GATK) that use SAM. Bowtie 2 is distributed under the GPLv3 license, and it runs on the command line under Windows, Mac OS X and Linux.

Bowtie 2 is often the first step in pipelines for comparative genomics, including variation calling, ChIP-seq, RNA-seq, and BS-seq. Bowtie 2 and Bowtie (also called "Bowtie 1" here) are also tightly integrated into several other tools, including TopHat (a fast splice junction mapper for RNA-seq reads), Cufflinks (a tool for transcriptome assembly and isoform quantitation from RNA-seq reads), Crossbow (a cloud-enabled software tool for analyzing resequencing data), and Myrna (a cloud-enabled software tool for aligning RNA-seq reads and measuring differential gene expression).

Question

How is Bowtie 2 different from Bowtie 1 and other alignment software?

1.2 The bowtie2-build indexer

bowtie2-build

bowtie2-build builds a Bowtie index from a set of DNA sequences. It outputs a set of 6 files with the suffixes .1.bt2, .2.bt2, .3.bt2, .4.bt2, .rev.1.bt2, and .rev.2.bt2. For a large index, these suffixes end in .bt2l instead of .bt2. Together, these files constitute the index: they are all that is needed to align reads to that reference. The original FASTA sequence files are no longer used by Bowtie 2 once the index is built.

Usage:

% bowtie2-build [options]* <reference_in> <bt2_base>

Get description and help:

% bowtie2-build

1.3 Indexing a reference genome  

bowtie2-build

We have already downloaded the reference genome of Lambda phage to the /home directory. Use the less command to view the sequence. (Tip: to exit less, press the "q" or "Q" key.)

% less /home/lambda_virus.fa

To create an index for the Lambda phage reference genome, create a new temporary directory (you can use your current working directory), change into that directory, and run:

% bowtie2-build /home/lambda_virus.fa lambda_virus

The command should print many lines of output and then quit. When the command completes, the current directory will contain six new files that all start with lambda_virus and end with .1.bt2, .2.bt2, .3.bt2, .4.bt2, .rev.1.bt2, and .rev.2.bt2. These files constitute the index - you're done!
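One way to confirm that the build finished cleanly is to check for each of the six expected files by name. The sketch below assumes a small index (.bt2 suffixes) with the base name lambda_virus in the current directory; the function name missing_index_files is made up for this example and is not part of Bowtie 2.

```shell
# Hypothetical helper: print every expected small-index file that is
# missing for the given index base name (prints nothing if all exist).
missing_index_files() {
    base=$1
    for suffix in 1.bt2 2.bt2 3.bt2 4.bt2 rev.1.bt2 rev.2.bt2; do
        [ -e "${base}.${suffix}" ] || echo "${base}.${suffix}"
    done
}

# Check the index built in this section.
missing=$(missing_index_files lambda_virus)
if [ -z "$missing" ]; then
    echo "All six index files are present."
else
    echo "Missing index files:"
    echo "$missing"
fi
```

If any file is reported as missing, rerun bowtie2-build and read its output for error messages.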

You can use bowtie2-build to create an index for a set of FASTA files obtained from any source, including sites such as UCSC, NCBI, and Ensembl. When indexing multiple FASTA files, specify all the files using commas to separate file names. For more details on how to create an index with bowtie2-build, see the manual section on index building. You may also want to bypass this process by obtaining a pre-built index. See using a pre-built index below for an example.
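As a minimal sketch of the comma-separated form, the snippet below joins several FASTA file names into a single argument. The file names (chr1.fa, chr2.fa, chrM.fa) and the index base name my_genome are placeholders chosen for illustration, not files provided with this tutorial. The indexer is only called if it is installed and all input files exist.

```shell
# Placeholder file names; substitute your own FASTA files.
fasta_files="chr1.fa chr2.fa chrM.fa"

# Join the names with commas, the separator bowtie2-build expects
# when several reference files are given.
reference_list=$(echo $fasta_files | tr ' ' ',')
echo "$reference_list"

# Build the index only if bowtie2-build is available and every input exists.
if command -v bowtie2-build >/dev/null 2>&1 && ls $fasta_files >/dev/null 2>&1; then
    bowtie2-build "$reference_list" my_genome
fi
```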

Exercise 1a

Use a Linux command to list these index files.

1.4 Extracting summary information from a Bowtie index

bowtie2-inspect

bowtie2-inspect extracts information from a Bowtie index about what kind of index it is and what reference sequences were used to build it. When run without any options, the tool will output a FASTA file containing the sequences of the original references (with all non-A/C/G/T characters converted to Ns). It can also be used to extract just the reference sequence names using the -n/--names option or a more verbose summary using the -s/--summary option.

% bowtie2-inspect -s lambda_virus
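The three modes described above can be tried side by side. This is a sketch that assumes the lambda_virus index from section 1.3 is in the current directory; the output file name lambda_virus.recovered.fa is an arbitrary choice for this example.

```shell
index_base="lambda_virus"

# Run only if the tool is installed and the index appears to exist.
if command -v bowtie2-inspect >/dev/null 2>&1 && [ -e "${index_base}.1.bt2" ]; then
    # Reference sequence names only.
    bowtie2-inspect -n "$index_base"
    # Verbose summary of the index.
    bowtie2-inspect -s "$index_base"
    # No options: write the reference sequences back out as FASTA.
    bowtie2-inspect "$index_base" > "${index_base}.recovered.fa"
else
    echo "bowtie2-inspect or the ${index_base} index was not found; skipping."
fi
```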

Exercise 1b

In the bowtie2-inspect result, what is the length of the reference genome? Use wc -c to count the characters in the original lambda_virus.fa file and compare the two numbers.
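As a hint for the comparison: wc -c counts every byte in a file, including the FASTA header line and the newline at the end of each line. The sketch below illustrates the difference on a small made-up FASTA file, not on the Lambda phage genome; the same pipeline can then be pointed at /home/lambda_virus.fa.

```shell
# Toy FASTA file with a made-up sequence, so the example runs anywhere.
tmp_fasta=$(mktemp)
printf '>toy_sequence\nACGTACGT\nACGT\n' > "$tmp_fasta"

# wc -c counts every byte: header line, bases, and newline characters.
total_bytes=$(wc -c < "$tmp_fasta" | tr -d ' ')

# Dropping header lines and newlines leaves only the bases.
base_count=$(grep -v '^>' "$tmp_fasta" | tr -d '\n' | wc -c | tr -d ' ')

echo "bytes in file: $total_bytes"
echo "bases only:    $base_count"
rm -f "$tmp_fasta"
```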

Summary

bowtie2-build builds an index from a reference genome.
bowtie2-inspect extracts summary information about an indexed reference genome.

tongyinbio@hku.hk bbru@hku.hk 13th-Feb 2017