Latest Release

Crossbow 1.2.1	5/30/13
Please cite: Langmead B, Schatz MC, Lin J, Pop M, Salzberg SL. Searching for SNPs with cloud computing. Genome Biol 10:R134.
For release updates, subscribe to the mailing list.

Related Tools

Bowtie: Ultrafast short read alignment

Hadoop: Open Source MapReduce

Contrail: Cloud-based de novo assembly

CloudBurst: Sensitive MapReduce alignment

Myrna: Cloud, differential gene expression

Tophat: RNA-Seq splice junction mapper

Cufflinks: Isoform assembly, quantitation

SoapSNP: Accurate SNP/consensus calling

Reference jars

H. sapiens: hg18/dbSNP 130

s3n://crossbow-refs/hg18.jar

M. musculus: mm9/dbSNP 128

s3n://crossbow-refs/mm9.jar

E. coli: O157:H7, NCBI (no SNPs)

s3n://crossbow-refs/e_coli.jar

Mouse Chromosome 17 Example

This example guides you through (a) preprocessing and copying a public short-read dataset from the NCBI Short Read Archive into Amazon S3, (b) creating a reference jar using public data from dbSNP and UCSC, then (c) running a Crossbow job that aligns and calls SNPs from that dataset. The datasets used here are for M. musculus chromosome 17. This example is intended to show how each step of the process works; it does not require much time or money to run. This example is not intended to highlight Crossbow's speed or scalability. Those features are far better demonstrated using much larger, whole-genome datasets.

This example assumes that you have already set up your AWS accounts and credentials as described in the Crossbow Manual.

Reads are taken from a mouse resequencing study by Ian Sudbery and colleagues.

Step 1. Preprocess and copy the reads from the ERA

The quickest way to get started with a copy/preprocessing job is to run the cb-copy-interactive script and answer the prompts. When asked for the manifest file, specify the copy.manifest file in the examples/mouse17 subdirectory of the Crossbow directory. Alternatively, use this command:

cb-copy-local \
    -n 5 \
    -r 50 \
    -i <path-to-examples/mouse17>/copy.manifest \
    -o s3n://<read-bucket-name>/mm9chr17 \
    -m 2 \
    -t c1.medium \
    -M 500000 \
    -c crossbowcopy<your-account-#-without-dashes>

These are reasonable defaults. In our experiments, this job took about 15 minutes and cost about $2-3. Remember to terminate it when it completes.

Once the job is complete, you should see that the destination directory in S3 contains a set of files (about 115 of them if you used -M 500000) with names that start with ERR0028 and end with gz.
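
One way to check this from the command line is to count the matching entries in a listing of the destination directory. The sketch below is an illustration rather than part of Crossbow: count_read_files is a hypothetical helper, and it assumes a listing with one entry per line and the object URL in the last field, which is the layout s3cmd ls prints.

```shell
# Hypothetical helper (not part of Crossbow): count listing entries whose
# file name starts with ERR0028 and ends with gz.
# Reads a listing on stdin, one entry per line, object URL in the last field.
count_read_files() {
    awk '{
        n = split($NF, parts, "/")      # file name is the last path component
        if (parts[n] ~ /^ERR0028.*gz$/) count++
    } END { print count + 0 }'
}

# Example use, assuming s3cmd is installed and configured. Note that s3cmd
# takes s3:// URLs rather than the s3n:// form used by Hadoop:
#   s3cmd ls s3://<read-bucket-name>/mm9chr17/ | count_read_files
```

With -M 500000 the count should come out at roughly 115; a much smaller number suggests the copy job did not finish.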

Step 2. Create and upload the reference jar

Scripts are provided to build a reference jar from publicly available M. musculus genome data (at UCSC) and SNP data (at dbSNP). Change to the reftools subdirectory and run the mm9_chr17_jar shell script. When the script completes, it should have created a mm9chr17 subdirectory containing a jar file named mm9_chr17.jar. Upload this jar file to an S3 bucket (e.g. using s3cmd's put command, Hadoop's hadoop fs -put command, or a graphical user interface such as S3 Firefox Organizer, Bucket Explorer or Cyberduck) and change its permissions so that it is readable by Everyone. You should now be able to access it through the following URL:

http://<bucket-name>.s3.amazonaws.com/<path-to-jar>

Make a note of the URL for the next step. You may also want to make a note of the reference jar file's MD5 checksum either by running a tool like md5sum on a local copy of mm9_chr17.jar, or by running something like s3cmd ls --list-md5 on the jar in S3.
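
The URL and checksum can be combined into the <url>::<MD5> form that the cb-local command in Step 3 accepts for its -r option. The following is a minimal sketch, assuming md5sum is available on the local machine; jar_arg is a hypothetical helper name, not a Crossbow script.

```shell
# Hypothetical helper (not part of Crossbow): print "<url>::<md5>" for a local
# copy of the reference jar. Assumes md5sum is on the PATH.
jar_arg() {
    jar_path=$1   # local copy of the jar, e.g. mm9chr17/mm9_chr17.jar
    jar_url=$2    # public URL of the uploaded jar
    [ -r "$jar_path" ] || return 1                 # refuse unreadable or missing files
    md5=$(md5sum "$jar_path" | cut -d ' ' -f 1)    # first field is the hex digest
    printf '%s::%s\n' "$jar_url" "$md5"
}

# Example:
#   jar_arg mm9chr17/mm9_chr17.jar "http://<bucket-name>.s3.amazonaws.com/<path-to-jar>"
```

The checksum is computed from the local file, so it only describes the uploaded jar if the same file was uploaded unchanged.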

Step 3. Start the Crossbow job

The quickest way to get started with a Crossbow job is to run the cb-interactive script and answer the prompts. When asked for the reference jar file, specify the URL from the previous step. For extra data integrity, also specify the MD5 checksum from the previous step when prompted. The maximum read length in this dataset is 36. When prompted for an instance type, select option 5 (c1.xlarge). When prompted for number of nodes, select 3. Alternatively, issue this command (portions in square brackets are optional):

cb-local \
    -r http://<bucket-name>.s3.amazonaws.com/<path-to-jar>[::<MD5>] \
    -n 8 \
    -i s3n://<read-bucket-name>/mm9chr17 \
    -t c1.xlarge \
    -a "-v 2 --strata --best -m 1" \
    -b "-2 -u -n -q" \
    -L 36 \
    -s 1000000 \
    -q phred33 \
    -c crossbow<your-account-#-without-dashes>

These are reasonable defaults. In our experiments, this job took about 30-40 minutes to run and cost about $2.50-3. Remember to terminate it when it completes.

Note that these are not reasonable defaults for genomes larger than a few hundred megabases. For larger genomes, always use the c1.xlarge instance type (option -t c1.xlarge).

See the manual for instructions on how to monitor your EC2 job.

Step 4. Sanity-check the results

Results are automatically downloaded from the EC2 cluster into the directory from which Crossbow was run and saved in a tar archive named <cluster-name>-output.tar. To unpack the tar archive, run:

tar xvf <cluster-name>-output.tar

or a similar command. The archive will be unpacked to a subdirectory named output, which will contain a set of files named chrXX.gz, where XX is a chromosome name as specified in the chromosome map (cmap) file when the reference jar was built. Each chrXX.gz file contains all of the SNPs on chromosome XX sorted along the forward reference strand.
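
A quick sanity check is to count the records in each chromosome file. The sketch below assumes each SNP occupies one line of the decompressed file, which the text above does not state explicitly; count_snps is a hypothetical helper, not a Crossbow script.

```shell
# Hypothetical helper (not part of Crossbow): print "<chromosome><TAB><lines>"
# for every .gz file in the output directory (default: ./output).
# Assumes one SNP record per line in each decompressed file.
count_snps() {
    dir=${1:-output}
    for f in "$dir"/*.gz; do
        [ -e "$f" ] || continue                        # no .gz files: print nothing
        n=$(gzip -dc "$f" | wc -l | tr -d ' ')         # line count, whitespace stripped
        printf '%s\t%s\n' "$(basename "$f" .gz)" "$n"
    done
}

# Example, run from the directory where the archive was unpacked:
#   count_snps output
```

For this example only one chromosome file is expected, since the reference covers chromosome 17 alone. An empty file would suggest that alignment or SNP calling did not run as intended.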

Crossbow

Genotyping from short reads using cloud computing