Crossbow

Genotyping from short reads using cloud computing

   

Crossbow is a scalable software pipeline for whole-genome resequencing analysis. It combines Bowtie, an ultrafast and memory-efficient short read aligner, and SoapSNP, an accurate genotyper, within Hadoop to distribute and accelerate the computation across many nodes. The pipeline can accurately analyze over 35x coverage of a human genome in one day on a 10-node local cluster, or in 3 hours for about $100 using a 40-node, 320-core cluster rented from Amazon's EC2 utility computing service. Crossbow is open source software.

Related Tools

    Bowtie: Ultrafast short read alignment
    SoapSNP: Accurate SNP/consensus calling
    Contrail: Cloud-based de novo assembly
    CloudBurst: Sensitive MapReduce alignment
    Hadoop: Open Source MapReduce
    TopHat: RNA-Seq splice junction mapper
    Cufflinks: Isoform assembly, quantitation

Reference jars

    Reference jars are not for users to download. Rather, users specify the URL and MD5 for the appropriate jar when running Crossbow and the allocated cluster downloads it automatically.

    H. sapiens: hg18/dbSNP 130
     http://crossbow-refs.s3.amazonaws.com/hg18.jar
     f133c43c0ee8281864f9a86dc403b54b

    M. musculus: mm9/dbSNP 128
     http://crossbow-refs.s3.amazonaws.com/mm9.jar
     2583b66224a6f9e38198f21c4a4729bc
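The MD5 listed with each jar lets a node confirm that its downloaded copy is intact. As a rough illustration only (this is not Crossbow's own code, and the helper names `md5_of_file` and `jar_matches` are hypothetical), such a check could be sketched in Python as follows:

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so a multi-gigabyte jar never has to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def jar_matches(path, expected_md5):
    """Return True if the file's MD5 equals the expected hex string."""
    # Hypothetical helper: normalizes case and whitespace before comparing.
    return md5_of_file(path) == expected_md5.strip().lower()
```

Under these assumptions, `jar_matches("hg18.jar", "f133c43c0ee8281864f9a86dc403b54b")` should return True only when the local file matches the MD5 published above for the human reference jar.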

Other Documentation

  • WABI 09 Poster (.pdf)

Bowtie 0.12.0 - 12/23/09

Crossbow paper out - 11/20/09

0.1.3 release - 10/21/09

  • cb-local now gives the user clear feedback when worker nodes fail to confirm the MD5 signature of the reference jar. If this failure occurs several times per node across all nodes, the supplied MD5 is probably incorrect.
  • An extra Reduce step was added to the end of the Crossbow job to bin and sort SNPs before they are downloaded to the user's computer. This step also renames output files by chromosome and deletes empty output files.
  • Added another example that uses recently published mouse chromosome 17 data (sequenced by Sudbery et al.). The TUTORIAL file now points to this new example.
  • More and clearer messages in the output from cb-local.
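The bin-and-sort step mentioned in this release can be pictured with a small sketch. This is an illustration under stated assumptions, not the actual Reduce implementation: the `(chromosome, position, call)` record layout and the function name `bin_and_sort_snps` are hypothetical.

```python
from collections import defaultdict


def bin_and_sort_snps(records):
    """Group (chrom, pos, call) tuples by chromosome, sorted by position."""
    bins = defaultdict(list)
    for chrom, pos, call in records:
        bins[chrom].append((pos, call))
    # A chromosome with no SNPs never gets a bin, which loosely mirrors
    # the deletion of empty output files described above.
    return {
        chrom: sorted(snps, key=lambda snp: snp[0])
        for chrom, snps in bins.items()
    }
```

Each key of the returned dictionary would correspond to one per-chromosome output file in this simplified picture.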

Crossbow piece on Cloudera Blog - 10/15/09

  • Mike Schatz, Crossbow co-author, has a great guest post on Cloudera's "Hadoop & Big Data" blog about Crossbow.

0.1.2 release - 10/13/09

  • Many fixes for the scripts that automate the reference-jar building process.
  • Added two utility scripts, dist_mfa and sanity_check, to the reftools subdirectory. See their documentation for details.
  • Added scripts for building a reference jar for C. elegans using UCSC's ce6 (WormBase's WS190) assembly and information from dbSNP. This small genome is used in the new TUTORIAL.
  • New TUTORIAL steps the user through preprocessing reads from the NCBI Short Read Archive, creating a reference jar from a UCSC assembly (ce6 in this case) and a set of SNP descriptions from dbSNP, then running Crossbow and examining the resulting SNPs.
  • Extended the preprocess-and-copy infrastructure to allow output from a single input file to be split over many output files. This is critical for achieving good load balance across a breadth of datasets.
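To see why splitting one input file over many output files helps load balance, consider the sketch below. It distributes FASTQ records round-robin across a fixed number of parts so that each part ends up roughly the same size. The round-robin policy and the names `fastq_records` and `split_round_robin` are assumptions made for illustration; the release notes do not describe how Crossbow's preprocessing scripts actually divide the data.

```python
from itertools import islice


def fastq_records(lines):
    """Yield FASTQ records as 4-line tuples from an iterable of lines."""
    iterator = iter(lines)
    while True:
        record = tuple(islice(iterator, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def split_round_robin(lines, num_parts):
    """Distribute FASTQ records across num_parts lists in round-robin order."""
    parts = [[] for _ in range(num_parts)]
    for index, record in enumerate(fastq_records(lines)):
        parts[index % num_parts].append(record)
    return parts
```

With this policy the part sizes differ by at most one record, so no single worker receives a disproportionate share of one large input file.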

0.1.1 release - 10/9/09

  • Added scripts that automate the reference-jar building process for UCSC genomes hg18 and mm9. These scripts can be adapted to other species. See the new "Using Automatic Scripts" subsection of the "Building a Reference Jar" section of the MANUAL for details.
  • License agreement files are now organized better. All licenses applying to all software included in Crossbow are in LICENSE* files in the Crossbow root directory.
  • Minor updates to MANUAL.

0.1.0 release - 10/3/09

The first public release of Crossbow is now available for download. This release includes:
  • A detailed manual (MANUAL in the expanded archive) that includes step-by-step instructions for how to get started with Amazon Web Services and Crossbow.
  • Scripts for copying and preprocessing short-read FASTQ files into Amazon's S3 storage service, including an easy-to-use interactive script (cb-copy-interactive).
  • Scripts for running Crossbow either locally or on Amazon's EC2 utility computing service, including an easy-to-use interactive script (cb-interactive).