blasr - Map SMRT Sequences to a reference genome.
blasr reads.bam genome.fasta -bam -out
blasr reads.fasta genome.fasta
blasr reads.fasta genome.fasta -sa
blasr reads.bax.h5 genome.fasta [-sa
blasr reads.bax.h5 genome.fasta -sa genome.fasta.sa
-maxScore -100 -minMatch 15 ...
blasr reads.bax.h5 genome.fasta -sa genome.fasta.sa
-nproc 24 -out alignment.out ...
is a read mapping program that maps reads to positions in a genome
by clustering short exact matches between the read and the genome, and scoring
clusters using alignment. The matches are generated by searching all suffixes
of a read against the genome using a suffix array. Global chaining methods are
used to score clusters of matches.
The only required inputs to blasr are a file of reads and a reference genome. It
is exremely useful to have read filtering information, and mapping runtime may
decrease substantially when a precomputed suffix array index on the reference
sequence is specified.
Although reads may be input in FASTA format, the recommended input is PacBio BAM
files because these contain qualtiy value information that is used in the
alignment and produces higher quality variant detection. Although alignments
can be output in various formats, the recommended output format is PacBio BAM.
Support for bax.h5 and plx.h5 files will be DEPRECATED
. Support for
region tables for h5 files will be DEPRECATED
When suffix array index of a genome is not specified, the suffix array is built
before producing alignment. This may be prohibitively slow when the genome is
large (e.g. Human). It is best to precompute the suffix array of a genome
using the program sawriter
(1), and then specify the suffix array on the
command line using -sa
The optional parameters are roughly divided into three categories: control over
anchoring, alignment scoring, and output.
The default anchoring parameters are optimal for small genomes and samples with
up to 5% divergence from the reference genome. The main parameter governing
speed and sensitivity is the -minMatch
parameter. For human genome
alignments, a value of 11 or higher is recommended. Several methods may be
used to speed up alignments, at the expense of possibly decreasing
Regions that are too repetitive may be ignored during mapping by limiting the
number of positions a read maps to with the -maxAnchorsPerPosition
option. Values between 500 and 1000 are effective in the human genome.
For small genomes such as bacterial genomes or BACs, the default parameters are
sufficient for maximal sensitivity and good speed.
- Input Files
(DEPRECATED) Options for modifying reads.
- A PacBio BAM file of reads. This is the preferred input to
blasr because rich quality value (insertion,deletion, and
substitution quality values) information is maintained. The extra quality
information improves variant detection and mapping speed.
- A multi-fasta file of reads, though any fasta file is valid
- the old DEPRECATED output format of SMRT reads.
- File of file names
- -sa suffixArrayFile
- Use the suffix array 'sa' for detecting matches between the
reads and the reference. The suffix array has been prepared by the
- -ctab tab
- A table of tuple counts used to estimate match
significance. This is by the program 'printTupleCountTable'. While it is
quick to generate on the fly, if there are many invocations of
blasr, it is useful to precompute the ctab.
- -regionTable table (DEPRECATED)
- Read in a read-region table in HDF format for masking
portions of reads. This may be a single table if there is just one input
file, or a fofn. When a region table is specified, any region table inside
the reads.plx.h5 or reads.bax.h5 files are ignored.
Alignments To Report
There is ancilliary information about substrings of reads that is stored in a
'region table' for each read file. Because HDF is used, the region table may
be part of the .bax.h5 or .plx.h5 file, or a separate file. A contiguously
read substring from the template is a subread, and any read may contain
multiple subreads. The boundaries of the subreads may be inferred from the
region table either directly or by definition of adapter boundaries. Typically
region tables also contain information for the location of the high and low
quality regions of reads. Reads produced by spurious reads from empty ZMWs
have a high quality start coordinate equal to high quality end, making no
- Align the circular consensus sequence (ccs), then report
alignments of the ccs subreads to the window that the ccs was mapped to.
Only alignments of the subreads are reported.
- Similar to -useccs, except all subreads are aligned,
rather than just the subreads used to call the ccs. This will include
reads that only cover part of the template.
- Align the circular consensus, and report only the alignment
of the ccs sequence.
- -noSplitSubreads (false)
- Do not split subreads at adapters. This is typically only
useful when the genome in an unrolled version of a known template, and
contains template-adapter-reverse_template sequence.
- -ignoreRegions (false)
- Ignore any information in the region table.
- -ignoreHQRegions (false)
- Ignore any hq regions in the region table.
Output Formats and Files
- -bestn n (10)
- Report the top n alignments.
- -hitPolicy (all)
- Specify a policy to treat multiple hits from [all, allbest,
random, randombest, leftmost]
- report all alignments.
- report all equally top scoring alignments.
- report a random alignment.
- report a random alignment from multiple equally top scoring
- report an alignment which has the best alignmentscore and
has the smallest mapping coordinate in any reference.
- -placeRepeatsRandomly (false)
- DEPRECATED! If true, equivalent to
- -randomSeed (0)
- Seed for random number generator. By default (0), use
current time as seed.
- -noSortRefinedAlignments (false)
- Once candidate alignments are generated and scored via
sparse dynamic programming, they are rescored using local alignment that
accounts for different error profiles. Resorting based on the local
alignment may change the order the hits are returned.
- When specified, adjacent insertion or deletions are
allowed. Otherwise, adjacent insertion and deletions are merged into one
operation. Using quality values to guide pairwise alignments may dictate
that the higher probability alignment contains adjacent insertions or
deletions. Current tools such as GATK do not permit this and so they are
not reported by default.
Options for anchoring alignment regions.
- -out out (terminal)
- Write output to out.
- Write output in SAM format.
- -m t
- If not printing SAM, modify the output of the
- When t is:
- Print blast like output with |'s connecting matched
- Print only a summary: score and pos.
- Print in Compare.xml format.
- Print in vulgar format (DEPRECATED).
- Print a longer tabular version of the alignment.
- Print in a machine-parsable format that is read by
- Print a header as the first line of the output file
describing the contents of each column.
- -titleTable tab (NULL)
- Construct a table of reference sequence titles. The
reference sequences are enumerated by row, 0,1,... The reference index is
printed in alignment results rather than the full reference name. This
makes output concise, particularly whenvery verbose titles exist in
- -unaligned file
- Output reads that are not aligned to file
- -clipping [none|hard|subread|soft] (none)
- Use no/hard/subread/soft clipping, ONLY for SAM/BAM
- -printSAMQV (false)
- Print quality values to SAM output.
- -cigarUseSeqMatch (false)
- CIGAR strings in SAM/BAM output use '=' and 'X' to
represent sequence match and mismatch instead of 'M'.
Options for Refining Hits
This will have the greatest effect on speed and sensitivity.
- -minMatch m (12)
- Minimum seed length. Higher minMatch will speed up
alignment, but decrease sensitivity.
- -maxMatch l (inf)
- Stop mapping a read to the genome when the lcp length
reaches l. This is useful when the query is part of the reference,
for example when constructing pairwise alignments for de novo
- -maxLCPLength l (inf)
- The same as -maxMatch.
- -maxAnchorsPerPosition m (10000)
- Do not add anchors from a position if it matches to more
than m locations in the target.
- -advanceExactMatches E (0)
- Another trick for speeding up alignments with match - E
fewer anchors. Rather than finding anchors between the read and the genome
at every position in the read, when an anchor is found at position i in a
read of length L, the next position in a read to find an anchor is at
i+L-E. Use this when alignining already assembled contigs.
- -nCandidates n (10)
- Keep up to n candidates for the best alignment. A
large value of n will slow mapping because the slower dynamic programming
steps are applied to more clusters of anchors which can be a rate limiting
step when reads are very long.
- -concordant (false)
- Map all subreads of a zmw (hole) to where the longest full
pass subread of the zmw aligned to. This requires to use the region table
and hq regions. This option only works when reads are in base or pulse h5
- -concordantTemplate (mediansubread)
- Select a full pass subread of a zmw as template for
concordant mapping. longestsubread - use the longest full pass subread
mediansubread - use the median length full pass subread typicalsubread -
use the second longest full pass subread if length of the longest full
pass subread is an outlier
- -fastMaxInterval (false)
- Fast search maximum increasing intervals as alignment
candidates. The search is not as exhaustive as the default, but is much
- -aggressiveIntervalCut (false)
- Agreesively filter out non-promising alignment candidates,
if there exists at least one promising candidate. If this option is turned
on, blasr is likely to ignore short alignments of ALU
- -fastSDP (false)
- Use a fast heuristic algorithm to speed up sparse dynamic
Options for overlap/dynamic programming alignments and pairwise overlap for
de novo assembly.
- -sdpTupleSize K (11)
- Use matches of length K to speed dynamic programming
alignments. This controls accuracy of assigning gaps in pairwise
alignments once a mapping has been found, rather than mapping sensitivity
- -scoreMatrix score matrix string
- Specify an alternative score matrix for scoring fasta
reads. The matrix is in the format
| A C G T N
|A a b c d e
|C f g h i j
|G k l m n o
|T p q r s t
|N u v w x y
The values a...y should be input as a quoted space separated string: "a
b c ... y". Lowerf scores are better, so matches should be less than
mismatches e.g. a,g,m,s = -5 (match), mismatch = 6.
- -affineOpen value (10)
- Set the penalty for opening an affine alignment.
- -affineExtend a (0)
- Change affine (extension) gap penalty. Lower value allows
Options for filtering reads and alignments
- -useQuality (false)
- Use substitution/insertion/deletion/merge quality values to
score gap and mismatch penalties in pairwise alignments. Because the
insertion and deletion rates are much higher than substitution, this will
make many alignments favor an insertion/deletion over a
substitution.nNaive consensus calling methods will then often miss
substitution polymorphisms. This option should be used when calling
consensus using the Quiver method. Furthermore, when not using quality
values to score alignments, there will be a lower consensus accuracy in
- -affineAlign (false)
- Refine alignment using affine guided align.
Options for parallel alignment
- -minReadLength l (50)
- Skip reads that have a full length less than l.
Subreads may be shorter.
- -minSubreadLength l (0)
- Do not align subreads of length less than l.
- -minRawSubreadScore m (0)
- Do not align subreads whose quality score in region table
is less than m (quality scores should be in range [0, 1000]).
- -maxScore m (-200)
- Maximum score to output (high is bad, negative good).
- (0) Report alignments only if their lengths are greater
(0) Report alignments only if their percentage similairty is greater than
- (0) Report alignments only if their percentage accuray is
greater than minAccuracy.
Options for subsampling reads.
- -nproc N (1)
- Align using N processes. All large data structures
such as the suffix array and tuple count table are shared.
- -start S (0)
- Index of the first read to begin aligning. This is useful
when multiple instances are running on the same data, for example when on
a multi-rack cluster.
- -stride S (1)
- Align one read every S reads.
- -subsample (0)
- Proportion of reads to randomly subsample (expressed as a
decimal) and align.
- -holeNumbers LIST
- When specified, only align reads whose ZMW hole numbers are
in LIST. LIST is a comma-delimited string of ranges, such as
'1,2,3,10-13'. This option only works when reads are in bam, bax.h5 or
- Print help information.
To cite BLASR, please use: Chaisson M.J., and Tesler G., Mapping single molecule
sequencing reads using Basic Local Alignment with Successive Refinement
(BLASR): Theory and Application, BMC Bioinformatics 2012, 13:238.
Please report any bugs to