cmalign - align sequences to a covariance model
cmalign [options] <cmfile> <seqfile>
cmalign aligns the RNA sequences in <seqfile> to the
covariance model (CM) in <cmfile>.
The new alignment is output to stdout
in Stockholm format, but can be redirected to a file <f>
with the -o <f> option.
Either <seqfile> or <cmfile>
(but not both) may be '-'
(dash), which means reading this input from stdin
rather than a file.
The sequence file <seqfile>
must be in FASTA or Genbank format.
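As an illustration of the input rules above, the following is a minimal sketch of a wrapper that assembles a cmalign command line and rejects the case where both inputs are '-'. The helper name build_cmalign_argv and the file names in the usage note are hypothetical and not part of Infernal; only the -o option and the argument order come from this page.

```python
def build_cmalign_argv(cmfile, seqfile, out_file=None):
    """Return an argv list for cmalign.

    Hypothetical helper for illustration; it is not part of Infernal.
    """
    # Either input may be '-' (read from stdin), but not both.
    if cmfile == "-" and seqfile == "-":
        raise ValueError("only one of <cmfile> and <seqfile> may be '-'")
    argv = ["cmalign"]
    if out_file is not None:
        # -o <f> redirects the alignment from stdout to a file.
        argv += ["-o", out_file]
    # Options come first, then <cmfile>, then <seqfile>.
    argv += [cmfile, seqfile]
    return argv
```

The resulting list could be passed to subprocess.run, for example build_cmalign_argv("family.cm", "-", out_file="out.sto") with the sequences piped in on stdin.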
cmalign uses an HMM banding technique to accelerate alignment by default
as described below for the --hbanded
option. HMM banding can be turned
off with the --nonbanded option.
By default, cmalign
computes the alignment with maximum expected accuracy
that is consistent with constraints (bands) derived from an HMM, using a
banded version of the Durbin/Holmes optimal accuracy algorithm. This behavior
can be changed with the --cyk or --sample options.
cmalign takes special care to correctly align truncated sequences, where
some nucleotides from the beginning (5') and/or end (3') of the actual full
length biological sequence are not present in the input sequence (see DL Kolbe
and SR Eddy, Bioinformatics, 25:1236-1243, 2009). This behavior is on by
default, but can be turned off with --notrunc.
In previous versions of cmalign, the --sub
option was required to appropriately handle
truncated sequences. The --sub
option is still available in this
version, but the new default method for handling truncated sequences should be
as good or superior to the sub method in nearly all cases.
The --mapali <s>
option allows inclusion of the fixed
training alignment used to build the CM from file <s> within the
output alignment of cmalign.
It is possible to merge two or more alignments created by the same CM using the
Easel miniapp esl-alimerge
(included in the easel/miniapps/
subdirectory of Infernal). Previous versions of cmalign included
options to merge alignments, but they were deprecated upon development of
esl-alimerge, which is significantly more memory efficient.
By default, cmalign
will output the alignment to stdout. The alignment
can be redirected to an output file <f>
with the -o
option. With -o,
information on each aligned sequence,
including score and model alignment boundaries will be printed to stdout (more
on this below).
The output alignment will be in Stockholm format by default. This can be changed
to Pfam, aligned FASTA (AFA), A2M, Clustal, or Phylip format using the
--outformat <s> option, where <s>
is the name
of the desired format. As a special case, if the output alignment is large
(more than 10,000 sequences or more than 10,000,000 total nucleotides) then
the output format will be Pfam format, with each sequence appearing on a
single line, for reasons of memory efficiency. For alignments larger than
this, using --ileaved
will force interleaved Stockholm format, but the
user should be aware that this may require a lot of memory. --ileaved
will only work for alignments up to 100,000 sequences or 100,000,000 total
nucleotides.
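The size thresholds in the preceding paragraph can be summarized in a short sketch. The function name and return strings are hypothetical. Raising an error when --ileaved is requested beyond its limit is an assumption, since this page only says the option "will only work" up to that size. An explicit --outformat choice is not modeled.

```python
# Thresholds quoted from the text above.
PFAM_MAX_SEQS = 10_000
PFAM_MAX_NT = 10_000_000
ILEAVED_MAX_SEQS = 100_000
ILEAVED_MAX_NT = 100_000_000


def default_output_layout(n_seqs, n_nucleotides, ileaved=False):
    """Return the alignment layout implied by alignment size.

    Hypothetical helper for illustration; it is not part of Infernal.
    """
    if ileaved:
        # Assumption: exceeding the --ileaved limit is treated as an error.
        if n_seqs > ILEAVED_MAX_SEQS or n_nucleotides > ILEAVED_MAX_NT:
            raise ValueError("alignment too large for --ileaved")
        return "interleaved Stockholm"
    # Large alignments fall back to one-line-per-sequence Pfam format.
    if n_seqs > PFAM_MAX_SEQS or n_nucleotides > PFAM_MAX_NT:
        return "Pfam"
    return "Stockholm"
```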
If the output alignment format is Stockholm or Pfam, the output alignment will
be annotated with posterior probabilities which estimate the confidence level
of each aligned nucleotide. This annotation appears as lines beginning with
"#=GR <seq name> PP", one per sequence, each immediately below
the corresponding aligned sequence "<seq name>". Characters in
PP lines have 12 possible values: "0-9", "*", or
".". If ".", the position corresponds to a gap in the
sequence. A value of "0" indicates a posterior probability of
between 0.0 and 0.05, "1" indicates between 0.05 and 0.15,
"2" indicates between 0.15 and 0.25 and so on up to "9"
which indicates between 0.85 and 0.95. A value of "*" indicates a
posterior probability of between 0.95 and 1.0. Higher posterior probabilities
correspond to greater confidence that the aligned nucleotide belongs where it
appears in the alignment. With --nonbanded,
the calculation of the
posterior probabilities considers all possible alignments of the target
sequence to the CM. Without --nonbanded
(i.e. in default mode), the
calculation considers only possible alignments within the HMM bands. Further,
the posterior probabilities are conditional on the truncation mode of the
alignment. For example, if the sequence alignment is truncated 5', a PP value
of "9" indicates between 0.85 and 0.95 of all 5' truncated
alignments include the given nucleotide at the given position. The posterior
annotation can be turned off with the --noprob option. If --small
is enabled, posterior annotation must also be turned off using
--noprob.
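The PP character encoding described above can be decoded with a small lookup. This is a sketch under the ranges stated in this page; the function name pp_range is hypothetical.

```python
def pp_range(ch):
    """Map one PP annotation character to its (low, high) probability range.

    Returns None for '.', which marks a gap in the sequence.
    Hypothetical helper for illustration; it is not part of Infernal.
    """
    if ch == ".":
        return None
    if ch == "*":
        return (0.95, 1.0)
    if len(ch) == 1 and ch in "0123456789":
        d = int(ch)
        # '0' covers 0.0-0.05; digit d >= 1 covers d/10 - 0.05 to d/10 + 0.05.
        low = 0.0 if d == 0 else (10 * d - 5) / 100
        high = (10 * d + 5) / 100
        return (low, high)
    raise ValueError(f"not a PP character: {ch!r}")
```

Applying pp_range to each character of a "#=GR <seq name> PP" line gives a per-column confidence range for that sequence.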
The tabular output that is printed to stdout if the -o
option is used
includes one line per sequence and twelve fields per line: "idx":
the index of the sequence in the input file; "seq name": the
sequence name; "length": the length of the sequence; "cm
from" and "cm to": the model start and end positions of the
alignment; "trunc": "no" if the sequence is not truncated,
"5'" if the beginning of the sequence is truncated, "3'"
if the end of the sequence is truncated, and "5'&3'" if both the
beginning and the end are truncated; "bit sc": the bit score of the
alignment; "avg pp": the average posterior probability of all aligned
nucleotides in the alignment; "band calc", "alignment" and
"total": the time in seconds required for calculating HMM bands,
computing the alignment, and complete processing of the sequence,
respectively; "mem (Mb)": the size in Mb of all dynamic programming
matrices required for aligning the sequence. This tabular data can be saved to
file <f> with the --sfile <f> option.
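The twelve fields listed above could be read with a parser along these lines. This is a sketch that assumes the fields are whitespace-delimited, that no field contains internal whitespace, and that header lines begin with "#". None of these layout details are stated on this page, and the class and function names are hypothetical.

```python
from typing import NamedTuple, Optional


class ScoreRow(NamedTuple):
    idx: int
    seq_name: str
    length: int
    cm_from: int
    cm_to: int
    trunc: str        # "no", "5'", "3'", or "5'&3'"
    bit_sc: float
    avg_pp: float
    band_calc: float  # seconds
    alignment: float  # seconds
    total: float      # seconds
    mem_mb: float


def parse_score_line(line: str) -> Optional[ScoreRow]:
    """Parse one line of the tabular output; return None for blank or '#' lines."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None
    tok = stripped.split()
    if len(tok) != 12:
        raise ValueError(f"expected 12 fields, got {len(tok)}")
    return ScoreRow(
        int(tok[0]), tok[1], int(tok[2]), int(tok[3]), int(tok[4]), tok[5],
        float(tok[6]), float(tok[7]), float(tok[8]), float(tok[9]),
        float(tok[10]), float(tok[11]),
    )
```

Rows parsed this way can then be filtered, for example to list the sequences whose "trunc" field is not "no".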
- -h
- Help; print a brief reminder of command line usage and
available options.
- -o <f>
- Save the alignment in Stockholm format to a file
<f>. The default is to write it to standard output.
- -g
- Configure the model for global alignment of the query model
to the target sequences. By default, the model is configured for local
alignment. Local alignment allows large insertions and deletions in the
structure, called "local ends", which are penalized differently
than normal indels. These are annotated as "~" columns in the RF
line of the output alignment. The -g option can be used to disallow
these local ends. The -g option is required if the --sub
option is also used.
- --optacc
- Align sequences using the Durbin/Holmes optimal accuracy
algorithm. This is the default. The optimal accuracy alignment will be
constrained by HMM bands for acceleration unless the --nonbanded
option is enabled. The optimal accuracy algorithm determines the alignment
that maximizes the posterior probabilities of the aligned nucleotides
within it. The posterior probabilities are determined using (possibly HMM
banded) variants of the Inside and Outside algorithms.
- --cyk
- Do not use the Durbin/Holmes optimal accuracy algorithm to
align the sequences; instead use the CYK algorithm, which determines the
optimally scoring (maximum likelihood) alignment of the sequence to the
model, given the HMM bands (unless --nonbanded is also enabled).
- --sample
- Sample an alignment from the posterior distribution of
alignments. The posterior distribution is determined using an HMM banded
(unless --nonbanded) variant of the Inside algorithm.
- --seed <n>
- Seed the random number generator with <n>, an
integer >= 0. This option can only be used in combination with
--sample. If <n> is nonzero, stochastic sampling of
alignments will be reproducible; the same command will give the same
results. If <n> is 0, the random number generator is seeded
arbitrarily, and stochastic samplings may vary from run to run of the same
command. The default seed is 181.
- --notrunc
- Turn off truncated alignment algorithms. All sequences in
the input file will be assumed to be full length, unless --sub is
also used, in which case the program can still handle truncated sequences
but will use an alternative strategy for their alignment.
- --sub
- Turn on the sub model construction and alignment procedure.
For each sequence, an HMM is first used to predict the model start and end
consensus columns, and a new sub CM is constructed that only models
consensus columns from start to end. The sequence is then aligned to this
sub CM. Sub alignment is an older method than the default one for aligning
sequences that are possibly truncated. By default, cmalign uses
special DP algorithms to handle truncated sequences which should be more
accurate than the sub method in most cases. --sub is still included
as an option mainly for testing against this default truncated sequence
handling. This "sub CM" procedure is not the same as the
"sub CMs" described by Weinberg and Ruzzo.
- --hbanded
- This option is turned on by default. Accelerate alignment
by pruning away regions of the CM DP matrix that are deemed negligible by
an HMM. First, each sequence is scored with a CM plan 9 HMM derived from
the CM using the Forward and Backward HMM algorithms to calculate
posterior probabilities that each nucleotide aligns to each state of the
HMM. These posterior probabilities are used to derive constraints (bands)
on the CM DP matrix. Finally, the target sequence is aligned to the CM
using the banded DP matrix, during which cells outside the bands are
ignored. Usually most of the full DP matrix lies outside the bands (often
more than 95%), making this technique faster because fewer DP calculations
are required, and more memory efficient because only cells within the
bands need be allocated.
Importantly, HMM banding sacrifices the guarantee of determining the
optimally accurate or optimally scoring alignment, which will be missed if it lies
outside the bands. The tau parameter is the amount of probability mass
considered negligible during HMM band calculation; higher values of tau
yield greater speedups but also a greater chance of missing the optimal
alignment. The default tau is 1E-7, determined empirically as a good
tradeoff between sensitivity and speed, though this value can be changed
with the --tau <x> option. The level of acceleration
increases with both the length and primary sequence conservation level of
the family. For example, with the default tau of 1E-7, tRNA models (low
primary sequence conservation with length of about 75 nucleotides) show
about 10X acceleration, and SSU bacterial rRNA models (high primary
sequence conservation with length of about 1500 nucleotides) show about
700X. HMM banding can be turned off with the --nonbanded option.
- --tau <x>
- Set the tail loss probability used during HMM band
calculation to <x>. This is the amount of probability mass
within the HMM posterior probabilities that is considered negligible. The
default value is 1E-7. In general, higher values will result in greater
acceleration, but increase the chance of missing the optimal alignment due
to the HMM bands.
- --mxsize <x>
- Set the maximum allowable total DP matrix size to
<x> megabytes. By default this size is 1028 Mb. This should
be large enough for the vast majority of alignments. If it is not,
cmalign will attempt to iteratively tighten the HMM bands it uses
to constrain the alignment by raising the tau parameter and recalculating
the bands until the total matrix size needed falls below <x>
megabytes or the maximum allowable tau value (0.05 by default, but
changeable with --maxtau) is reached. At each iteration of band
tightening, tau is multiplied by 2.0. The band tightening strategy can
be turned off with the --fixedtau option. If the maximum tau is
reached and the required matrix size still exceeds <x>, or if
HMM banding is not being used and the required matrix size exceeds
<x>, then cmalign will exit prematurely and report an
error message that the matrix exceeded its maximum allowable size. In this
case, the --mxsize option can be used to raise the size limit, or the
maximum tau can be raised with --maxtau. The limit will commonly be
exceeded when the --nonbanded option is used without the
--small option, but can still occur when --nonbanded is not
used. Note that if cmalign is being run with <n>
threads on a multicore machine, then each thread may have an
allocated matrix of up to size <x> Mb at any given time.
- --fixedtau
- Turn off the HMM band tightening strategy described in the
explanation of the --mxsize option above.
- --maxtau <x>
- Set the maximum allowed value for tau during band
tightening, described in the explanation of --mxsize above, to
<x>. By default this value is 0.05.
- --nonbanded
- Turn off HMM banding. The returned alignment is guaranteed
to be the globally optimally accurate one (by default) or the globally
optimally scoring one (if --cyk is enabled). The --small
option is recommended in combination with this option, because standard
alignment without HMM banding requires a lot of memory (see --small
below).
- --small
- Use the divide and conquer CYK alignment algorithm
described in SR Eddy, BMC Bioinformatics 3:18, 2002. The
--nonbanded option must be used in combination with this option.
Also, it is recommended whenever --nonbanded is used that
--small is also used because standard CM alignment without HMM
banding requires a lot of memory, especially for large RNAs.
--small allows CM alignment within practical memory limits,
reducing the memory required for alignment of LSU rRNA, the largest known
RNAs, from 150 Gb to less than 300 Mb. This option can only be used in
combination with --nonbanded, --notrunc, and --cyk.
- --sfile <f>
- Dump per-sequence alignment score and timing information to
file <f>. The format of this file is described above (it's
the same data in the same format as the tabular stdout output when the
-o option is used).
- --tfile <f>
- Dump tabular sequence tracebacks for each individual
sequence to a file <f>. Primarily useful for debugging.
- --ifile <f>
- Dump per-sequence insert information to file
<f>. The format of the file is described by
"#"-prefixed comment lines included at the top of the file
<f>. The insert information is valid even when the
--matchonly option is used.
- --elfile <f>
- Dump per-sequence EL state (local end) insert information
to file <f>. The format of the file is described by
"#"-prefixed comment lines included at the top of the file
<f>. The EL insert information is valid even when the
--matchonly option is used.
- --mapali <f>
- Read the alignment from file <f> that was used to
build the model and align it as a single object to the CM; i.e., the alignment
in <f> is held fixed. This allows you to align sequences to a
model with cmalign and view them in the context of an existing
trusted multiple alignment. <f> must be the alignment file
that the CM was built from. The program verifies that the checksum of the
file matches that of the file used to construct the CM. A similar option
to this one was called --withali in previous versions of
cmalign.
- --mapstr
- Must be used in combination with --mapali
<f>. Propagate structural information for any pseudoknots
that exist in <f> to the output alignment. A similar option
to this one was called --withstr in previous versions of
cmalign.
- --informat <s>
- Assert that the input <seqfile> is in format
<s>. Do not run Babelfish format autodetection. This increases
the reliability of the program somewhat, because the Babelfish can make
mistakes; particularly recommended for unattended, high-throughput runs of
Infernal. Acceptable formats are: FASTA, GENBANK, and DDBJ.
<s> is case-insensitive.
- --outformat <s>
- Specify the output alignment format as <s>.
Acceptable formats are: Pfam, AFA, A2M, Clustal, and Phylip. AFA is
aligned fasta. Only Pfam and Stockholm alignment formats will include
consensus structure annotation and posterior probability annotation of
aligned residues.
- --dnaout
- Output the alignments as DNA sequence alignments, instead
of RNA ones.
- --noprob
- Do not annotate the output alignment with posterior
probabilities.
- --matchonly
- Only include match columns in the output alignment; do not
include any insertions relative to the consensus model. This option may be
useful when creating very large alignments that require a lot of memory
and disk space, most of which is necessary only to deal with insert
columns that are gaps in most sequences.
- --ileaved
- Output the alignment in interleaved Stockholm format of a
fixed width that may be more convenient for examination. This was the
default output alignment format of previous versions of cmalign.
Note that cmalign requires more memory when this option is used.
For this reason, --ileaved will only work for alignments of up to
100,000 sequences or a total of 100,000,000 aligned nucleotides.
- --regress <s>
- Save an additional copy of the output alignment with no
author information to file <s>.
- --verbose
- Output additional information in the tabular scores output
(output to stdout if -o is used, or to <f> if
--sfile <f> is used). These are mainly useful for
testing and debugging.
- --cpu <n>
- Specify that <n> parallel CPU workers be used.
If <n> is set as "0", then the program will be run
in serial mode, without using threads. You can also control this number by
setting an environment variable, INFERNAL_NCPU. This option will
only be available if the machine on which Infernal was built is capable of
using POSIX threading (see the Installation section of the user guide for
more information).
- --mpi
- Run as an MPI parallel program. This option will only be
available if Infernal has been configured and built with the
"--enable-mpi" flag (see the Installation section of the user
guide for more information).
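The band tightening procedure described under --mxsize above can be sketched as a simple loop. This is an illustration under stated assumptions, not Infernal's implementation. The required_mb argument is a hypothetical callable standing in for the real band calculation, and it returns the matrix size in Mb needed at a given tau. The sketch also assumes that tau is never raised past the --maxtau value, which this page does not spell out.

```python
def tighten_tau(required_mb, mxsize=1028.0, tau=1e-7, maxtau=0.05, factor=2.0):
    """Raise tau until the banded DP matrix fits within mxsize Mb.

    required_mb: hypothetical callable mapping tau to matrix size in Mb.
    Returns (final tau, number of tightening iterations).
    Defaults mirror the values given for --tau, --mxsize and --maxtau.
    """
    iterations = 0
    while required_mb(tau) > mxsize:
        # Assumption: give up rather than let tau exceed maxtau.
        if tau * factor > maxtau:
            raise RuntimeError(
                f"matrix still exceeds {mxsize} Mb at tau={tau:g}"
            )
        tau *= factor  # each iteration multiplies tau by 2.0 by default
        iterations += 1
    return tau, iterations
```

With the defaults, tau can be doubled at most 18 times starting from 1E-7 before the next doubling would pass 0.05. With --fixedtau, the equivalent behavior would be a single size check with no loop.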
See infernal(1) for a master man page with a list of all the individual
man pages for programs in the Infernal package.
For complete documentation, see the user guide that came with your Infernal
distribution (Userguide.pdf); or see the Infernal web page.
Copyright (C) 2016 Howard Hughes Medical Institute.
Freely distributed under a BSD open source license.
For additional information on copyright and licensing, see the file called
COPYRIGHT in your Infernal source distribution, or see the Infernal web page.
The Eddy/Rivas Laboratory
Janelia Farm Research Campus
19700 Helix Drive
Ashburn VA 20147 USA