cmsearch - search covariance model(s) against a sequence database
cmsearch [options] <cmfile> <seqdb>
is used to search one or more covariance models (CMs) against a
sequence database. For each CM in <cmfile>,
use that query CM to
search the target database of sequences in <seqdb>,
ranked lists of the sequences with the most significant matches to the CM. To
build CMs from multiple alignments, see cmbuild.
The query <cmfile>
must have been calibrated for E-values with
As a special exception, any models in
that have zero basepairs need not be calibrated. For
these models, profile HMM search algorithms will be used instead of CM ones,
as discussed further below.
The query <cmfile>
may be '-' (a dash character), in which case the
query CM input will be read from a <stdin> pipe instead of from a file.
may not be '-' because the current implementation
needs to be able to rewind the database, which is not possible with
The output format is designed to be human-readable, but is often so voluminous
that reading it is impractical, and parsing it is a pain. The --tblout
option saves output in a simple tabular format that is concise and easier to
parse. The -o
option allows redirecting the main output, including
throwing it away in /dev/null.
reexamines the 5' and 3' termini of target sequences using
specialized algorithms for detection of truncated
hits, in which part
of the 5' and/or 3' end of the actual full length homologous sequence is
missing in the target sequence file. These types of hits will be most common
in sequence files consisting of unassembled sequencing reads. By default, any
5' truncated hit is required to include the first residue of the target
sequence it derives from in <seqdb>,
and any 3' truncated hit is
required to include the final residue of the target sequence it derives from.
Any 5' and 3' truncated hit must include the first and final residue of the
target sequence it derives from. The --anytrunc
option will relax the
requirements for hit inclusion of sequence endpoints, and truncated hits are
allowed to start and stop at any positions of target sequences. Importantly
though, with --anytrunc,
hit E-values will be less accurate because
model calibration does not consider the possibility of truncated hits, so use
it with caution. The --notrunc
option can be used to turn off truncated
hit detection. --notrunc
will reduce the running time of
most significantly for target <seqdb>
include many short sequences.
Truncated hit detection is automatically turned off when the --max,
options are used because
it relies on the use of an accelerated HMM banded alignment strategy that is
turned off by any of those options.
- Help; print a brief reminder of command line usage and all
- Turn on the glocal alignment algorithm, global with
respect to the query model and local with respect to the target database.
By default, the local alignment algorithm is used which is local with
respect to both the target sequence and the model. In local mode, the
alignment to span two or more subsequences if necessary (e.g. if the
structures of the query model and target sequence are only partially
shared), allowing certain large insertions and deletions in the structure
to be penalized differently than normal indels. Local mode performs better
on empirical benchmarks and is significantly more sensitive for remote
homology detection. Empirically, glocal searches return many fewer hits
than local searches, so glocal may be desired for some applications. With
-g, all models must be calibrated, even those with zero basepairs.
- -Z <x>
- Calculate E-values as if the search space size was
<x> megabases (Mb). Without the use of this option, the
search space size is defined as the total number of nucleotides in
<seqdb> times 2, because both strands of each target sequence
will be searched.
- Print help, as with -h , but also include expert
options that are not displayed with -h . These expert options are
not expected to be relevant for the vast majority of users and so are not
described in the manual page. The only resources for understanding what
they actually do are the brief one-line descriptions output when
--devhelp is enabled, and the source code.
- -o <f>
- Direct the main human-readable output to a file
<f> instead of the default stdout.
- -A <f>
- Save a multiple alignment of all significant hits (those
satisfying inclusion thresholds) to the file <f>.
- --tblout <f>
- Save a simple tabular (space-delimited) file summarizing
the hits found, with one data line per hit. The format of this file is
described in section 6 of the Infernal user guide.
- Use accessions instead of names in the main output, where
available for profiles and/or sequences.
- Omit the alignment section from the main output. This can
greatly reduce the output volume.
- Unlimit the length of each line in the main output. The
default is a limit of 120 characters per line, which helps in displaying
the output cleanly on terminals and in editors, but can truncate target
profile description lines.
- --textw <n>
- Set the main output's line length limit to <n>
characters per line. The default is 120.
- Include extra search pipeline statistics in the main
output, including filter survival statistics for truncated hit detection
and number of envelopes discarded due to matrix size overflows.
Reporting thresholds control which hits are reported in output files (the main
output and --tblout)
Hits are ranked by statistical significance
(E-value). By default, all hits with an E-value <= 10 are reported. The
following options allow you to change the default E-value reporting
thresholds, or to use bit score thresholds instead.
- -E <x>
- In the per-target output, report target sequences with an
E-value of <= <x>. The default is 10.0, meaning that on
average, about 10 false positives will be reported per query, so you can
see the top of the noise and decide for yourself if it's really noise.
- -T <x>
- Instead of thresholding per-CM output on E-value, report
target sequences with a bit score of >= <x>.
Inclusion thresholds are stricter than reporting thresholds. Inclusion
thresholds control which hits are considered to be reliable enough to be
included in an output alignment or in a possible subsequent search round, or
marked as significant ("!") as opposed to questionable
("?") in hit output.
- --incE <x>
- Use an E-value of <= <x> as the hit
inclusion threshold. The default is 0.01, meaning that on average, about 1
false positive would be expected in every 100 searches with different
- --incT <x>
- Instead of using E-values for setting the inclusion
threshold, instead use a bit score of >= <x> as the hit
inclusion threshold. By default this option is unset.
Curated CM databases may define specific bit score thresholds for each CM,
superseding any thresholding based on statistical significance alone.
To use these options, the profile must contain the appropriate (GA, TC, and/or
NC) optional score threshold annotation; this is picked up by cmbuild
from Stockholm format alignment files. Each thresholding option has a score of
bits, and acts as if -T <x>
has been applied specifically using each
model's curated thresholds.
- Use the GA (gathering) bit scores in the model to set hit
reporting and inclusion thresholds. GA thresholds are generally considered
to be the reliable curated thresholds defining family membership; for
example, in Rfam, these thresholds define what gets included in Rfam Full
alignments based on searches with Rfam Seed models.
- Use the NC (noise cutoff) bit score thresholds in the model
to set hit reporting and inclusion thresholds. NC thresholds are generally
considered to be the score of the highest-scoring known false positive.
- Use the TC (trusted cutoff) bit score thresholds in the
model to set hit reporting and inclusion thresholds. TC thresholds are
generally considered to be the score of the lowest-scoring known true
positive that is above all known false positives.
Infernal 1.1 searches are accelerated in a six-stage filter pipeline. The first
five stages use a profile HMM to define envelopes that are passed to the stage
six CM CYK filter. Any envelopes that survive all filters are assigned final
scores using the the CM Inside algorithm. (See the user guide for more
The profile HMM filter is built by the cmbuild
program and is stored in
Each successive filter is slower than the previous one, but better than it at
disciminating between subsequences that may contain high-scoring CM hits and
those that do not. The first three HMM filter stages are the same as those
used in HMMER3. Stage 1 (F1) is the local HMM SSV filter modified for long
sequences. Stage 2 (F2) is the local HMM Viterbi filter. Stage 3 (F3) is the
local HMM Forward filter. Each of the first three stages uses the profile HMM
in local mode, which allows a target subsequence to align to any region of the
HMM. Stage 4 (F4) is a glocal HMM filter, which requires a target subsequence
to align to the full-length profile HMM. Stage 5 (F5) is the glocal HMM
envelope definition filter, which uses HMMER3's domain identification
heursitics to define envelope boundaries. After each stage from 2 to 5 a bias
filter step (F2b, F3b, F4b, and F5b) is used to remove sequences that appear
to have passed the filter due to biased composition alone. Any envelopes that
survive stages F1 through F5b are then passed with the local CM CYK filter.
The CYK filter uses constraints (bands) derived from an HMM alignment of the
envelope to reduce the number of required calculations and save time. Any
envelopes that pass CYK are scored with the local CM Inside algorithm, again
using HMM bands for acceleration.
The default filter thresholds that define the minimum score required for a
subsequence to survive each stage are defined based on the size of the
database in <seqdb>
(or the size <x>
(Mb) specified by the -Z <x>
options). For larger databases, the filters are more strict
leading to more acceleration but potentially a greater loss of sensitivity.
The rationale is that for larger databases, hits must have higher scores to
achieve statistical significance, so stricter filtering that removes lower
scoring insignificant hits is acceptable.
The P-value thresholds for all possible search space sizes and all filter stages
are listed next. (A P-value threshold of 0.01 means that roughly 1% of the
highest scoring nonhomologous subsequence are expected to pass the filter.) Z
is defined as the number of nucleotides in the complete target sequence file
times 2 because both strands will be searched with each model.
If Z is less than 2 Mb: F1 is 0.35; F2 and F2b are off; F3, F3b, F4, F4b and F5
are 0.02; F6 is 0.0001.
If Z is between 2 Mb and 20 Mb: F1 is 0.35; F2 and F2b are off; F3, F3b, F4, F4b
and F5 are 0.005; F6 is 0.0001.
If Z is between 20 Mb and 200 Mb: F1 is 0.35; F2 and F2b are 0.15; F3, F3b, F4,
F4b and F5 are 0.003; F6 is 0.0001.
If Z is between 200 Mb and 2 Gb: F1 is 0.15; F2 and F2b are 0.15; F3, F3b, F4,
F4b, F5, and F5b are 0.0008; and F6 is 0.0001.
If Z is between 2 Gb and 20 Gb: F1 is 0.15; F2 and F2b are 0.15; F3, F3b, F4,
F4b, F5, and F5b are 0.0002; and F6 is 0.0001.
If Z is more than 20 Gb: F1 is 0.06; F2 and F2b are 0.02; F3, F3b, F4, F4b, F5,
and F5b are 0.0002; and F6 is 0.0001.
These thresholds were chosen based on performance on an internal benchmark
testing many different possible settings.
There are five options for controlling the general filtering level. These
options are, in order from least strict (slowest but most sensitive) to most
strict (fastest but least sensitive): --max, --nohmm,
(this is the default setting), --rfam.
the filter thresholds will be
database-size dependent. See the explanation of each of these individual
options below for more information.
Additionally, an expert user can precisely control each filter stage score
threshold with the --F1, --F1b, --F2, --F2b,
--F3, --F3b, --F4, --F4b, --F5,
options. As well as turn each stage on or off
with the --noF1, --doF1b, --noF2, --noF2b,
--noF3, --noF3b, --noF4, --noF4b, --noF5,
options. These options are only displayed if the
option is used to keep the number of displayed options with
reasonable, and because they are only expected to be useful to a
small minority of users.
As a special case, for any models in <cmfile>
which have zero
basepairs, profile HMM searches are run instead of CM searches. HMM algorithms
are more efficient than CM algorithms, and the benefit of CM algorithms is
lost for models with no secondary structure (zero basepairs). These profile
HMM searches will run significantly faster than the CM searches. You can force
HMM-only searches with the --hmmonly
option. For more information on
HMM-only searches see the description of the --hmmonly
and the user guide.
- Turn off all filters, and run non-banded Inside on every
full-length target sequence. This increases sensitivity somewhat, at an
extremely large cost in speed.
- Turn off all HMM filter stages (F1 through F5b). The CYK
filter, using QDBs, will be run on every full-length target sequence and
will enforce a P-value threshold of 0.0001. Each subsequence that survives
CYK will be passed to Inside, which will also use QDBs (but a looser set).
This increases sensitivity somewhat, at a very large cost in speed.
- Turn off the HMM SSV and Viterbi filter stages (F1 through
F2b). Set remaining HMM filter thresholds (F3 through F5b) to 0.02 by
default, but changeable to <x> with --Fmid
<x> sequence. This may increase sensitivity, at a significant
cost in speed.
- Use the default filtering strategy. This option is on by
default. The filter thresholds are determined based on the database size.
- Use a strict filtering strategy devised for large databases
(more than 20 Gb). This will accelerate the search at a potential cost to
sensitivity. It will have no effect if the database is larger than 20 Gb.
- Only use the filter profile HMM for searches, do not use
the CM. Only filter stages F1 through F3 will be executed, using strict
P-value thresholds (0.02 for F1, 0.001 for F2 and 0.00001 for F3).
Additionally a bias composition filter is used after the F1 stage (with
P=0.02 survival threshold). Any hit that survives all stages and has an
HMM E-value or bit score above the reporting threshold will be output. The
user can change the HMM-only filter thresholds and options with
--hmmF1, --hmmF2, --hmmF3, --hmmnobias,
--hmmnonull2, and --hmmmax. By default, searches for any
model with zero basepairs will be run in HMM-only mode. This can be turned
off, forcing CM searches for these models with the --nohmmonly
option. These options are only displayed if the --devhelp option is
- --FZ <x>
- Set filter thresholds as the defaults used if the database
were <x> megabases (Mb). If used with <x>
greater than 20000 (20 Gb) this option has the same effect as
- --Fmid <x>
- With the --mid option set the HMM filter thresholds
(F3 through F5b) to <x>. By default, <x> is
- Turn off truncated hit detection.
- Allow truncated hits to begin and end at any position in a
target sequence. By default, 5' truncated hits must include the first
residue of their target sequence and 3' truncated hits must include the
final residue of their target sequence. With this option you may observe
fewer full length hits that extend to the beginning and end of the query
- Turn off the null3 CM score corrections for biased
composition. This correction is not used during the HMM filter stages.
- --mxsize <x>
- Set the maximum allowable CM DP matrix size to
<x> megabytes. By default this size is 128 Mb. This should be
large enough for the vast majority of searches, especially with smaller
models. If cmsearch encounters an envelope in the CYK or Inside
stage that requires a larger matrix, the envelope will be discounted from
consideration. This behavior is like an additional filter that prevents
expensive (slow) CM DP calculations, but at a potential cost to
sensitivity. Note that if cmsearch is being run in <n>
multiple threads on a multicore machine then each thread may have an
allocated matrix of up to size <x> Mb at any given time.
- --smxsize <x>
- Set the maximum allowable CM search DP matrix size to
<x> megabytes. By default this size is 128 Mb. This option is
only relevant if the CM will not use HMM banded matrices, i.e. if the
--max, --nohmm, --qdb, --fqdb,
--nonbanded, or --fnonbanded options are also used. Note
that if cmsearch is being run in <n> multiple threads
on a multicore machine then each thread may have an allocated matrix of up
to size <x> Mb at any given time.
- Use the CYK algorithm, not Inside, to determine the final
score of all hits.
- Use the CYK algorithm to align hits. By default, the
Durbin/Holmes optimal accuracy algorithm is used, which finds the
alignment that maximizes the expected accuracy of all aligned residues.
- --wcx <x>
- For each CM, set the W parameter, the expected maximum
length of a hit, to <x> times the consensus length of the
model. By default, the W parameter is read from the CM file and was
calculated based on the transition probabilities of the model by
cmbuild. You can find out what the default W is for a model using
cmstat. This option should be used with caution as it impacts the
filtering pipeline at several different stages in nonobvious ways. It is
only recommended for expert users searching for hits that are much longer
than any of the homologs used to build the model in cmbuild, e.g.
ones with large introns or other large insertions. This option cannot be
used in combination with the --nohmm, --fqdb or --qdb
options because in those cases W is limited by query-dependent bands.
- Only search the top (Watson) strand of target sequences in
<seqdb>. By default, both strands are searched. This will
halve the database size (Z).
- Only search the bottom (Crick) strand of target sequences
in <seqdb>. By default, both strands are searched. This will
halve the database size (Z).
- --tformat <s>
- Assert that the target sequence database file is in format
<s>. Accepted formats include fasta, embl,
genbank, ddbj, stockholm, pfam, a2m,
afa, clustal, and phylip The default is to autodetect
the format of the file.
- --cpu <n>
- Set the number of parallel worker threads to
<n>. By default, Infernal sets this to the number of CPU
cores it detects in your machine - that is, it tries to maximize the use
of your available processor cores. Setting <n> higher than
the number of available cores is of little if any value, but you may want
to set it to something less. You can also control this number by setting
an environment variable, INFERNAL_NCPU. This option is only
available if Infernal was compiled with POSIX threads support. This is the
default, but it may have been turned off at compile-time for your site or
machine for some reason.
- For debugging the MPI master/worker version: pause after
start, to enable the developer to attach debuggers to the running master
and worker(s) processes. Send SIGCONT signal to release the pause. (Under
gdb: (gdb) signal SIGCONT) (Only available if optional MPI support
was enabled at compile-time.)
- Run in MPI master/worker mode, using mpirun. To use
--mpi, the sequence file must have first been 'indexed' using the
esl-sfetch program, which is included with Infernal, in the
easel/miniapps/ subdirectory. (Only available if optional MPI
support was enabled at compile-time.)
for a master man page with a list of all the individual
man pages for programs in the Infernal package.
For complete documentation, see the user guide that came with your Infernal
distribution (Userguide.pdf); or see the Infernal web page ().
Copyright (C) 2016 Howard Hughes Medical Institute.
Freely distributed under a BSD open source license.
For additional information on copyright and licensing, see the file called
COPYRIGHT in your Infernal source distribution, or see the Infernal web page
The Eddy/Rivas Laboratory
Janelia Farm Research Campus
19700 Helix Drive
Ashburn VA 20147 USA