cmemit - sample sequences from a covariance model
cmemit [options] <cmfile>
The cmemit program samples (emits) sequences from the covariance model(s)
in <cmfile> and writes them to output. Sampling sequences may be
useful for a variety of purposes, including creating synthetic true positives
for benchmarks or tests.
The default is to sample ten unaligned sequences from each CM. Alternatively,
with the -c
option, you can emit a single majority-rule consensus
sequence; or with the -a
option, you can emit an alignment.
<cmfile> may contain a library of CMs, in which case each CM
will be used in turn.
<cmfile> may be '-' (dash), which means reading this input from
stdin rather than a file.
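The invocation modes described above can be sketched as a small Python helper that assembles a cmemit argument list. The helper name build_cmemit_argv and the file name my.cm are hypothetical and used only for illustration; the flags (-c, -a, -N, --seed, -o) and the '-' convention for stdin are the ones documented on this page. The sketch only builds the command line and does not run cmemit.

```python
import shlex


def build_cmemit_argv(cmfile, mode="unaligned", n=None, seed=None, outfile=None):
    """Assemble a cmemit argument list (hypothetical helper, not part of Infernal)."""
    argv = ["cmemit"]
    if mode == "consensus":
        argv.append("-c")              # single majority-rule consensus sequence
    elif mode == "alignment":
        argv.append("-a")              # emit an alignment instead of unaligned sequences
    elif mode != "unaligned":
        raise ValueError(f"unknown mode: {mode!r}")
    if n is not None:
        argv += ["-N", str(n)]         # number of sequences to generate
    if seed is not None:
        argv += ["--seed", str(seed)]  # a nonzero seed makes sampling reproducible
    if outfile is not None:
        argv += ["-o", outfile]        # write to a file rather than stdout
    argv.append(cmfile)                # "-" means read the CM file from stdin
    return argv


# Default mode: unaligned sequences from a (hypothetical) CM file.
print(shlex.join(build_cmemit_argv("my.cm")))
# Alignment of 5 sequences, reproducible, with the CM file read from stdin.
print(shlex.join(build_cmemit_argv("-", mode="alignment", n=5, seed=42)))
```

The resulting list can be passed to subprocess.run on a system where Infernal is installed; when <cmfile> is '-', the CM file contents would need to be supplied on the child process's standard input.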
For models with zero basepairs, sequences are sampled from the profile HMM
filter instead of the CM. However, since these models will be nearly identical
(unless special options were used in cmbuild
to prevent this), using
the HMM instead of the CM will not change the output in a significant way,
unless the -l
option is used. With -l,
the HMM will be
configured for equiprobable model begin and end positions, while the CM will
not. You can force cmemit
to always sample from the CM with the --nohmmonly option.
- -h
- Help; print a brief reminder of command line usage and
available options.
- -o <f>
- Save the synthetic sequences to file <f>
rather than writing them to stdout.
- -N <n>
- Generate <n> sequences. The default value for
<n> is 10.
- -u
- Write the generated sequences in unaligned format (FASTA).
This is the default behavior.
- -a
- Write the generated sequences in an aligned format
(STOCKHOLM) with consensus structure annotation rather than FASTA. Other
output formats are possible with the --outformat option.
- -c
- Predict a single majority-rule consensus sequence instead
of sampling sequences from the CM's probability distribution.
Highly conserved residues (base paired residues that score higher than 3.0
bits, or single stranded residues that score higher than 1.0 bits) are
shown in upper case; others are shown in lower case.
- -e <n>
- Embed the CM emitted sequences in a larger randomly
generated sequence of length <n> generated from an HMM that
was trained on real genomic sequences with various GC contents (the same
HMM used by cmcalibrate). You can use the --iid option to
generate 25% A, C, G, and U sequence instead. The CM emitted sequence will
begin at a random position within the larger sequence and will be included
in its entirety unless the --u5p or --u3p options are used.
When -e is used in combination with --u5p, the CM emitted
sequence will always begin at position 1 of the larger sequence and will
be truncated 5'. When used in combination with --u3p, the CM emitted
sequence will always end at position <n> of the larger
sequence and will be truncated 3'.
- -l
- Configure the CMs into local mode before emitting
sequences. By default the model will be in global mode. In local mode,
large insertions and deletions are more common than in global mode.
- --u5p
- Truncate all emitted sequences at a randomly chosen start
position <n>, by only outputting residues beginning at
<n>. A different start point is randomly chosen for each
sequence.
- --u3p
- Truncate all emitted sequences at a randomly chosen end
position <n>, by only outputting residues up to position
<n>. A different end point is randomly chosen for each
sequence.
- --a5p <n>
- In combination with the -a option, truncate the
emitted alignment at start match position
<n>, by only outputting alignment columns for positions after
match state <n> - 1. <n> must be an integer
between 0 and the consensus length of the model (which can be determined
using the cmstat program). As a special case, using 0 as
<n> will result in a randomly chosen start position.
- --a3p <n>
- In combination with the -a option, truncate the
emitted alignment at end match position
<n>, by only outputting alignment columns for positions
before match state <n> + 1. <n> must be an
integer between 1 and the consensus length of the model (which can be
determined using the cmstat program). As a special case, using 0 as
<n> will result in a randomly chosen end position.
- --seed <n>
- Seed the random number generator with <n>, an
integer >= 0. If <n> is nonzero, stochastic sampling of
sequences will be reproducible; the same command will give the same
results. If <n> is 0, the random number generator is seeded
arbitrarily, and stochastic samplings will vary from run to run of the
same command. The default seed is 0.
- --iid
- With -e, generate the larger sequences as 25% each
A, C, G and U.
- --rna
- Specify that the emitted sequences be output as RNA
sequences. This is true by default.
- --dna
- Specify that the emitted sequences be output as DNA
sequences. By default, the output alphabet is RNA.
- --idx <n>
- Specify that the emitted sequences be named starting with
<modelname>.<n>. By default <n> is 1.
- --outformat <s>
- With -a, specify the output alignment format as
<s>. Acceptable formats are: Pfam, AFA, A2M, Clustal, and
Phylip. AFA is aligned FASTA. Only Pfam and Stockholm alignment formats
will include consensus structure annotation.
- --tfile <f>
- Dump tabular sequence parsetrees (tracebacks) for each
emitted sequence to file <f>. Primarily useful for debugging.
- --exp <x>
- Exponentiate the emission and transition probabilities of
the CM by <x> and then renormalize those distributions before
emitting sequences. This option changes the CM probability distribution of
parsetrees relative to default. With <x> less than 1.0 the
emitted sequences will tend to have lower bit scores upon alignment to the
CM. With <x> greater than 1.0, the emitted sequences will tend to
have higher bit scores upon alignment to the CM. This bit score difference
will increase as <x> moves further away from 1.0 in either
direction. If <x> equals 1.0, this option has no effect relative to
default. This option is useful for generating sequences that are either
more difficult (<x> < 1.0) or easier (<x> > 1.0) for the CM to
distinguish as homologous from background, random sequence.
- --hmmonly
- Emit from the filter profile HMM instead of the CM.
- --nohmmonly
- Never emit from the filter profile HMM, always use the CM,
even for models with zero basepairs.
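The arithmetic behind the --exp option described above (exponentiate, then renormalize) can be illustrated on a single probability distribution. This is a minimal sketch of the idea only, not Infernal's implementation: the function name exponentiate_and_renormalize and the example emission values are hypothetical, and a real CM applies the operation to every emission and transition distribution in the model.

```python
def exponentiate_and_renormalize(probs, x):
    """Raise each probability to the power x, then rescale so the result sums to 1."""
    powered = [p ** x for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


# Hypothetical emission distribution over A, C, G, U at one model position.
emission = [0.7, 0.1, 0.1, 0.1]

sharper = exponentiate_and_renormalize(emission, 2.0)  # x > 1: favors the likeliest residue
flatter = exponentiate_and_renormalize(emission, 0.5)  # x < 1: moves toward uniform
same = exponentiate_and_renormalize(emission, 1.0)     # x = 1: distribution unchanged

for label, dist in (("x=2.0", sharper), ("x=0.5", flatter), ("x=1.0", same)):
    print(label, [round(p, 3) for p in dist])
```

With <x> greater than 1.0 the most probable residue becomes even more probable, so sampled sequences stay closer to the consensus and tend to score higher; with <x> less than 1.0 the distribution flattens, so sampled sequences diverge more and tend to score lower.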
See infernal(1) for a master man page with a list of all the individual
man pages for programs in the Infernal package.
For complete documentation, see the user guide that came with your Infernal
distribution (Userguide.pdf); or see the Infernal web page.
Copyright (C) 2016 Howard Hughes Medical Institute.
Freely distributed under a BSD open source license.
For additional information on copyright and licensing, see the file called
COPYRIGHT in your Infernal source distribution, or see the Infernal web page.
The Eddy/Rivas Laboratory
Janelia Farm Research Campus
19700 Helix Drive
Ashburn VA 20147 USA