Section: Infernal Manual (1)Updated: October 2009Local indexUp
NAME
cmsearch - search a sequence database for RNAs homologous to a CM
SYNOPSIS
cmsearch[options]cmfileseqfile
DESCRIPTION
cmsearch
uses the
covariance model (CM) in
cmfile
to search for homologous RNAs in
seqfile,
and outputs high-scoring alignments.
Currently, the sequence file must be in FASTA format.
CMs are profiles of RNA consensus sequence and secondary structure. A
CM file is produced by the
cmbuild
program, from a given RNA sequence alignment of known
consensus structure.
CM files can be calibrated prior to running
cmsearch
with the
cmcalibrate
program. Searches with calibrated CM files will include
E-values and will use appropriate filter thresholds for
acceleration. It is strongly recommended to calibrate your
CM files before using
cmsearch.
CM calibration is described in more detail below
and in chapters 5 and 6 of the User's Guide.
cmsearch
output consists of alignments of all hits in the database
sorted by decreasing score per sequence
and per strand. That is, all hits for the same sequence and the same
(Watson or Crick) strand are sorted, but hits across sequences or
strands are not sorted.
The threshold for reporting scores is different depending on whether
the CM file has been calibrated or not.
If the CM file has been calibrated, the default reporting threshold is
an E-value of 1.0. This is the threshold at which 1 hit is
expected by chance. It is possible to manually set the threshold to
bit score
<x>
using the
-T <x>
option as described below, or to E-value
<x>
using the
-E <x>
option. The
-E
option will only work if the CM file has been calibrated.
RNA homology search with CMs is slow.
To speed it up,
cmsearch
by default uses two rounds of filters with faster algorithms to prune the
database prior to searching with the slow CM algorithm.
The first round of filtering is faster but less strict
than the second round. First, the full database is searched with the
first round filter, then any hits that survive the first round
are searched with the second round
filter. Finally any hits that survive the first and second round of
filtering are searched with the final round search strategy.
During the filter rounds, hits are padded with a short stretch of
residues on either side prior to searching with the subsequent round.
The exact number of residues is dependent on the size of the model
being searched with.
The first round of filtering is performed with an HMM. If the CM file
is calibrated, the threshold for the HMM filter will be automatically
chosen as an appropriate one as determined in
cmcalibrate.
The minimum threshold that will automatically be chosen is the threshold
that will allow a predicted fraction (0.02 by default, changeable to
<x>
with
--fil-Smin-hmm <x>
) of the database to survive the filter.
The maximum threshold that will automatically be chosen is the
threshold that will allow a predicted fraction (0.5 by default,
changeable to
<x>
with
--fil-Smax-hmm <x>
) of the database to survive the filter. If the threshold from
cmcalibrate
is greater than this maximum fraction, the HMM filter will be turned off and not used.
To ensure that the HMM filter is never turned off and
always uses a threshold that gives this maximum fraction you must use the
--fil-A-hmm
option.
If the model is not calibrated, the default HMM filter threshold is
3.0 bits. The HMM filter threshold can be manually set to bit score
<x>
using the
--fil-T-hmm <x>
option as described below, or to E-value
<x>
using the
--fil-E-hmm <x>
option, or to the bit score that will allow a predicted database fraction of
<x>
to survive the filter using the
--fil-S-hmm <x>
option. The
--fil-E-hmm
and
--fil-S-hmm
options will only work if the CM file has been calibrated.
The HMM filter can be turned off with the
--fil-no-hmm
option.
The second round of filtering is performed with the CM CYK algorithm
(not an HMM) using query-dependent banding (QDB) for acceleration.
Briefly, QDB precalculates regions of the dynamic
programming matrix that have negligible probability based on the query
CM's transition probabilities.
During search, these regions of the
matrix are ignored to make searches faster.
For more information on QDB see
(Nawrocki and Eddy, PLoS Computational Biology 3(3): e56).
The beta paramater is the amount of
probability mass considered negligible during band calculation, lower
values of beta yield greater speedups but also a greater chance of missing
the optimal alignment. The default beta is 1E-10: determined
empirically as a good tradeoff between sensitivity and speed, though
this value can be changed with the
--fil-beta <x>
option.
If the CM file has been calibrated, the QDB filter threshold will be
automatic set to an appropriate value using an ad-hoc procedure (see
the User's Guide). If the CM file has not been calibrated, the default
QDB filter threshold is 0.0 bits.
The QDB filter threshold can be manually set to bit score
<x>
using the
--fil-T-qdb <x>
option as described below, or to E-value
<x>
using the
--fil-E-qdb <x>
option. The
--fil-E-qdb
option will only work if the CM file has been calibrated.
The QDB filter can be turned off with the
--fil-no-qdb
option.
Another way to accelerate
cmsearch
is to run it in parallel with MPI on multiple computers.
To do this, use the
--mpi
option and run
cmsearch
inside a MPI wrapper program such as
mpirun.
For example:
mpirun Ccmsearch--mpi [other options]cmfileseqfile.
The
cmsearch
program must have been compiled in MPI mode for this to work.
See the Installation section of the User's Guide for more information.
The
--forecast <n>
option will estimate how long a search will take for your
cmfile
and
seqfile
on
<n>
processors. Unless you plan on running
cmsearch
in MPI mode,
<n>
should be set as 1.
Another technique for accelerated CM homology search with HMM filters
is the construction and use of a "rigorous filter" HMM which was
developed by Zasha Weinberg and Larry Ruzzo. All hits above a certain
CM bit score threshold are guaranteed to survive the HMM filtering
step. Their implementation of rigorous filters has been included in
previous versions of Infernal, but not in the current version. For
more information see the User's Guide.
OUTPUT
By default,
cmsearch
outputs the alignments of search hits that score above the final search
round threshold. The format of this output is described in the "Tutorial"
section of the User's Guide. This format has purposefully not
been changed from the 0.x versions of Infernal so as not to break
existing parsers. However, it can be augmented with a line of
output that marks non-compensatory (negative scoring) basepairs with
an 'x' by using the
-x
option. Alternatively, only negative scoring non-canonical basepairs
(those other than A:U, U:A, C:G, G:C, U:G, and G:U) are marked if the
-v
option is enabled. These two options were added to facilitate quick
analysis of the secondary structure of hits by eye. Additionally, the
-p
option can be used to annotate the posterior probability of each
aligned residue in the hit alignments as described below.
The
--tabfile <f>
outputs a tabular representation of the hits found by
cmsearch
to the file
<f>.
Each non-# prefixed line of this file corresponds to a hit, and each
such line has 9 fields: "model name" the name of the CM used for the
search, "target name" the name of the target sequence
the hit was found in, "target coord - start": the start position of
the hit in the target sequence, "target coord - stop": the end position of
hit in the target sequence, "query coord - start":
the start position of the hit in the query model, "query coord - stop": the end position of
hit in the query sequence, "bit sc": the bit score of the hit, "E-value":
the E-value of the hit (if available, "-" if not), and "GC" the
percentage of G and C residues in the hit within the target sequence.
cmsearch
tab files can be used as input to the Easel miniapp
esl-sfetch
(included in the easel/miniapp/ subdirectory of infernal) with the
-C -f --tabfile
options to extract all the hits from the target database file to a new
FASTA file. This file can then be aligned to a CM with
cmalign.
OPTIONS
-h
Print brief help; includes version number and summary of
all options, including expert options.
-o <f>
Save the high-scoring alignments of hits to a file
<f>.
The default is to write them to standard output.
-g <f>
Turn on the 'glocal' alignment algorithm, local with respect to the
target database, and global with respect to the model. By default,
the local alignment algorithm is used which is local with respect to
both the target sequence and the model. In local mode, the alignment
to span two or more subsequences if necessary (e.g. if the structures
of the query model and target sequence are only partially shared),
allowing certain large insertions and deletions in the structure
to be penalized differently than normal indels.
Local mode performs better on empirical benchmarks and is
significantly more sensitive for remote homology detection.
Empirically, glocal searches return many fewer hits than
local searches, so glocal may be desired for some applications.
-p
Append posterior probabilities to alignments of hits. For more
information on posterior probabilities see the description of the
-p
option in the manual page for
cmalign.
-x
Annotate negative scoring basepairs and basepairs that include a gap
in the left or right half of the pair (but not both) with x's in the
alignments of hits. The x's appear above the structural annotation in
the alignment output. Basepairs without x's above them are
compensatory with respect to the model. Compensatory mutations are
good evidence for structural homology.
-v
Very similar to -x, but only mark negative scoring basepairs that are
non-canonical basepairs (not an A:U, U:A, C:G, G:C, G:U or U:G), and
mark them with a 'v' instead of an 'x' in the output.
-Z <x>
Calculate E-values as if the target database size was
<x>
megabases (Mb). Ignore the actual size of the database. This option
is only valid if the CM file has been calibrated. Warning: the
predictions for timings and survival fractions will be calculated as
if the database was of size
<x>
Mb, which means they will be inaccurate.
--toponly
Only search the top (Watson) strand of the sequences in
seqfile.
By default, both strands are searched.
--bottomonly
Only search the bottom (Crick) strand of the sequences in
seqfile.
By default, both strands are searched.
--forecast <n>
Predict the running time of the search with provided files and options
and exit,
DO NOT
perform the search. This option is only available
with calibrated CM files. The predictions should be used as rough
estimates and can be fairly inaccurate, especially for highly biased
target databases (for example 80% AT genomes). The value for
<n>
is the number of processors the search will be run on, so
<n>
equal to 1 is appropriate unless you will run
cmsearch
in parallel with MPI.
--informat <s>
Assert that the input
seqfile
is in format
<s>.
Do not run Babelfish format autodection. This increases
the reliability of the program somewhat, because
the Babelfish can make mistakes; particularly
recommended for unattended, high-throughput runs
of Infernal.
<s>
is case-insensitive.
Acceptable formats are: FASTA, EMBL, UNIPROT, GENBANK, and DDBJ.
<s>
is case-insensitive.
--mxsize <x>
Set the maximum allowable DP matrix size to
<x>
megabytes. By default this size is 2,048 Mb.
This should be large enough for the vast majority of alignments,
however if it is not
cmsearch
will exit prematurely and report an error message that
the matrix exceeded it's maximum allowable size. In this case, the
--mxsize
can be used to raise the limit.
--devhelp
Print help, as with
-h,
but also include undocumented developer options. These options are not
listed below, are under development or experimental, and are not
guaranteed to even work correctly. Use developer options at your own
risk. The only resources for understanding what they actually do are
the brief one-line description printed when
--devhelp
is enabled, and the source code.
--mpi
Run as an MPI parallel program. This option will only be available if
Infernal
has been configured and built with the "--enable-mpi" flag (see User's
Guide for details).
EXPERT OPTIONS
--inside
Use the Inside algorithm for the final round of searching. This is
true by default.
--cyk
Use the CYK algorithm for the final round of searching.
--forward
Search only with an HMM. This is much faster but less sensitive than a
CM search. Use the Forward algorithm for the HMM search.
--viterbi
Search only with an HMM. This is much faster but less sensitive than a
CM search. Use the Viterbi algorithm for the HMM search.
-E <x>
Set the E-value cutoff for the per-sequence/strand ranked hit list to
<x>,
where
<x>
is a positive real number. Hits with E-values
better than (less than) or equal to this threshold will be shown. This
option is only available if the CM file has been calibrated. This
threshold is relevant only to the final round of searching performed
after all filters have been used, not to the filter rounds themselves.
-T <x>
Set the bit score cutoff for the per-sequence ranked hit list to
<x>,
where
<x>
is a positive real number.
Hits with bit scores better than (greater than) this threshold
will be shown. This
threshold is relevant only to the final round of searching performed
after all filters have been used, not to the filter rounds themselves.
--nc
Set the bit score cutoff as the NC cutoff value used by Rfam curators
as the noise cutoff score. This is the highest scoring hit found by
this model during Rfam curation that the Rfam curators defined as a
noise (false positive) sequence.
The NC cutoff is defined as
<x>
bits in the original
Stockholm alignment the model was built from
with a line:
#=GF NC <x>
positioned before the sequence alignment. If such a line existed in the
alignment provided to
cmbuild
then the
--nc
option will be available in
cmsearch.
If no such line existed when
cmbuild
was run, then using the
--nc
option to
cmsearch
will cause the program to print an error message and exit.
--ga
Set the bit score cutoff as the GA cutoff value used by Rfam curators
as the gathering threshold. The GA cutoff is defined in a stockholm
file used to build the model in the same way as the NC cutoff (see above),
but with a line:
#=GF GA <x>
--tc
Set the bit score cutoff as the TC cutoff value used by Rfam curators
as the trusted cutoff. The TC cutoff is defined in the stockholm file
used to build the model in the same way as the NC cutoff (see above),
but with a line:
#=GF TC <x>
--no-qdb
Do not use query-dependent banding (QDB) for the final round of
search. By default, QDB is used in the final round of search with
beta = 1E-15, after all filtering is finished.
--beta <x>
For query-dependent banding (QDB) during the final round of search,
set the beta parameter to
<x>
where
<x>
is any positive real number less than 1.0. Beta is the probability
mass considered negligible during band calculation. The default beta
for the final round of search is 1E-15.
--hbanded
Use HMM bands to accelerate the final round of search. Constraints for
the CM search are derived from posterior probabilities from an HMM.
This is an experimental option and it is not recommended for use
unless you know exactly what you're doing.
--tau <x>
Set the tail loss probability during HMM band calculation to
<x>.
This is the amount of probability mass within the HMM posterior
probabilities that is considered negligible. The default value is 1E-7.
In general, higher values will result in greater acceleration, but
increase the chance of missing the optimal alignment due to the HMM
bands. This option only makes sense in combination with
--hbanded
--fil-no-hmm
Turn the HMM filter off.
--fil-no-qdb
Turn the QDB filter off.
--fil-beta
For the QDB filter,
set the beta parameter to
<x>
where
<x>
is any positive real number less than 1.0. Beta is the probability
mass considered negligible during band calculation. The default beta
for the QDB filter round of search is 1E-10.
--fil-T-qdb <x>
Set the bit score cutoff for the QDB filter round to
<x>,
where
<x>
is a positive real number.
Hits with bit scores better than (greater than) this threshold
will survive the QDB filter and be passed to the final round.
--fil-T-hmm <x>
Set the bit score cutoff for the HMM filter round to
<x>,
where
<x>
is a positive real number.
Hits with bit scores better than (greater than) this threshold
will survive the HMM filter and be passed to the next round, either
a QDB filter round, or if the QDB filter is disabled, to
the final round of search.
--fil-E-qdb <x>
Set the E-value cutoff for the QDB filter round.
<x>,
where
<x>
is a positive real number. Hits with E-values
better than (less than) or equal to this threshold will survive and be
passed to the final round. This
option is only available if the CM file has been calibrated.
--fil-E-hmm <x>
Set the E-value cutoff for the HMM filter round.
<x>,
where
<x>
is a positive real number. Hits with E-values
better than (less than) or equal to this threshold will survive and be
passed to the next round, either a QDB filter round, or if the
QDB filter is disable, to the final round of search. This
option is only available if the CM file has been calibrated.
--fil-S-hmm <x>
Set the bit score cutoff for the HMM filter round as the score that
will allow a predicted
<x>
fraction of the database to survive the HMM filter round,
where
<x>
is a positive real number between 0 and 1.
--fil-Smax-hmm <x>
When using automatically calibrated HMM thresholds for a CM file
calibrated with
cmcalibrate,
set the maximum HMM filter threshold as the score that will allow a
predicted
<x>
fraction of the database to survive the filter. If the automatic
threshold from
cmcalibrate
exceeds this value, turn the HMM filter off and do not use it for the
search. By default, this option is ON with the default value of 0.5
used for
<x>.
To modify the behavior of this
option so it does not turn off the HMM filter if exceeded use the
--fil-A-hmm
option described below.
--fil-Smin-hmm <x>
When using automatically calibrated HMM thresholds for a CM file
calibrated with
cmcalibrate,
set the minimum HMM filter threshold as the score that will allow a
predicted
<x>
fraction of the database to survive the filter. By default, this
option is ON with the default value of 0.02 used for
<x>.
Setting
<x> lower will only accelerate the majority of
searches by a small amount.
--fil-A-hmm
Always enforce the maximum HMM filter threshold of
<x>
from
--fil-Smax-hmm <x>.
That is, never turn off the HMM filter, or set its threshold above the
score that will allow a predicted
<x>
fraction of the database to survive. This option is OFF by default.
--hmm-W <n>
Set the HMM window size W (maximum size of a hit) to
<n>.
This option only works in combination with
--forward
or
--viterbi.
By default, W is calculated automatically, but this automatic calculation is
time consuming for large models.
--hmm-cW <x>
Set the HMM window size W (maximum size of a hit) as
<x>
times the consensus length of the CM. The consensus length (clen) of the CM
can be determined using the
cmstat
program. This option only works in combination with
--forward
or
--viterbi.
By default, W is calculated automatically, but this automatic calculation is
time consuming for large models. To find potential full length hits to
the model
<x>
should be greater than 1.0, but values above 2.0 are probably wasteful.
--noalign
Do not calculate and print alignments of each hit, only print locations
and scores.
--aln-hbanded
Use HMM bands to accelerate alignment during the hit alignment stage.
--aln-optacc
Calculate alignments of hits from final round of search using the
optimal accuracy algorithm which computes the alignment that maximizes
the summed posterior probability of all aligned residues
given the model, which can be different from the highest
scoring one.
--tabfile <f>
Create a new output file
<f>
and print tabular results to it.
The format of the tabular results is listed in the
OUTPUT
section. The tabular results can be more easily parsed by scripts than
the default
cmsearch
output. The
esl-sfetch
miniapp included in the easel/miniapps/ subdirectory of infernal has a
--tabfile
option that allows it to read
cmsearch
tab files and fetch the hits reported within them from the target
database into a new sequence file.
--gcfile <f>
Create a new output file
<f>
and print statistics of the GC content of the sequences in
seqfile
to it.
The sequences are partitioned into 100 nt non-overlapping windows, and
the GC percentage of each window is calculated. A normalized histogram
of those GC percentages is then printed to
<f>
This file can be generated even if
cmsearch
is run with
--forecast
and no search is performed.
--rna
Output the hit alignments as RNA sequences alignments. This is true by default.
--dna
Output the hit alignments as DNA sequence alignments.
SEE ALSO
For complete documentation, see the User's Guide (Userguide.pdf) that
came with the distribution; or see the Infernal web page,
http://infernal.janelia.org/.
COPYRIGHT
Copyright (C) 2009 HHMI Janelia Farm Research Campus.
Freely distributed under the GNU General Public License (GPLv3).
See the file COPYING that came with the source
for details on redistribution conditions.
AUTHOR
Eric Nawrocki, Diana Kolbe, and Sean Eddy
HHMI Janelia Farm Research Campus
19700 Helix Drive
Ashburn VA 20147
http://selab.janelia.org/