anfo-tool - process native ANFO binary files
anfo-tool [ option | pattern ... ]
anfo-tool is used to filter, process and convert the files created by
ANFO. Every pattern on the command line is wildcard expanded, then for
every input file (or the standard input, if no pattern is given),
anfo-tool builds a chain of input filters. It then merges these input
streams in one of several ways and splits the result up into multiple output
streams, each of which can have a chain of output filters applied.
These options apply globally and modify the behavior of the whole program. They
can be placed anywhere in the command line.
- -V, --version
- Print version number and exit.
- -q, --quiet
- Suppress all output except fatal error messages.
- -v, --verbose
- Produce more output, including progress indicators for most
- Produce debugging output in addition to progress
- -n, --dry-run
- Parse command line, optionally print a description of the
intended operations, then exit.
- --vmem X
- Limit virtual memory to X megabytes. If memory runs
out, anfo-tool tries to free up memory by forgetting about big
files, e.g. genomes. Use this option to avoid swapping or out-of-memory
conditions when operations involve big or multiple genomes.
A parameter can be set multiple times on the command line and will overwrite
previous settings. Any filter option that needs a parameter picks up the last
definition that appeared before the filter option.
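The "last definition before the filter option" rule can be illustrated with a small sketch. This is not anfo-tool's actual parser; the function name bind_parameters and the dictionary layout are hypothetical, and only the slope and intercept parameters with their documented defaults are modeled.

```python
def bind_parameters(args):
    """Pair each --filter-score with the parameter values in effect at that point.

    Hypothetical helper for illustration only; anfo-tool's real parser
    (built on popt) is not reproduced here.
    """
    params = {"slope": 7.5, "intercept": 20.0}  # documented defaults
    bound = []
    it = iter(args)
    for arg in it:
        if arg == "--set-slope":
            params["slope"] = float(next(it))      # overwrites the previous setting
        elif arg == "--set-intercept":
            params["intercept"] = float(next(it))
        elif arg == "--filter-score":
            bound.append((arg, dict(params)))      # snapshot of the current values
    return bound


example = ["--set-slope", "5", "--filter-score",
           "--set-slope", "9", "--filter-score"]
for option, seen in bind_parameters(example):
    print(option, seen["slope"])
```

In this sketch the first --filter-score sees a slope of 5 and the second a slope of 9, because each filter takes a snapshot of the parameters at the point where it appears.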
- --set-slope S
- Set the slope parameter to S. The slope is
used together with the intercept where filters apply to alignment
scores; alignments scoring no worse than slope * (length -
intercept) are considered good. The default is 7.5.
- --set-intercept L
- Set the intercept parameter to L. The
intercept is used together with the slope where filters
apply to alignment scores; alignments scoring no worse than slope *
(length - intercept) are considered good. The default is 20.
- --set-context C
- Set the context parameter to C. The context is the
number of surrounding bases of the reference included when printing
alignments in text form. The default is 0.
- --set-genome G
- Set the genome parameter to G. Many filters will
only consider the best alignments to this specific genome if it is set. If
no genome is set, the globally best alignment is used.
- Clear the genome parameter. Filters apply to the globally
best alignment afterwards.
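The score threshold defined by slope and intercept can be written out as a short sketch. The function name is_good is hypothetical, and the code assumes that alignment scores are penalties where lower is better, so that "no worse than" means "less than or equal to"; the text above does not state the direction explicitly.

```python
def is_good(score, length, slope=7.5, intercept=20):
    """Return True if an alignment counts as good under the documented rule.

    Assumption: the score is a penalty (lower is better), so "no worse
    than slope * (length - intercept)" is modeled as "<=".
    """
    return score <= slope * (length - intercept)


# With the defaults, a 60-base read has a threshold of 7.5 * (60 - 20) = 300.
print(is_good(250, 60))
print(is_good(350, 60))
```

Under this reading, reads no longer than the intercept can only pass with a score of zero or below, which is why short reads rarely yield good alignments.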
Filters can be applied before merging the inputs or after splitting the result back up.
- -s, --sort-pos=n
- Sort by alignment position while buffering no more than
n MiB in memory. If a genome is set, alignments to that genome are
- -S, --sort-name=n
- Sort by read name while buffering no more than n MiB in memory.
- -l, --filter-length=L
- Retain alignments only for reads of at least L bases
length. The reads themselves are kept.
- -f, --filter-score
- Retain alignments only if their score is good enough.
- Remove alignments with mapping quality below Q.
- -h, --filter-hit=SEQ
- Keep only reads that have a hit to a sequence named
SEQ. If SEQ is empty, reads are kept if they have any hit.
If the genome parameter is set, only hits to that genome count.
- Delete alignments to SEQ. If SEQ is empty,
all alignments are deleted. If the genome parameter is set, only
alignments to that genome are deleted.
- Mask out bases with quality below Q. Such a base is
replaced by the N ambiguity code.
- Keep only reads of molecules that have been sequenced at
least N times. Reads are considered to come from the same original
molecule if their aligned coordinates are identical.
- Subsample a fraction F of the results. Every read is
independently and randomly chosen to be kept or not.
- Read a list of regions from FILE, then keep only
alignments that overlap an annotated region.
- Read a list of regions from FILE, then keep only
alignments that do not overlap an annotated region.
- -d, --rmdup=Q
- Remove PCR duplicates, clamp quality scores to Q.
Two reads are considered to be duplicates if their aligned coordinates
are identical. If a genome is set, the best alignment to that
genome is used, else the globally best alignment. Both alignments must be
good, as determined by slope and intercept. For a set of
duplicates, a consensus is called, generally increasing the quality
scores. If a resulting quality score exceeds Q, it is set to
Q. This filter requires the input to be sorted by alignment
coordinate on the selected genome.
- --duct-tape=NAME
- Duct-tape overlapping alignments into contigs and
call a consensus for them. If a genome is set, alignments to that
genome are used, else the globally best alignments. This filter requires
input to be sorted by alignment coordinate on the genome. Output is a set
of contigs; every position gets assigned a consensus base, a quality score
and likelihoods for every possible diallele. (It is called duct-taping
because it kind of looks like an assembly, but is not nearly as solid.)
- Invoke the editor ED on the text representation of
the stream's header. This can be used to clean up headers that have
accumulated too much cruft.
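Several of the filters above treat reads with identical aligned coordinates as copies of the same molecule. The following sketch shows that grouping for the "sequenced at least N times" filter. It is an illustration only: the tuple layout (read name, sequence name, start, end) and the function name keep_multiply_sequenced are hypothetical, and the input is assumed to be sorted by alignment coordinate already, as the duplicate removal filter also requires.

```python
from itertools import groupby


def keep_multiply_sequenced(alignments, n):
    """Keep reads whose aligned coordinates occur at least n times.

    alignments: iterable of (read_name, sequence, start, end) tuples,
    sorted by (sequence, start, end). The layout is hypothetical.
    """
    kept = []
    # groupby only groups adjacent items, hence the sorted-input requirement
    for _, group in groupby(alignments, key=lambda a: a[1:]):
        group = list(group)
        if len(group) >= n:
            kept.extend(group)
    return kept


reads = [
    ("r1", "chr1", 100, 150),
    ("r2", "chr1", 100, 150),
    ("r3", "chr1", 120, 170),
]
print(keep_multiply_sequenced(reads, 2))
```

Here r1 and r2 share their coordinates and survive with N = 2, while r3 is dropped.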
Exactly one merging filter should be given on the command line. All filter
options occurring before it are part of the input filter chains; all further
filters become output chains. If no merging filter is given, --concat
is assumed, and all filters are input filters.
- -c, --concat
- Concatenate all input streams in the order they appear on
the command line.
- -m, --merge
- Merge sorted input streams, producing a sorted result. All
inputs must be sorted in the same way.
- -j, --join
- Join input streams and retain the single best hits to each
genome. Every input stream must contain a record for every read; reads are
buffered in memory until all of their hits are collected. This way,
joining works well if all inputs are nearly in the same order. If reads
are missing from some streams, joining them will waste memory.
- Merge many streams such as those produced by running
anfo-sge. Streams that operated on the same reads are joined, then
everything is merged.
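The idea behind --merge, combining streams that are already sorted into one sorted stream without re-sorting, can be sketched with Python's standard heapq.merge. This is not anfo-tool's implementation; the record layout (read name, position) and the function name merge_sorted are hypothetical.

```python
import heapq


def merge_sorted(streams, key):
    """Lazily merge already-sorted streams into one sorted stream.

    All streams must be sorted by the same key, mirroring the
    requirement that all inputs be sorted in the same way.
    """
    return heapq.merge(*streams, key=key)


# Two hypothetical streams of (read_name, position), each sorted by position.
a = [("r1", 10), ("r4", 40)]
b = [("r2", 20), ("r3", 30)]
merged = list(merge_sorted([a, b], key=lambda rec: rec[1]))
print(merged)
```

Because only the head of each stream is compared at any time, memory use stays small, but the result is only sorted if every input really is sorted by the same key.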
If an output option is given on the command line, the current output filter
chain is ended and a new one is started. If no output option is given, a
textual representation of the final stream is written to stdout.
Output options accept - to write to stdout.
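How the position of merging filters and output options divides a command line into chains can be sketched as follows. This is a simplified model, not anfo-tool's parser: the command line is represented as a hypothetical list of (option, value) pairs, only a subset of the output options is listed, the file name good.anfo is made up, and filters that follow the last output option are not modeled because the text does not say what happens to them.

```python
MERGERS = {"-c", "--concat", "-m", "--merge", "-j", "--join"}
# Subset of the documented output options, enough for the example.
OUTPUTS = {"-o", "--output", "--output-text", "--output-fastq"}


def split_chains(options):
    """Split (option, value) pairs into input chain, merger and output chains."""
    input_chain, output_chains, current = [], [], []
    merger = None
    for name, value in options:
        if name in MERGERS:
            merger = name
        elif name in OUTPUTS:
            # An output option ends the current output chain and starts a new one.
            output_chains.append((current, (name, value)))
            current = []
        elif merger is None:
            input_chain.append((name, value))   # before the merging filter
        else:
            current.append((name, value))       # after the merging filter
    if not output_chains:
        # None stands for the default: text representation on stdout.
        output_chains.append((current, None))
    return input_chain, merger or "--concat", output_chains


opts = [
    ("--filter-length", "30"),
    ("--merge", None),
    ("--filter-score", None),
    ("--output", "good.anfo"),
    ("--rmdup", "40"),
    ("--output-fastq", "-"),
]
inputs, merger, outputs = split_chains(opts)
print(inputs)
print(merger)
print(outputs)
```

In this example the length filter lands in the input chain, the score filter feeds the binary output, and the duplicate removal feeds the FastQ output on stdout.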
- -o, --output FILE
- Write native binary stream (a compressed protobuf message)
to FILE. Writing a binary stream and reading it back in is
- --output-text FILE
- Write protobuf text stream to FILE. If the necessary
genomes are available, a textual representation of the alignments is
included. If the context parameter is set, that many additional
bases of the reference upstream and downstream from the alignment are
- Write alignments in SAM format to FILE.
- --output-glz FILE
- Write contigs in GLZ 0.9 format to FILE. Generating
GLZ only works after application of --duct-tape; every contig
becomes a GLZ record.
- --output-3aln FILE
- Write contigs in a table based format to FILE. The
format is still subject to change, see the source code for detailed
- --output-fasta FILE
- Write alignments(!) in FastA format to FILE.
Alignments are written as a pair of reference and query sequence; aligned
coordinates are indicated in the description of the query sequence. If the
context parameter is set, that many additional bases of the
reference upstream and downstream from the alignment are included. This
format is not suggested for any serious use, it exists to support legacy
- --output-fastq FILE
- Write sequences(!) in FastQ format to FILE. Writing
FastQ effectively reconstitutes the input to ANFO if no filtering
was done on the results.
- --output-table FILE
- Write per-alignment statistics to FILE. The file has
three columns: sequence length, alignment score, difference to next
best alignment. It is mainly useful to analyze/visualize the distribution
of alignment scores.
- --stats FILE
- Write simple summary statistics of a whole stream to FILE:
number of aligned sequences, average length, GC content.
- Colon separated list of directories searched for genome and
- Temporary space used for sorting of large files.
The system wide configuration file for
popt(3). anfo-tool identifies itself as "anfo-tool" to
Per user configuration file for
The command line of this tool is way too complicated and its semantics are
counterintuitive. Using anfo-tool is probably best avoided in most cases;
the guile bindings should provide a much more scalable and
easier to understand interface.
Udo Stenzel <firstname.lastname@example.org>