axe - axe Documentation
Axe is a read de-multiplexer, useful in situations where sequence reads contain
the barcodes that uniquely distinguish samples. Axe uses a rapid and accurate
algorithm based on hamming mismatch tries to competitively match the prefix of
a sequencing read against a set of barcodes. Axe supports combinatorial
For arcane reasons, the name of the axe
binary changed to axe-demux with version 0.3.0. Apologies for the
inconvenience, this was required to make axe installable in Debian and
its derivatives. Command-line usage did not change.
Axe has several usage modes. The primary distinction is between the two
alternate barcoding schemes, single and combinatorial barcoding. Single
barcode matching is used when only the first read contains barcode sequences.
Combinatorial barcoding is used when both reads in a read pair contain
independent (typically different) barcode sequences.
For concise reference, the command-line usage of axe-demux
axe-demux [-mzc2pt] -b (-f [-r] | -i) (-F [-R] | -I)
-m, --mismatch Maximum hamming distance mismatch. [int, default 1]
-z, --ziplevel Gzip compression level, or 0 for plain text [int, default 0]
-c, --combinatorial Use combinatorial barcode matching. [flag, default OFF]
-p, --permissive Don't error on barcode mismatch confict, matching only
exactly for conficting barcodes. [flag, default OFF]
-2, --trim-r2 Trim barcode from R2 read as well as R1. [flag, default OFF]
-b, --barcodes Barcode file. See --help for example. [file]
-f, --fwd-in Input forward read. [file]
-F, --fwd-out Output forward read prefix. [file]
-r, --rev-in Input reverse read. [file]
-R, --rev-out Output reverse read prefix. [file]
-i, --ilfq-in Input interleaved paired reads. [file]
-I, --ilfq-out Output interleaved paired reads prefix. [file]
-t, --table-file Output a summary table of demultiplexing statistics to file. [file]
-h, --help Print this usage plus additional help.
-V, --version Print version string.
-v, --verbose Be more verbose. Additive, -vv is more vebose than -v.
-q, --quiet Be very quiet.
Regardless of read mode, three input and output schemes are supported:
single-end reads, paired reads (separate R1 and R2 files) and interleaved
paired reads (one file, with R1 and R2 as consecutive reads). If single end
reads are inputted, they must be output as single end reads. If either paired
or interleaved paired reads are read, they can be output as either paired
reads or interleaved paired reads. This applies to both successfully
de-multiplexed reads and reads that could not be de-multiplexed.
flag can be used to specify that outputs should be compressed
using gzip compression. The -z
flag takes an integer argument between 0
(the default) and 9, where 0 indicates plain text output ( gzopen
"wT"), and 1-9 indicate that the respective compression level should
be used, where 1 is fastest and 9 is most compact.
The output flags should be prefixes that are used to generate the output file
name based on the barcode's (or barcode pair's) ID. The names are generated
+ barcode ID
. The output file for reads that could not be
demultiplexed is prefix
. The read number is omitted unless the
paired read file scheme is used, and is "il" for interleaved output.
The extension is "fastq"; ".gz" is appended to the
extension if the -z
flag is used.
- The corresponding CLI flags are:
- -f and -F: Single end or paired R1 file input
and output respectively.
- -r and -R: Paired R2 file input and
- -i and -I: Interleaved paired input and
The barcode file is a tab-separated file with an optional header. It is
mandatory, and is always supplied using the -b
command line flag. The
exact format is dependent on barcoding mode, and is described further in the
sections below. If a header is present, the header line must start with either
, or it will be interpreted as a barcode line,
leading to a parsing error. Any line starting with ';' or '#' is ignored,
allowing comments to be added in line with barcodes. Please ensure that the
software used to produce the barcode uses ASCII encoding, and does not insert
a Byte-order Mark (BoM) as many text editors can silently use Unicode-based
encoding schemes. I recommend the use of LibreOffice Calc
(part of a
free and open source office suite) to generate barcode tables; Microsoft Excel
can also be used.
Independent of barcode mode, the -m
flag is used to select the maximum
allowable hamming distance between a read's prefix and a barcode to be
considered as a match. As "mutated" barcodes must be unique, a
hamming distance of one is the default as typically barcodes are designed to
differ by a hamming distance of at least two. Optionally, (using the -p
flag), axe will allow selective mismatch levels, where, if clashes are
observed, the barcode will only be matched exactly. This allows one to process
datasets with barcodes that don't have a sufficiently high distance between
Single barcode mode is the default mode of operation. Barcodes are matched
against read one (hereafter the forward read), and the barcode is trimmed from
only the forward read, unless the -2
command line flag is given, in
which case a prefix the same length as the matched barcode is also trimmed
from the second or reverse read. Note that sequence of this second read is not
checked before trimming.
In single barcode mode, the barcode file has two columns: Barcode
Combinatorial barcode mode is activated by giving the -c
flag on the
command line. Forward read barcodes are matched against the forward read, and
reverse read barcodes are matched against the reverse read. The optimal
barcodes are selected independently, and the barcode pair is selected from
these two barcodes. The respective barcodes are trimmed from both reads; the
command line flag has no effect in combinatorial barcode mode.
In combinatorial barcode mode, the barcode file has three columns:
. Individual barcodes can occur
many times within the forward and reverse barcodes, but barcode pairs must be
option allows the output of per-sample read counts to a
tab-separated file. The file will have a header describing its format, and
includes a line for unbarcoded reads.
Axe uses an algorithm based on longest-prefix-in-trie matching to match a
variable length from the start of each read against a set of 'mutated'
While for most applications in high-throughput sequencing hamming distances are
a frowned-upon metric, it is typical for HTS read barcodes to be designed to
tolerate a certain level of hamming mismatches. Given these sequences are
short and typically occur at the 5' end of reads, insertions and deletions
rarely need be considered, and the increased rate of assignment of reads with
many errors is offset by the risk of falsely assigning barcodes to an
incorrect sample. In any case, reads with more than 1-2 sequencing errors in
their first several bases are likely to be poor quality, and will simply be
filtered out during downstream quality control.
Typically, reads are matched to a set of barcodes by calculating the hamming
distance between the barcode, and the first l bases of a read for a barcode of
length l. The "correct" barcode is then selected by recording either
the barcode with the lowest hamming distance to the read (competitive
matching) or by simply accepting the first barcode with a hamming distance
below a certain threshold. These approaches are both very computationally
expensive, and can have lower accuracy than the algorithm I propose.
Additionally, implementations of these methods rarely handle barcodes of
differing length and combinatorial barcoding well, if at all.
Central to Axe's algorithm is the concept of hamming-mismatch tries. A trie is a
N-ary tree for an N letter alphabet. In the case of high-throughput sequencing
reads, we have the alphabet AGCT
, corresponding to the four nucleotides
of DNA, plus N
, used to represent ambiguous base calls. Instead of
matching each barcode to each read, we pre-calculate all allowable sequences
at each mismatch level, and store these in level-wise tries. For example, to
match to a hamming distance of 2, we create three tries: One containing all
barcodes, verbatim, and two tries where every sequence within a hamming
distance of 1 and 2 of each barcode respectively. Hereafter, these tries are
referred to as the 0, 1 and 2-mm tries, for a hamming distance (mismatch) of
0, 1 and 2. Then, we find the longest prefix in each sequence read in the 0mm
trie. If this prefix is not a valid leaf in the 0mm trie, we find the longest
prefix in the 1mm trie, and so on for all tries in ascending order. If no
prefix of the read is a complete sequence in any trie, the read is assigned to
an "non-barcoded" output file.
This algorithm ensures optimal barcode matching in many ways, but is also
extremely fast. In situations with barcodes of differing length, we ensure
that the longest
acceptable barcode at a given hamming distance is
chosen; assuming that sequence is random after the barcode, the probability of
false assignments using this method is low. We also ensure that short perfect
matches are preferred to longer inexact matches, as we firstly only consider
barcodes with no error, then 1 error, and so on. This ensures that reads with
barcodes that are followed by random sequence that happens to inexactly match
a longer barcode in the set are not falsely assigned to this longer barcode.
The speed of this algorithm is largely due to the constant time matching
algorithm with respect to the number of barcodes to match. The time taken to
match each read is proportional instead to the length of the barcodes, as for
a barcode of length l, at most l + 1 trie level descents are required to find
an entry in the trie. As this length is more-or-less constant and small, the
overall complexity of axe's algorithm is O(n) for n reads, as opposed to O(nm)
for n reads and m barcodes as is typical for traditional matching algorithms
2017, Kevin Murray