metrics.config - this configuration file is used to define dynamic metrics on
Traffic Server activity. Metrics defined here are available through all normal
means of metrics reporting, including traffic_line.
The configuration file itself is a Lua script. As with normal Lua code, comments
begin with --, you may declare your own functions, and you may define your own
variables.
Metrics are defined by calling the supplied metric generator functions. There is
one for each supported type, and their parameters are identical:
<typefn> '<name>' [[
  <metric generating function body>
]]
In practice, this will look like:
float 'proxy.node.useful_metric' [[
A real metric generator would have something more useful in its body. The
string containing the metric generating function's body (everything between
[[ and ]], which is a multiline literal string in Lua) is stored and then
evaluated as an anonymous function, which receives a single argument: the name
of the metric (in the example above, proxy.node.useful_metric). If necessary,
you can capture this parameter using the ... expression, which returns the
remaining parameters of the enclosing function.
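To make the evaluation step concrete, here is a minimal standalone sketch in plain Lua. It does not reproduce Traffic Server's internals: the compile helper and the body text are hypothetical. Only the idea of compiling a string into an anonymous function and reading its argument through ... comes from the description above.

```lua
-- Standalone sketch, not Traffic Server's actual loader.
-- `compile` is a hypothetical helper name: loadstring exists in Lua 5.1 and
-- LuaJIT, while load accepts strings from Lua 5.2 onward.
local compile = loadstring or load

-- The metric body is a Lua long string, using the same [[ ... ]] syntax.
local body = [[
  local self = ...  -- the vararg expression yields the metric's name
  return 'evaluating ' .. self
]]

-- Compiling the string produces an anonymous function (a Lua chunk).
local generator = assert(compile(body))

-- Calling it with the metric name makes that name available through `...`.
local result = generator('proxy.node.useful_metric')
print(result)
```

Binding the argument to a local such as self right away keeps the rest of the body readable, since ... cannot be referenced from inside nested functions.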
float: A gauge style metric which will return floating point numbers. Floating
point gauge metrics are appropriate for values which may increase or decrease
arbitrarily (e.g. disk usage, cache hit ratios, average document sizes, and so
on).

integer: A gauge style metric which will return integers. Integer gauge metrics
are appropriate for values which may increase or decrease arbitrarily, and do
not need any decimal components.

counter: A metric which will supply integer-only values, used almost
exclusively to report on the number of events, whatever they may be, that have
occurred. Frequent uses are the number of requests served, responses by
specific HTTP status codes, the number of failed DNS lookups, and so on.
All dynamic metrics, like their built-in counterparts, exist within a scope
which determines whether they reflect the state of the current Traffic Server
node, or the state of the entire Traffic Server cluster of which the current
node is a member.
The scope of a metric is derived from its name. All metric names begin with
proxy. followed by either node. or cluster.
A metric in the node scope might be used for the number of open connections to
origin servers on just the current node, whereas its counterpart in the cluster
scope would be the total open connections to origin servers from all Traffic
Server nodes in the cluster, including the current node. (When naming your own
metrics, you are advised both to pick as clear and detailed a metric name as
possible and to ensure there is no conflict with existing metric names.)
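The naming rule can be illustrated with a short plain-Lua sketch. The scope_of helper below is hypothetical and is not part of metrics.config; it only shows how a scope could be read back out of a name that follows the convention.

```lua
-- Hypothetical helper (not part of metrics.config): derive a metric's scope
-- from its name, following the proxy.node. / proxy.cluster. convention.
local function scope_of(name)
  local scope = name:match('^proxy%.(%a+)%.')
  if scope == 'node' or scope == 'cluster' then
    return scope
  end
  return nil  -- the name is not in the node or cluster scope
end

print(scope_of('proxy.node.dns.lookups_per_second'))
print(scope_of('proxy.cluster.http.cache_current_connections_count'))
print(scope_of('my.custom.metric'))
```

The first two calls yield node and cluster respectively, and the last yields nil because the name does not start with proxy.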
Several supporting functions are defined in the default configuration file.
Existing dynamic metrics shipped with metrics.config make extensive use of
these functions, and your own custom metrics may too, as necessary.
One function takes a metric name and returns the sum of that metric for the
entire cluster of which the current node is a member. Memoization is used to
avoid additional cost from calling this function multiple times within a single
metrics pass. The name must be a metric within the node scope.
Two conversion functions each take a value in bytes. One converts and returns
bytes as megabits (bytes * 8 / 1000000). The other converts and returns bytes
as mebibytes (bytes / (1024 * 1024)).
Another function returns the current node's time in milliseconds-from-epoch.
rate_of returns the rate of change over a period of msec milliseconds for the
metric value of key (obtained by invoking the function fn).
This is accomplished by effectively snapshotting the value of the metric at the
beginning and end of the given period expressed by msec, multiplying their
difference by 1,000 and dividing that by msec.
A companion function returns the rate of change for the past 10 seconds for the
metric key, as calculated by function fn. It simply wraps rate_of and supplies
an msec value of 10 * 1000.
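The arithmetic behind these helpers is small enough to sketch in standalone Lua. The function names below are hypothetical and chosen only for this example, as are the sample values. The formulas themselves are the ones stated above.

```lua
-- Standalone sketch of the arithmetic described above. The function names
-- here are hypothetical and chosen for this example only.

-- bytes -> megabits: 8 bits per byte, 1,000,000 bits per megabit.
local function to_megabits(bytes)
  return bytes * 8 / 1000000
end

-- bytes -> mebibytes: 1024 * 1024 bytes per mebibyte.
local function to_mebibytes(bytes)
  return bytes / (1024 * 1024)
end

-- Per-second rate from two snapshots taken msec milliseconds apart:
-- multiply the difference by 1,000, then divide by msec.
local function per_second_rate(first_value, last_value, msec)
  return (last_value - first_value) * 1000 / msec
end

print(to_megabits(2500000))                      -- 20 megabits
print(to_mebibytes(5 * 1024 * 1024))             -- 5 mebibytes
print(per_second_rate(10000, 10250, 10 * 1000))  -- 25 per second
```

Multiplying by 1,000 before dividing by msec is what turns a difference measured over milliseconds into a per-second figure.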
For illustrative purposes, a few of the dynamic metric definitions you may find
in your Traffic Server installation's default metrics.config are explained
here. The actual file will contain many more definitions, and of course you may
add your own as well.
The simplest example is a dynamic node metric which does nothing but return the
current value for an underlying process metric:
counter 'proxy.node.http.user_agents_total_documents_served' [[
This uses the built-in function counter, which takes two parameters: the name
of the dynamic metric to create followed by the function used to calculate the
value. In this case, the function body is just a return of the named,
underlying process statistic. No calculations, aggregates, or other processing
are performed.
Slightly more involved than just returning a point-in-time value from a given
statistic is calculating the rate of change:
integer 'proxy.node.dns.lookups_per_second' [[
local self = ...
function() return proxy.process.dns.total_dns_lookups end
Similar to the previous example, we are returning another metric's value, but in
this case we do so within a function that we're passing into the ten-second
rate helper. That helper, explained earlier, wraps rate_of, which tracks the
given metric over a specific interval and returns the average per-second rate
of change, obtaining the values it uses to calculate this rate by invoking the
function passed to it.
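The role of the passed-in function is easier to see in a standalone sketch. Everything below is hypothetical: sketch_rate, its now_ms parameter, and the simulated counter exist only for this example, and the helper shipped with Traffic Server may store and refresh its snapshots differently.

```lua
-- Hypothetical tracker, written for illustration only.
local snapshots = {}

-- msec   : length of the sampling period in milliseconds
-- key    : identifies whose snapshots these are (the metric's name)
-- fn     : function that returns the current value of the underlying metric
-- now_ms : current time in milliseconds, passed in so the sketch is testable
local function sketch_rate(msec, key, fn, now_ms)
  local snap = snapshots[key]
  if snap == nil then
    -- First call: record a starting snapshot; no rate is known yet.
    snapshots[key] = { time = now_ms, value = fn(), rate = 0 }
    return 0
  end
  local elapsed = now_ms - snap.time
  if elapsed >= msec then
    -- Period complete: difference * 1000 / elapsed milliseconds
    -- (elapsed equals msec when sampled exactly on the period boundary).
    local value = fn()
    snap.rate = (value - snap.value) * 1000 / elapsed
    snap.time, snap.value = now_ms, value
  end
  return snap.rate
end

-- Simulated underlying counter standing in for a process statistic.
local lookups = 0
local function sample() return lookups end

local key = 'proxy.node.dns.lookups_per_second'
local first = sketch_rate(10 * 1000, key, sample, 0)
lookups = 250
local second = sketch_rate(10 * 1000, key, sample, 10 * 1000)
print(first, second)
```

Passing a function rather than a value lets the tracker decide when to sample, so the underlying statistic is only read at the points where a snapshot is needed.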
A more complicated example involves calculating the variance in the rate of
change of an underlying statistic over a given period of time. This is not an
average of a statistic, nor is it just the raw delta between two samplings of
that statistic. While it is not appropriate for learning how much of an event
has occurred, it is useful for learning how erratic or unstable the frequency
of that event is.
In other words, a large absolute value indicates a deviance from the usual
pattern of behavior/activity. For example, if your Traffic Server cache (using
the example dynamic metric function below) sees between 10,000 and 10,250
HostDB hits every 10 seconds, the value returned by this metric will remain
fairly small. If all of a sudden 50,000 hits make it to HostDB in the span of
that same averaging interval, this value will increase significantly. This
could then be used to trigger various alerts that something might be up with
HostDB lookups on the Traffic Server cluster.
integer 'proxy.node.hostdb.total_hits_avg_10s' [[
local self = ...
function() return proxy.process.hostdb.total_hits end
The catch is that if the dramatic increase is actually the new norm, the metric
will return to emitting small absolute values again - even though the
statistic underneath is now consistently and significantly higher or lower
than it used to be. If what you are trying to measure is the stability of a
metric, though, that is a good thing in the long term.
Using a very simplified version of the Traffic Server cache hit reporting, we
can demonstrate taking a metric which expresses the occurrence of one type of
event within a set of possibilities and converting its absolute value into a
ratio of that set's total.
In this example, we assume we have three cache hit states (misses, hits, and
revalidates) and that each is tracked in its own metric. The names used here
are not the real metric names in Traffic Server, and there are much finer
grained reporting states available, but we'll use these for brevity.
float 'proxy.node.cache.hits_ratio' [[
( proxy.node.cache.hits +
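The conversion from an absolute count to a ratio can be sketched in standalone Lua. The counts table and the ratio_of helper are hypothetical, and the numbers are made up for illustration; they are not real Traffic Server values.

```lua
-- Hypothetical counts for the three states named in the text.
local counts = { hits = 750, misses = 200, revalidates = 50 }

-- Ratio of one state's count to the total across all states.
local function ratio_of(state, all)
  local total = 0
  for _, count in pairs(all) do
    total = total + count
  end
  if total == 0 then
    return 0  -- avoid dividing by zero before any events are recorded
  end
  return all[state] / total
end

print(ratio_of('hits', counts))  -- 750 of 1000 events, i.e. 0.75
```

Guarding against a zero total matters for a freshly started node, where none of the underlying counters has recorded an event yet.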
When running a Traffic Server cluster of multiple nodes, there are many metrics
which are useful to see at both the node and cluster level. Dynamic metrics
make it very easy to collect the metric's value for every node in the cluster
and return the sum, as seen here with cache connections:
counter 'proxy.cluster.http.cache_current_connections_count' [[
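How a memoized cluster-wide sum might behave can be sketched in standalone Lua. This is an assumption-laden illustration, not the shipped implementation: node_values, sum_across_nodes, start_new_pass, and the metric name proxy.node.example_count are all hypothetical, and a local table stands in for the other nodes in the cluster.

```lua
-- Hypothetical per-node values; a real cluster would supply these from its
-- member nodes rather than from a local table.
local node_values = {
  { ['proxy.node.example_count'] = 12 },
  { ['proxy.node.example_count'] = 30 },
  { ['proxy.node.example_count'] = 8 },
}

local sums_computed = 0  -- counts how often the full sum is recalculated
local memo = {}          -- results remembered for the current metrics pass

-- Sum a node-scope metric across every node, reusing a remembered result
-- when the same name is requested again within the same pass.
local function sum_across_nodes(name)
  if memo[name] == nil then
    sums_computed = sums_computed + 1
    local total = 0
    for _, node in ipairs(node_values) do
      total = total + (node[name] or 0)
    end
    memo[name] = total
  end
  return memo[name]
end

-- Forget remembered results so the next pass sees fresh values.
local function start_new_pass()
  memo = {}
end

local a = sum_across_nodes('proxy.node.example_count')
local b = sum_across_nodes('proxy.node.example_count')
print(a, b, sums_computed)
```

Both calls return 50 while the sum is computed only once. Calling start_new_pass before the next round forces a recalculation.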
The following resources may be useful when writing dynamic metrics: