colmux - multiplex communications to multiple systems running collectl from a
colmux [-command "collectl-switches... [-p filespec]]" [-address
addr1[,addr2,...]|-addr filename] [-cols col1[,col2...]] | [-column num]
This utility gathers up data generated by collectl from multiple systems and
multiplexes it into a single consolidated format. It runs in essentially 2
distinct modes. The first is known as real-time, because data is retrieved and
displayed in real time. The second is playback mode, because data is played
back from existing collectl data files.
There are also 2 general formats for the data being displayed. The first is a
multi-line display in which the data is displayed in the native form that
collectl displays it, except it is sorted by a distinct column, essentially
allowing one to see the TOP producers of that data. The second format is a
single line display in which one or more distinct data elements from each
source is displayed on the same line. This latter format is never sorted, but
rather positionally organized by the name of the system that generated it.
Collectl will then be executed, using any optional switches specified by
-command, on each of the systems specified by -address. If the target of that
switch is a filename rather than a list of hosts, the addresses are read from
that file, and if -address is not specified collectl runs on the local system.
See collectl for details of the various switches. In some cases certain
collectl switches will not make sense in a colmux environment and if chosen
will generate an error. Further, if hosts are specified with -address, they
should be individual addresses or hostnames separated by commas. In turn, any
of them can be in what those familiar with pdsh would recognize as -w format.
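As an illustration of the address format, the following Python sketch expands a list such as n[1-10,15],abc into individual hostnames. It is a simplified reading of the pdsh -w range syntax and is not colmux's own parser; the function names split_top_level and expand_hosts are hypothetical.

```python
import re


def split_top_level(spec):
    """Split on commas that are NOT inside [...] brackets."""
    parts, depth, current = [], 0, []
    for ch in spec:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return [p for p in parts if p]


def expand_hosts(spec):
    """Expand entries like 'n[1-10,15]' into n1..n10 and n15 (simplified)."""
    hosts = []
    for item in split_top_level(spec):
        m = re.fullmatch(r"(.*?)\[([^\]]+)\](.*)", item)
        if not m:
            hosts.append(item)  # plain hostname or address
            continue
        prefix, ranges, suffix = m.groups()
        for r in ranges.split(","):
            lo, _, hi = r.partition("-")
            hi = hi or lo  # a single number such as '15'
            # Keep zero padding, e.g. n[01-03] -> n01, n02, n03.
            width = len(lo) if lo.startswith("0") else 0
            for n in range(int(lo), int(hi) + 1):
                hosts.append(f"{prefix}{str(n).zfill(width)}{suffix}")
    return hosts


print(expand_hosts("n[1-10,15],abc"))
```

Note that the comma inside the brackets must not be treated as a host separator, which is why the split is bracket-aware.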
Colmux will then execute the collectl command, gather the results from all
sources for a particular interval and display them one result per line, sorted
by the specified column OR all on the same line in groups specified by -cols.
The number of lines displayed is set to the size of the terminal window by
default, but can be changed using -lines. The one exception is the use of
-nosort which only applies to the playback of existing collectl raw files. In
this mode all records for a particular interval will be displayed and the
sorting bypassed, making this a speedy and convenient mechanism for gathering
all data from all systems in one place for potential further processing.
Colmux will never modify the size of the terminal window so to see more or wider
lines either expand the window or override the number of display lines and run
it again. If the number of display lines is set greater than the terminal height
or 0, colmux will no longer overlay the previous window and simply run in a
continuous scrolling mode.
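The sort-and-truncate behavior described above can be pictured with a small Python sketch. It assumes, purely for illustration, that each record has already been parsed into a host name and a list of numeric columns, that columns are numbered from 1, and that the largest values are shown first. None of this is taken from the colmux source, and top_lines is a hypothetical name.

```python
def top_lines(rows, column, lines, reverse=False):
    """Return the first `lines` records sorted by a 1-based column.

    rows    -- list of (host, [values]) tuples for one interval
    column  -- 1-based index into the values list (assumption)
    lines   -- how many records to keep, e.g. the terminal height
    reverse -- flip the default (largest first) order
    """
    ordered = sorted(rows, key=lambda r: r[1][column - 1], reverse=not reverse)
    return ordered[:lines]


interval = [("n1", [12, 3]), ("n2", [85, 1]), ("n3", [40, 9])]
for host, values in top_lines(interval, column=1, lines=2):
    print(host, values)
```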
Specify any combination of addresses as
hostnames OR in pdsh -w format OR a filename containing a list of
hostnames/addresses, 1 per line. You MUST have passwordless ssh access to
these nodes. If a different username is required, be sure to specify addresses
in username@host format noting you do not have to have the same username on
each host. If specified, these usernames will override those specified with
the -username switch. rsh access is not supported.
One can specify virtually any collectl command
here, in either real-time or playback mode. Some switches may only be used
during one mode or the other and colmux will usually let you know if you
specify an invalid combination or an otherwise restricted switch. Only those
directly affecting colmux are listed below:
Limit the timeframe for data being played
back, noting you can include both the from and thru times with the --from
switch if you separate them with a hyphen.
This is a "magic" switch in that it
not only tells collectl how to display dates/times (no other options are
permitted using -o other than those from the set [dDTm]), it also tells colmux
how to display dates/times.
In single line mode, the timestamp will either come from the host system in
real-time mode OR the first host when run in playback mode. This is the most
common use/need for this switch. But be careful in choosing column numbers
with -cols as the position of the data shifts by 1 when time is included and
by 2 if date and time are. Using -test will correctly show the shifted
positions but only if you include -o with the command at the same time you use
In real-time/top mode this switch is not allowed since colmux simply reports the
current time of the system it is running on.
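The column shift mentioned above for -cols can be expressed as a small lookup. This Python sketch only restates the rule given in the text (shift by 1 when time is included, by 2 when date and time are); the names OFFSETS and shifted_column are hypothetical.

```python
# Number of leading columns added by the timestamp, per the text above.
OFFSETS = {"none": 0, "time": 1, "datetime": 2}


def shifted_column(column, stamp="none"):
    """Return the column number to use once timestamps are displayed.

    column -- column number as counted WITHOUT any timestamp
    stamp  -- 'none', 'time', or 'datetime'
    """
    return column + OFFSETS[stamp]


print(shifted_column(6, "time"))
print(shifted_column(6, "datetime"))
```

Running with -test remains the reliable way to confirm the numbering, since it shows the real header.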
When playing back multi-line formatted data from one or more files, a
timestamp for each interval is reported, consisting of the time of that
interval. When this switch is included, each line will be tagged with an
appropriate timestamp since on rare occasions they may not necessarily all be
This switch tells colmux to run in playback
mode. The filename should include the directory location and is usually
specified with wild cards, limiting the selected file(s) to a specific date.
When those files are on the same host (-address is not specified), they may be
for multiple hosts, but when the files are on remote hosts they must all be
for that unique host. If the file specification includes the string TODAY
or YESTERDAY, it will be replaced with *yyyymmdd* for that date.
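A minimal Python sketch of the TODAY/YESTERDAY substitution follows. It assumes the asterisks in *yyyymmdd* are literal shell wildcards placed around the date, which the text suggests but does not state outright; expand_date_keywords is a hypothetical name and not part of colmux.

```python
from datetime import date, timedelta


def expand_date_keywords(filespec, today=None):
    """Replace TODAY/YESTERDAY with a wildcarded yyyymmdd date string."""
    today = today or date.today()
    yesterday = today - timedelta(days=1)
    # Assumption: the surrounding '*' characters are wildcards.
    filespec = filespec.replace("YESTERDAY", f"*{yesterday:%Y%m%d}*")
    filespec = filespec.replace("TODAY", f"*{today:%Y%m%d}*")
    return filespec


print(expand_date_keywords("/var/log/collectl/YESTERDAY"))
```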
Run collectl in plot-format. This allows one
to specify just about any combination of subsystems since all data is always
displayed on a single line. However, due to the lack of formatting, this also
makes no sense for multi-line displays and is therefore only supported in
Show a brief help message and exit.
By default, colmux sets the hostwidth to 8,
unless it sees something wider and for most situations this is sufficient.
However, if one specifies hostnames that are aliases of the longer hostname,
colmux has no way of knowing the real hostlengths until after it starts
receiving data from collectl and the formatting will be off if the hostnames
are longer than the default. To overcome this problem, use this switch to
force the hostname to be wider.
Change the number of lines that are displayed
for each interval in multi-line mode. The default will be determined by the
terminal size returned by the linux resize command if present. If that command
is not present, the size will be initially set to 24. If -lines is greater
than the terminal size or 0, top-like behavior will not be used when in
Single-line format controls the number of lines displayed between headers. A
value of 0 will only display the header one time.
Colmux uses brute-force screen formatting,
that is it generates its own VT100 escape sequences to clear lines and/or move
the cursor. On some occasions you may want to disable these sequences if you
wish to recode the output and do your own post-processing of it. This switch
will do just that.
Sometimes a remote version of collectl is
already using the default socket. This allows one to start another instance
and override that value.
This tells colmux to execute the specified
collectl command either locally or on the first remote system specified by
-address, print the associated header with the selected column(s) highlighted
and also include each column name along with its ordinal number, making it
fairly easy to make sure you've selected the right column(s).
Use this username for ALL ssh commands. It can
be overridden for specific hosts by specifying them with the -address switch
in username@host format.
Display the version and exit. It will also
report if Term::ReadKey is installed and if so what its version number
Playback Mode Specific
The following additional switches only apply to playback mode. There are no
real-time mode specific switches.
Introduce a delay between intervals in
seconds. You can specify fractional values. Not using this switch will cause
the output to be displayed as fast as it can be rendered.
Move the cursor to the home position (upper
left-hand corner) of the display to use a top-like display format. This ONLY
applies to multi-line mode when in playback mode and provides a mechanism for
displaying recorded data in a top-like fashion.
When playing back files for multiple hosts on
the local system, sometimes you do not want to play back ALL the host files.
This filter allows you to specify only those hosts which you want to process.
The format of the list of addresses is specified in the same way as -address
except that you cannot specify a filename.
Intended primarily for output that would be
redirected to a file, do not sort or include any escape sequences in the
When there is more output than will fit on the
screen, colmux includes the text:
Displaying: lines xx thru yy out of zz
on the right-side of the top line of the display, where xx is typically 1.
However, once colmux is running, one might want to look at subsequent lines, i.e.
those below the bottom of the screen and therefore invisible. If the ReadKey
module is installed, one can simply use the PageDown key to move down the
display and the PageUp key to move in the other direction. If ReadKey is not
installed, typing the multi-key sequences pd<ENTER> or pu<ENTER>
will cause the same thing to happen.
When you wish to change the sort column and
the arrow keys aren't available to you, it may be cumbersome to identify the
number of the column to type in followed by RETURN. This tells colmux to
display the numbers over each column eliminating the need to manually count
them and find the one you want.
Set the sort column to this number. The column
numbering is determined by the columns returned by collectl for the requested
command. Since date/time columns are optional for non-plot data, their
inclusion will change the numbering of the columns so if you are not sure you
selected the correct column, you should first execute your command with -test
You can also change the column number interactively with the RIGHT/LEFT arrow
keys IF the ReadKey module is installed (see colmux -version) OR simply type
it in followed by the <ENTER> key.
There is a real odd case in which you might
want to pipe colmux real-time output to a script for further processing.
However, if you do this you can't read the final line with a routine that
expects a terminating CR, like python's readline(). Rather, that last line and
the one that follows will be returned as one long string. This switch tells
colmux to insert that final CR, which WILL mess up the screen under normal
operations, so be forewarned.
There are times one has long hostnames which
can either take up valuable screen real estate or are simply painful to look
at. This switch may evolve over time and is currently targeted at hostnames
that have repeating parts along with a unique part, separated by a character
such as a hyphen. This switch allows you to specify a single character
followed by the piece of the hostname you'd like to see displayed. For
example, if you have a hostname like aaa-bbbb-cccc-dddd, -hostformat -:3 will
cause the cccc piece to be displayed.
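The -hostformat example above amounts to splitting the hostname on the given character and picking a piece by position. A Python sketch of that reading is shown here; format_hostname is a hypothetical name, and falling back to the full hostname when the piece does not exist is an assumption, not documented behavior.

```python
def format_hostname(hostname, hostformat):
    """Apply a 'char:number' format such as '-:3' to a hostname."""
    sep, _, index = hostformat.partition(":")
    pieces = hostname.split(sep)
    position = int(index)  # pieces are counted from 1, as in the example
    if 1 <= position <= len(pieces):
        return pieces[position - 1]
    return hostname  # assumption: leave the name alone if out of range


print(format_hostname("aaa-bbbb-cccc-dddd", "-:3"))  # prints cccc
```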
Do not highlight the selected column. This may
be useful when redirecting output to a file and you do not want the associated
escape sequences to be written to it.
Reverse the default sort order. You can also
change the direction of the sort interactively with the UP/DOWN arrow keys IF
the ReadKey module is installed (see colmux -version)
OR simply type the r key and <ENTER>.
Do not display any rows with 0 in the sort
column. You can also type z<ENTER> interactively.
Divide each column by 1000 before
Divide each column by 1024 before
Remap large numbers to a smaller number of
values by taking the log10 of them and further transforming by the following
mapping: 0,1 to 0, 10 to 10, 100 to 20, 1000 to 30, 10000 to 40, ... 1e9 to
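The mapping points listed above are consistent with multiplying the log10 of the value by 10. A minimal Python sketch under that assumption follows; the function name log10_remap is hypothetical, and the rounding applied to values that fall between exact powers of ten is a guess rather than documented behavior.

```python
import math


def log10_remap(value):
    """Map 0 and 1 to 0, 10 to 10, 100 to 20, 1000 to 30, and so on."""
    if value <= 1:
        return 0  # 0 and 1 both map to 0 per the text
    # Assumption: in-between values are rounded to the nearest integer.
    return round(10 * math.log10(value))


for v in (0, 1, 10, 100, 1000, 10000):
    print(v, log10_remap(v))
```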
Group all data together for each host by
column number(s). As with -column, you can confirm the correct column(s) have
been selected by first running with -test.
Do not show data for individual hosts, just
display the totals.
Do not divide the specified column numbers by
1000 or 1024 when col1000 or colk is specified, and do not apply the collog10
transformation to them when it is specified. A typical usage is if you want to
look at cpu loads as well as network or disk stats, in which case you may want
to divide the latter by 1024 but not the cpu.
Do not include the instance portion (and
surrounding brackets) in totals column headers.
Include the totals for each column to the
Set the output columns to this width,
typically used in conjunction with -col1000 or colk to allow more hosts to fit
onto the same line. It can also be used if the host names are too narrow for
column headers and you have room to display wider names.
Exception Reporting Specific
In single-line format, rather than wait for all hosts to report their data,
colmux simply reports the last data seen when the time to generate a line of
output has come. In most cases, these do reflect the most recent data values
but in times of load, the data may be late getting to colmux and so a previous
value may be reported. If the age of that data exceeds a defined number of
intervals, the default is currently 2, an exception value will be reported of
-1. At other times it has been seen where kernel/driver bugs may cause
incorrect values to be reported as negative numbers and those values are also
reported as -1. Both the age and exception values can be changed with the
When initially starting up and all hosts have
not yet reported any data, colmux will display a -1 to indicate no data has
been seen yet. If during processing a host fails to report in -age intervals,
the default is 2, colmux will also report a -1 indicating the data is
In some cases, there could be erroneous data
reported as negative numbers (though sometimes negative numbers are valid).
When specified, replace any negative numbers with this value.
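The exception rules described in this section can be summarized in a few lines of Python. This is only a sketch of the stated behavior: data older than the allowed number of intervals (default 2) or not yet seen is reported as -1, and negative values are replaced only when a replacement is given. The function and parameter names are hypothetical and do not correspond to colmux internals.

```python
def reported_value(value, age, max_age=2, stale=-1, negative=None):
    """Decide what to print for one host/column in single-line format.

    value    -- last value seen, or None if the host has not reported yet
    age      -- number of intervals since that value arrived
    max_age  -- intervals after which data counts as stale
    stale    -- what to print for missing or stale data
    negative -- replacement for negative values, or None to keep them
    """
    if value is None or age > max_age:
        return stale
    if value < 0 and negative is not None:
        return negative
    return value


print(reported_value(512, age=0))
print(reported_value(512, age=3))
print(reported_value(-7, age=0, negative=0))
```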
This switch allows you to change the -1 that
is normally reported for missing or stale data to the specified value, most
The following switches are intended more for diagnostic purposes than normal
operation, though are also worth using on appropriate occasions.
This switch is for generating diagnostic
information at various levels. It is actually a bit mask, whose values are
listed at the beginning of colmux itself. Perhaps the most useful value is 1
as it will cause colmux to display all the remote commands issued to each host
in the address list and can often reveal problems when things don't seem to be
This switch was initially included in an
earlier version when remote host checking was causing problems in some cases
and by skipping those checks, colmux would run more reliably. While it is felt
that as of V3.2.0 these reachability checks are now reliable and should not be
skipped, this switch has been left in place.
By default, and when -nocheck is not specified,
colmux checks the versions of all collectl instances against that of the first
node found to be running collectl and if different, reports the mismatch. This
switch suppresses that warning.
When a connection is received from an unexpected address, a warning is also
reported and the request promptly ignored. This switch suppresses those
messages as well. For more information on problems connecting, see CONNECTION
By default, when a node is found to not be
reachable, colmux will remove it from its list of hosts and continue
execution. This switch will tell colmux to exit when all hosts are not
There are 2 switches whose descriptions don't really fit anywhere else:
On rare occasions, such as testing a patch to
collectl in a copy NOT in /usr/bin, you may want to tell colmux to use that
copy instead of the standard one. Use this switch to point to that copy.
Naturally that copy must exist in that location on all systems.
Colmux uses ssh to start collectl on each
remote machine and then communications between collectl and colmux occur over
a socket. Normally, ssh is configured to timeout after an interval of
inactivity, such as 30 minutes, which means a long-running colmux session will
begin to lose connections when this interval is reached. By specifying a
keepalive interval, you're telling ssh to send a periodic keepalive to the
other end so that connection doesn't get dropped.
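To make the keepalive idea concrete, the Python sketch below builds an ssh argument list that uses the standard OpenSSH client option ServerAliveInterval. Whether colmux passes the keepalive to ssh in exactly this way is an assumption, as is the helper name ssh_command. The username handling mirrors the rule stated earlier that a username@host address overrides -username.

```python
def ssh_command(host, remote_command, keepalive=None, username=None):
    """Build an argv list for running a command on a remote host."""
    argv = ["ssh"]
    if keepalive:
        # ServerAliveInterval is a real OpenSSH client option (seconds).
        argv += ["-o", f"ServerAliveInterval={keepalive}"]
    # A username embedded in the address takes precedence over -username.
    if username is None or "@" in host:
        target = host
    else:
        target = f"{username}@{host}"
    argv += [target, remote_command]
    return argv


print(ssh_command("n1", "collectl -sZ", keepalive=60))
```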
Tell remote collectls to open a socket on this
address instead of the preselected one. For more details on this, see
By default, collectl waits up to 10 seconds
for remote instances of collectl to connect back. On slower networks or when a
very large number of instances have been started, they may fail to connect
back in time. This switch will extend that timeout, but it also requires
collectl V3.6.4 be used because earlier versions do not support this
When colmux starts up and checks the
connectivity to all the machines specified by -addr, it also gets their
current date/time and using that computes the range of system times across all
nodes. If that time is found to be more than -timerange seconds, colmux
generates a warning as this difference could cause reporting problems. One can
increase the range to get rid of the message (not recommended unless other
factors are preventing nodes from responding quickly enough to the date
command) OR suppress the warning with -quiet.
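The time range check can be sketched in Python as the difference between the latest and earliest clock seen across the nodes. The names clock_spread and check_timerange are hypothetical, the timestamps are assumed to be seconds since the epoch, and the warning text is invented for illustration rather than copied from colmux.

```python
def clock_spread(node_times):
    """Return the range, in seconds, of the clocks reported by all nodes."""
    return max(node_times.values()) - min(node_times.values())


def check_timerange(node_times, timerange, quiet=False):
    """Return a warning string if the spread exceeds timerange, else None."""
    spread = clock_spread(node_times)
    if spread > timerange and not quiet:
        # Illustrative wording only.
        return f"warning: node clocks differ by {spread:.1f} seconds"
    return None


times = {"n1": 1000.0, "n2": 1001.5, "n3": 1004.0}
print(check_timerange(times, timerange=3))
```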
All logs being played back must have been collected using the same interval as
colmux only looks at the first file/host to determine the appropriate value.
It is assumed all clocks are reasonably well synchronized as colmux uses time to
determine which data is to be displayed as a set.
All files must be in the same directory on all systems and that directory must
be included in the playback file specification
All files on a remote host must be for that host only
Run collectl on 3 nodes, showing CPU, Disk and Network statistics once a second
and sorted by column 1, which happens to be total cpu.
colmux -addr abc,def,xyz
Dynamically display top processes on nodes n1-n10 of a cluster once a second,
sorted by column 5.
colmux -addr n[1-10] -command "-sZ -i:1" -column 5
Do the same for yesterday, between the hours of 5AM and 6AM, being sure to stall
for 1/2 second between intervals. Note, if you leave off -addr you could put
all the logs into /var/log/collectl on the local host and play them back from
colmux -addr n[1-10] -command "-sZ -p/var/log/collectl/YESTERDAY -from
05:00-06:00" -column 5 -delay .5
Look at the amount of mapped and slab memory consumed on nodes n1-n10 and n15 in
real-time, every 2 seconds using single-line format. Include totals and
preface each line with the time. Since memory sizes tend to be rather large,
divide each by 1024 so we see MB rather than KB. Note that the column numbers
are always displayed in ascending order regardless of their order in -cols.
To be sure, first test the column numbers.
colmux -addr n[1-10,15] -command "-sm -i2 -oT" -cols 6,7 -coltot -test
colmux -addr n[1-10,15] -command "-sm -i2 -oT" -cols 6,7 -coltot
Display most active disks, based on KB written, on nodes n1, n4 and n5.
colmux -addr n1,n4,n5 -command "-sD" -column 6
Here is a cool trick. Collectl currently lets you look at top processes with the
--top switch and even choose a sort column by name. However, if you want to
change the column you need to exit, then rerun collectl with a different sort
column name. But if you run it like this example, you get the power of colmux
to dynamically change the sort columns with the arrow keys! You can also use
this technique to have collectl dynamically sort any local multi-line data
such as slabs or even detail data like CPU, Disk, Lustre and Networks too!
Naturally this technique works just as well when playing back data.
colmux -command "-sZ -i:1"
colmux requires passwordless ssh between the node it is running on and those it
is monitoring. Also be sure the port you are using for communications, the
default is 2655, is open.
The way colmux works is to choose an address it wants to communicate over and
start up one or more remote copies of collectl, telling them to connect back
to colmux using that address. The easiest way to see this is to run colmux
with -noesc, which tells it NOT to issue any escape sequences and therefore
not to run in full screen mode. The additional switch of -debug 1 tells it to
show the remote collectl startup command. When there is a communications
problem you will typically see 'connection timed out' messages displayed.
There are actually a couple of possibilities here, one of which is that a
firewall is preventing connections. The easiest way to test this is to run
collectl on the local machine like this: collectl -Aserver. This tells collectl
to run as a
server, listening for connections just like colmux. Then log into a remote
machine and run /usr/share/collectl/util/client.pl addr-of-server which tells
client.pl to open a socket to that copy of collectl. It should fail just like
when it was run via colmux, so try opening the firewall and try it again. If
it fixes the problem, it was indeed the firewall blocking things and colmux
should now work just fine.
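If client.pl is not at hand, a plain TCP connection attempt gives a rough indication of whether the port is reachable. The Python sketch below is not a substitute for client.pl, since it only checks that a socket can be opened; can_connect is a hypothetical helper, and 2655 is the default port mentioned above.

```python
import socket


def can_connect(address, port=2655, timeout=5.0):
    """Return True if a TCP connection to address:port can be opened."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

Run from a monitored node, can_connect("addr-of-server") returning False while collectl -Aserver is listening on that machine would point toward a firewall or an unusable interface.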
Sometimes there are multiple interfaces defined on the machine hosting colmux
and in some cases only some addresses will allow socket connections. Again,
using client.pl on the remote machine try connecting back to collectl over
different addresses and when you find one that works, tell colmux to use that
address for communication via the -retaddr switch.
This program was written by Mark Seger (firstname.lastname@example.org).
Copyright 2015 Hewlett-Packard Development Company, L.P.