watchdog - a software watchdog daemon
The Linux kernel can reset the system if serious problems are detected. This can
be implemented via special watchdog hardware, or via a slightly less reliable
software-only watchdog inside the kernel. Either way, there needs to be a
daemon that tells the kernel the system is working fine. If the daemon stops
doing that, the system is reset.
watchdog is such a daemon. It opens /dev/watchdog and keeps writing to it often
enough to keep the kernel from resetting, at least once per minute. Each write
delays the reboot time by another minute. After a minute of inactivity the
watchdog hardware will cause the reset. In the case of the software watchdog,
the ability to reboot will depend on the state of the machine and interrupts.
The watchdog daemon can be stopped without causing a reboot if the device
is closed correctly, unless your kernel is compiled with the
CONFIG_WATCHDOG_NOWAYOUT option enabled.
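The write-and-close behaviour can be sketched in a few lines of Python. This is a minimal illustration, not the daemon's actual code. The function names trigger and close_cleanly are hypothetical, and the "magic close" character V is an assumption taken from the Linux kernel's watchdog driver API rather than from this page. The device is passed in as a file object, so the sketch can be tried against an ordinary file.

```python
def trigger(dev):
    """Postpone the reset by writing to the already opened watchdog device."""
    dev.write(b"\0")   # any write counts as a sign of life
    dev.flush()        # make sure the write reaches the driver now


def close_cleanly(dev):
    """Stop the watchdog without a reboot.

    Kernels built with CONFIG_WATCHDOG_NOWAYOUT ignore this and keep the
    timer running.
    """
    dev.write(b"V")    # "magic close" character (assumption, see lead-in)
    dev.flush()
    dev.close()
```

On a real system the file object would come from open("/dev/watchdog", "wb", buffering=0). Opening the device arms the timer, so only try that on a machine you can afford to have reset.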
The watchdog daemon does several tests to check the system status:
- Is the process table full?
- Is there enough free memory?
- Is there enough allocatable memory?
- Are some files accessible?
- Have some files changed within a given interval?
- Is the average work load too high?
- Has a file table overflow occurred?
- Is a process still running? The process is specified by a pid file.
- Do some IP addresses answer to ping?
- Do network interfaces receive traffic?
- Is the temperature too high? (Temperature data is not always available.)
- Execute a user defined command to do arbitrary tests.
- Execute one or more test/repair commands found in
/etc/watchdog.d. These commands are called with the argument test
or repair.
If any of these checks fails, watchdog will cause a shutdown. Should any of
these tests, except the user defined binary, last longer than one minute, the
machine will be rebooted, too.
Available command line options are the following:
- -v, --verbose
- Set verbose mode. Only implemented if compiled with the
SYSLOG feature. This mode logs several pieces of information to
LOG_DAEMON with priority LOG_DEBUG. This is useful if you
want to see exactly what happened before the watchdog rebooted the system.
Currently it logs the temperature (if available), the load average, the
change date of the files it checks and how often it went to sleep.
- -s, --sync
- Try to synchronize the filesystem every time the process is
awake. Note that the system is rebooted if for any reason the
synchronizing lasts longer than a minute.
- -b, --softboot
- Soft-boot the system if an error occurs during the main
loop, e.g. if a given file is not accessible via the stat(2) call.
Note that this does not apply to the opening of /dev/watchdog and
/proc/loadavg, which are opened before the main loop starts. This is
now implemented by disabling the error re-try timer.
- -F, --foreground
- Run in foreground mode, useful for running under systemd.
- -f, --force
- Force the usage of the interval given or the maximal load
average given in the config file. Without this option these values are
- -c config-file, --config-file
- Use config-file as the configuration file instead of
the default /etc/watchdog.conf.
- -q, --no-action
- Do not reboot or halt the machine. This is for testing
purposes. All checks are executed and the results are logged as usual, but
no action is taken. Also your hardware card or the kernel software
watchdog driver is not enabled. NOTE: This still allows 'repair' actions
to run, but the daemon itself will not attempt a reboot.
- -X num, --loop-exit num
- Run for 'num' loops then exit as if SIGTERM was received.
Intended for test/debug (e.g. using valgrind for checking memory
access). If the daemon exits on a loop counter and you have the
CONFIG_WATCHDOG_NOWAYOUT option compiled for the kernel or
device-driver then an unplanned reboot will follow - be warned!
When watchdog starts, it puts itself into the background and then tries
all checks specified in its configuration file in turn. Between each two tests
it will write to the kernel device to prevent a reset. After finishing all
tests watchdog goes to sleep for some time. The kernel driver expects a write
to the watchdog device every minute. Otherwise the system will be reset.
watchdog will sleep for a configured interval that defaults to 1 second
to make sure it triggers the device early enough.
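The check-trigger-sleep cycle described here can be pictured with the following sketch. It is an illustration under assumptions, not the daemon's source: run_loop, the (name, check) pairs and the keepalive callable are hypothetical names, and the loop simply stops at the first failing check and leaves the reaction to the caller.

```python
import time


def run_loop(checks, keepalive, loops, interval=1.0, sleep=time.sleep):
    """Run every check in turn, triggering the device after each one.

    Returns the name of the first failing check, or None if every loop
    completed without a failure.
    """
    for _ in range(loops):
        for name, check in checks:
            if not check():
                return name      # the caller decides how to react
            keepalive()          # write to the kernel device
        sleep(interval)          # 1 second by default, as in the text
    return None
```

Passing the sleep function as a parameter keeps the sketch easy to try without waiting, for example with sleep=lambda seconds: None.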
Under high system load watchdog might be swapped out of memory and may
fail to make it back in in time. Under these circumstances the Linux kernel
will reset the machine. To avoid unnecessary reboots, make sure you have the
variable realtime set to yes in the configuration file watchdog.conf. This adds
real time support to watchdog: it will lock itself into memory and there should
be no problem even under the highest of loads.
On a system running out of memory the kernel will try to free enough memory by
killing processes. The watchdog daemon itself is exempt from this
so-called out-of-memory killer.
Also you can specify a maximal allowed load average. Once this load average is
reached the system is rebooted. You may specify maximal load averages for 1
minute, 5 minutes or 15 minutes. The default is to disable this test.
Be careful not to set this parameter too low. To set a value less than the
predefined minimal value of 2, you have to use the -f option.
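As a rough illustration of the load average test, the sketch below parses the text of /proc/loadavg and compares the three averages against optional limits. The function names are hypothetical, and two details are assumptions rather than documented behaviour: a limit of None stands for "test disabled", and "reached" is read as greater than or equal.

```python
def parse_loadavg(text):
    """Return the 1, 5 and 15 minute load averages from /proc/loadavg text."""
    fields = text.split()
    if len(fields) < 3:
        raise ValueError("loadavg contains no (or not enough) data")
    return tuple(float(field) for field in fields[:3])


def load_too_high(loads, limits):
    """True if any average has reached its limit; None disables a limit."""
    return any(limit is not None and load >= limit
               for load, limit in zip(loads, limits))
```

A caller would read the file once per loop, for example with open("/proc/loadavg").read(), and pass the result through both functions.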
You can also specify a minimal amount of virtual memory you want to have
available as free. As soon as more virtual memory is used, action is taken by
watchdog. Note, however, that watchdog does not distinguish between
different types of memory usage. It just checks for free virtual memory.
If you have a machine with temperature sensor(s) you can specify the maximal
allowed temperature. Once this temperature is reached on any sensor the system
is powered off. The default value is 90 C. Typically the temperature
information is provided by the sensors package as files in the virtual
filesystem /sys/device and can be found using, for example, the command
find /sys -name 'temp*input' -print
These files hold the temperature in milli-Celsius. You can use multiple sensors
in the config file. For example, to change the maximum to 75 C and to check two
virtual files for the system temperature, you might have this:
max-temperature = 75
temperature-sensor = /sys/class/hwmon/hwmon0/device/temp1_input
temperature-sensor = /sys/class/hwmon/hwmon0/device/temp2_input
watchdog will issue warnings once the temperature reaches 90%, 95%
and 98% of the configured maximum temperature.
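To make the thresholds concrete, here is a small sketch that classifies one sensor reading. The function name and the returned strings are hypothetical. The reading is taken in milli-Celsius, as held in the sensor files, and integer arithmetic is used so that a reading exactly on a threshold is classified predictably.

```python
WARN_PERCENT = (98, 95, 90)  # checked from the highest level down


def temperature_level(milli_celsius, max_celsius=90):
    """Classify a reading as 'too hot', a warning level, or 'ok'."""
    if milli_celsius >= max_celsius * 1000:
        return "too hot"
    for percent in WARN_PERCENT:
        # Same as: milli_celsius / 1000 >= max_celsius * percent / 100
        if milli_celsius * 100 >= max_celsius * 1000 * percent:
            return "warning %d%%" % percent
    return "ok"
```

With a maximum of 75 C, a reading of 67500 milli-Celsius is exactly 90% of the limit and yields the first warning level.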
When using file mode watchdog will try to stat(2) the given files. Errors
returned by stat will not cause a reboot. For a reboot the stat call has to
last at least the re-try time-out value (default 1 minute). This may happen
if the file is located on an NFS mounted filesystem. If your system
relies on an NFS mounted filesystem you might try this option. However, in
such a case the sync
option may not work if the NFS server is not
watchdog can read the pid from a pid file and see whether the process
still exists. If not, action is taken by watchdog. So you can, for
instance, restart the server from your repair-binary.
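A common way to perform this kind of pid file check is to send signal 0, which tests for the existence of a process without disturbing it. The sketch below shows that technique. It is an assumption about how such a check can be done, not a copy of watchdog's code, and process_alive is a hypothetical name.

```python
import os


def process_alive(pid_file):
    """Return True if the pid stored in pid_file names a live process."""
    try:
        with open(pid_file) as handle:
            pid = int(handle.read().strip())
    except (OSError, ValueError):
        return False           # unreadable or malformed pid file
    if pid <= 0:
        return False           # never signal process groups by accident
    try:
        os.kill(pid, 0)        # signal 0 only tests for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True            # exists, but belongs to another user
    return True
```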
watchdog will periodically try to fork itself to see whether the process
table is full. This leaves a zombie process until watchdog wakes
up again and catches it; this is harmless, don't worry about it.
In ping mode watchdog tries to ping the given IPv4 addresses. These
addresses do not have to belong to a single machine. It is possible to ping a
broadcast address instead to see if at least one machine in a subnet is still
Do not use this broadcast ping unless your MIS person a) knows about it
and b) has given you explicit permission to use it!
watchdog will send out three ping packets and wait up to
<interval> seconds for the reply, with <interval> being the time it
sleeps between two triggers of the watchdog device. Thus an
unreachable network will not cause a hard reset but a soft reboot.
You can also test passively for an unreachable network by just monitoring a
given interface for traffic. If no traffic arrives the network is considered
unreachable causing a soft reboot or action from the repair binary.
watchdog can run an external command for user-defined tests. A return
code not equal to 0 means an error occurred and watchdog should react. If the
external command is killed by an uncaught signal this is considered an error
by watchdog too. The command may take longer than the time slice defined for
the kernel device without a problem. However, error messages are written
to the syslog facility. If you have enabled softboot on error the machine
will be rebooted if the binary doesn't exit in half the time watchdog
sleeps between two tries triggering the kernel device.
If you specify a repair binary it will be started instead of shutting down the
system. If this binary is not able to fix the problem watchdog will
still cause a reboot afterwards.
If the machine is halted an email is sent to notify a human that the machine is
going down. Starting with version 4.4 watchdog will also notify the
human in charge if the machine is rebooted.
The re-try timer applies to most errors, except reset/reboot calls and too hot.
It allows a given error source to recover, and treats most tests in this way.
Exceptions are file handle test, load averages, and system memory. If set to
the minimum time of 1 second it will still allow a single re-try at any
polling interval of the system.
A soft reboot (i.e. controlled shutdown and reboot) is initiated for every error
that is found. Since there might be no more processes available, watchdog does
it all by itself. That means:
- Kill all processes with SIGTERM.
- After a short pause kill all remaining processes with
- Record a shutdown entry in wtmp.
- Save the random seed from /dev/urandom. If the
device is nonexistent or there is no filename for saving, this step is
skipped.
- Turn off accounting.
- Turn off quota and swap.
- Unmount all partitions except the root partition.
- Remount the root partition read-only.
- Shut down all network interfaces.
- Finally reboot.
If the return code of the check binary is not zero watchdog will assume
an error and reboot the system. Be careful with this if you are using the
real-time properties of watchdog since watchdog will wait for the
return of this binary before proceeding. An exit code smaller than 245 is
interpreted as a system error code (see errno.h for details). Values
of 245 or larger are special to watchdog:
- 255 (based on -1 as unsigned 8-bit number)
- Reboot the system. This is not exactly an error message but
a command to watchdog. If the return code is this the
watchdog will not try to run a shutdown script instead.
- 254
- Reset the system. This is not exactly an error message but
a command to watchdog. If the return code is this the
watchdog will attempt to hard-reset the machine without attempting
any sort of orderly shutdown, such as stopping processes or unmounting
file systems.
- 253
- Maximum load average exceeded.
- 252
- The temperature inside is too high.
- 251
- /proc/loadavg contains no (or not enough) data.
- 250
- The given file was not changed in the given interval.
- 249
- /proc/meminfo contains invalid data.
- 248
- Child process was killed by a signal.
- 247
- Child process did not return in time.
- 246
- Free for personal watchdog-specific use (was -10 as an
unsigned 8-bit number).
- 245
- Reserved for an unknown result, for example a slow
background test that is still running so neither a success nor an
error can be reported yet.
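The split between system error codes and watchdog-specific codes can be illustrated with a short classifier. This is only a sketch: classify_exit_code is a hypothetical helper, it uses just the boundaries stated in the text (0 for success, values below 245 as errno values, 255 as the reboot command), and it lumps the remaining special values together.

```python
import os


def classify_exit_code(code):
    """Describe a check binary's exit code in the terms used above."""
    if code == 0:
        return "ok"
    if code < 245:
        # Interpreted as a system error code, see errno.h
        return "errno: " + os.strerror(code)
    if code == (-1 & 0xFF):      # 255, i.e. -1 as an unsigned 8-bit number
        return "reboot requested"
    return "watchdog-specific code"
```

The expression -1 & 0xFF shows where 255 comes from. In the same way, -10 & 0xFF gives 246, the value set aside for personal use.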
The repair binary is started with one parameter: the error number that caused
watchdog to initiate the boot process. After trying to repair the
system the binary should exit with 0 if the system was successfully repaired
and thus there is no need to boot anymore. A return value not equal to 0 tells
watchdog to reboot. The return code of the repair binary should be the
error number of the error causing watchdog to reboot. Be careful with
this if you are using the real-time properties since watchdog will wait
for the return of this binary before proceeding.
The configuration file parameter repair-maximum controls the number of
successive repair attempts that report 0 (i.e. success) but fail to clear the
tested fault. If this is exceeded then a reboot takes place. If set to zero
then a reboot can always be blocked by the repair program reporting success.
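One way to picture the repair-maximum rule is a counter of repairs that claimed success while the fault persisted. The class below is a hypothetical sketch, not watchdog's implementation. In particular, reading "exceeded" as "strictly more than repair-maximum attempts" is an assumption.

```python
class RepairTracker:
    """Count successive repairs that returned 0 without clearing the fault."""

    def __init__(self, repair_maximum):
        self.repair_maximum = repair_maximum
        self.attempts = 0

    def test_passed(self):
        """The fault is gone, so the streak of useless repairs ends."""
        self.attempts = 0

    def repair_reported_success(self):
        """Record a repair that returned 0; return True if a reboot is due."""
        self.attempts += 1
        if self.repair_maximum == 0:
            return False         # zero: success always blocks the reboot
        return self.attempts > self.repair_maximum
```

With repair_maximum set to 2, the first two "successful" repairs are tolerated and the third one triggers the reboot, unless the test passes in between.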
Executables placed in the test directory are discovered by watchdog on startup
and are automatically executed. They are bounded time-wise by the test-timeout
directive in watchdog.conf.
These executables are called with either "test" as the first argument
(if a test is being performed) or "repair" as the first argument (if
a repair for a previously-failed "test" operation is being
performed).
As with test binaries and repair binaries, the expected exit code for a
successful test or repair operation is always zero.
If an executable's test operation fails, the same executable is automatically
called with the "repair" argument as well as the return code of the
previously-failed test operation.
For example, if the following execution returns 42:
/etc/watchdog.d/my-test test
The watchdog daemon will attempt to repair the problem by calling:
/etc/watchdog.d/my-test repair 42
This enables administrators and application developers to make intelligent
test/repair commands. If the "repair" operation is not required (or
is not likely to succeed), it is important that the author of the command
return a non-zero value so the machine will still reboot as expected.
Note that the watchdog daemon may interpret and act upon any of the reserved
return codes noted in the Check Binary section prior to calling a given
command in "repair" mode.
As for the repair binary, the configuration parameter repair-maximum
controls the number of successive repair attempts that report success (return
0) but fail to clear the fault.
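A test/repair command for the test directory could be structured as in the following sketch. Everything specific in it is hypothetical: the marker file path, the error code 42 (borrowed from the example above) and the "repair" of recreating the file only stand in for a real check. The point is the dispatch on the first argument and the rule that an unrepairable fault must keep a non-zero exit code.

```python
#!/usr/bin/env python3
import os
import sys

MARKER = "/tmp/my-service.ok"   # hypothetical resource being checked
ERR_MISSING = 42                # arbitrary non-zero code for this fault


def main(argv, marker=MARKER):
    """Return the exit code for 'test' or 'repair <code>' invocations."""
    if len(argv) >= 1 and argv[0] == "test":
        return 0 if os.path.exists(marker) else ERR_MISSING
    if len(argv) >= 2 and argv[0] == "repair":
        code = int(argv[1])
        if code == ERR_MISSING:
            open(marker, "w").close()   # the "repair": recreate the file
            return 0
        return code     # cannot repair this fault: stay non-zero
    return 1            # unknown invocation


# watchdog always passes "test" or "repair", so only act when called that way.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1:]))
```

Keeping the logic in main() makes the command easy to exercise by hand before installing it, for instance by calling main(["test"]) from an interactive session.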
No bugs are known so far.
The original code is an example written by Alan Cox
<firstname.lastname@example.org>, the author of the kernel driver. All
additions were written by Michael Meskes <email@example.com>. Johnie
Ingram <firstname.lastname@example.org> had the idea of testing the load average. He
also took over the Debian specific work. Dave Cinege
<email@example.com> brought up some hardware watchdog issues and
helped testing this stuff.
- /dev/watchdog
- The watchdog device.
- The pid file of the running watchdog.