======Watchdog on WRAP======
======Introduction======
A hardware watchdog is a highly reliable piece of hardware with a single function: it resets the system if it has not received a specific input within a specific amount of time. Giving that input to the watchdog is called 'petting the watchdog'.
The WRAP board Geode processor has a built-in hardware watchdog.
A watchdog is typically used to keep hard-to-reach standalone systems from crashing (hanging) and becoming useless. Such systems are usually very reliable and are never looked after; a properly configured Linux system can run unmaintained for years without crashing. In the rare case that something does crash, however, a watchdog can bring the system back to life by rebooting it.
Note that a watchdog cannot revive a system whose hardware has become defective.
======Watchdog device driver======
Linux provides a device driver called wd1100:
voyage:~# lsmod
Module Size Used by
<...>
wd1100 4560 1
======Watchdog timeout======
The device driver sets the watchdog timeout. The timeout can be configured as follows:
root@nehemiah:/etc# cat /etc/modules
<...>
wd1100 sysctl_wd_graceful=0 sysctl_wd_timeout=60
Software must pet the watchdog through the device driver (accessible as /dev/watchdog or /dev/wd), over and over again, each time before the configured timeout expires.
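As a rough illustration of what petting looks like from software, the sketch below writes to a watchdog device at a fixed interval. The function name pet_watchdog and its three arguments are made up for this example. It assumes that any write to the device counts as a pet, as with the generic Linux watchdog interface; the wd1100 driver may behave differently.

```shell
#!/bin/bash
# pet_watchdog DEVICE INTERVAL_SECONDS COUNT   (hypothetical helper)
# Opens DEVICE once and writes one byte to it COUNT times,
# sleeping INTERVAL_SECONDS after each write.
pet_watchdog() {
    local device=$1 interval=$2 count=$3 i
    {
        for ((i = 0; i < count; i++)); do
            printf '.'          # assumption: any write resets the timer
            sleep "$interval"
        done
    } >"$device"                # the device stays open for the whole loop
}

# Example (not executed here): pet every 10 seconds, 6 times.
#   pet_watchdog /dev/watchdog 10 6
```

Depending on the driver settings, the watchdog may stay armed after the loop ends and the device is closed, so try this only on a system that may be rebooted.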
======Watchdog Daemon======
Under (Debian) Linux, a watchdog daemon exists that pets the watchdog, but only if a set of conditions is met.
root@nehemiah:/etc# ls -al /usr/sbin/watchdog
-rwxr-xr-x 1 root root 76568 May 19 2005 /usr/sbin/watchdog
The conditions that must be met typically are:
* temperature within bounds
* system load within bounds
* network reachable
* memory available
* ...
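The idea of petting only when every condition holds can be sketched in a few lines of shell. This is not how the watchdog daemon is implemented; the helper name all_checks_pass is hypothetical, and each check is simply a command that exits with 0 on success.

```shell
#!/bin/bash
# all_checks_pass CHECK...   (hypothetical helper, not part of the watchdog package)
# Runs each argument as a command and returns 0 only if every one exits with 0.
all_checks_pass() {
    local check
    for check in "$@"; do
        # $check is deliberately unquoted so "cmd arg" is split into words.
        if ! $check >/dev/null 2>&1; then
            echo "check failed: $check" >&2
            return 1
        fi
    done
    return 0
}

# Example (not executed here): pet only if the file exists and the script passes.
#   all_checks_pass "test -e /etc/checktemp.sh" /etc/checktemp.sh && echo "would pet now"
```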
======Watchdog Configuration======
The Linux watchdog daemon can be configured to evaluate these conditions periodically and to pet the watchdog only when everything is OK. The default configuration file is /etc/watchdog.conf; an example looks as follows:
ping = 66.249.93.99 # host that must be reachable at all times (www.google.nl)
interface = eth0 # interface that must receive network traffic
file = /etc/checktemp.sh # binary that must always exist
min-memory = 256 # minimum free memory in pages: 256 pages * 4 kB = 1 MB
test-binary = /etc/checktemp.sh # binary that must always return 0
realtime = yes # schedule the watchdog in real-time
priority = 1 # high-priority
For more options, see ''man watchdog'' and ''man watchdog.conf''.
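The min-memory value in the example above is given in memory pages, not bytes. A small helper can do the conversion; the name mb_to_pages is hypothetical, and the page size falls back to whatever ''getconf PAGESIZE'' reports when it is not passed explicitly.

```shell
#!/bin/bash
# mb_to_pages MEGABYTES [PAGE_SIZE_BYTES]   (hypothetical helper)
# Prints how many memory pages make up the given number of megabytes.
mb_to_pages() {
    local mb=$1
    local page_size=${2:-$(getconf PAGESIZE)}   # default: page size of this system
    echo $(( mb * 1024 * 1024 / page_size ))
}

# Example (not executed here): 1 MB in 4 kB pages gives the 256 used above.
#   mb_to_pages 1 4096
```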
======Temperature Monitoring======
I have written a small script checktemp.sh that returns 0 if the WRAP board is not too hot:
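#!/bin/bash
# Exit 0 while the board temperature is acceptable, non-zero when it is too hot.
TEMP_MAX=44 # maximum allowed temperature in degrees
TEMP_FILE=/sys/bus/i2c/devices/0-0048/temp1_input
if [ ! -f "$TEMP_FILE" ]; then
echo "Cannot find temperature reading in $TEMP_FILE. Maybe modprobe lm77."
# No reading available: report success so the watchdog does not reboot.
exit 0
fi
# The sensor reports millidegrees; round to the nearest whole degree.
TEMP_MC=$(cat "$TEMP_FILE")
TEMP=$(( (TEMP_MC + 500) / 1000 ))
if [ "$TEMP" -gt "$TEMP_MAX" ]; then
echo "$0 reads a temperature of $TEMP degrees; too hot. Returning -4 (see man watchdog)"
# Exit statuses are 8 bits wide, so -4 is reported as 252.
exit 252
fi
exit 0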
Check whether the first 'exit 0' suits you: it makes the check succeed when no temperature reading is available, so a missing sensor never causes a reboot. Also check whether the maximum temperature suits your needs.
  * User note: I assume this has something to do with different WRAP hardware, but I found my temperature at /sys/bus/i2c/devices/1-0048/temp1_input.
  * Alix user note: I don't know how all this works in detail, but on my Alix there is a /sys/bus/i2c/devices/xxx/temp1_max, so it might be better to use that value instead of a fixed TEMP_MAX=44. There is a temp2 as well; perhaps both should be checked.
* Alix sensor is at /sys/bus/i2c/devices/0-004c/temp1_input
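The user notes above suggest two refinements: finding the sensor file instead of hard-coding its path, and taking the limit from temp1_max when the driver provides one. The following is a sketch only; the helper names find_temp_input and max_temp_degrees are hypothetical, and it assumes that temp1_max, like temp1_input, holds millidegrees.

```shell
#!/bin/bash
# find_temp_input [I2C_ROOT]   (hypothetical helper)
# Prints the first temp1_input file found below the I2C device directory.
find_temp_input() {
    local root=${1:-/sys/bus/i2c/devices} f
    for f in "$root"/*/temp1_input; do
        if [ -f "$f" ]; then
            echo "$f"
            return 0
        fi
    done
    return 1    # no sensor found
}

# max_temp_degrees TEMP_INPUT_FILE DEFAULT_DEGREES   (hypothetical helper)
# Prints the limit from the temp1_max file next to the reading, rounded to
# whole degrees, or DEFAULT_DEGREES when no such file exists.
max_temp_degrees() {
    local max_file=${1%_input}_max    # .../temp1_input -> .../temp1_max
    if [ -f "$max_file" ]; then
        echo $(( ($(cat "$max_file") + 500) / 1000 ))
    else
        echo "$2"
    fi
}

# Example (not executed here):
#   TEMP_FILE=$(find_temp_input) && TEMP_MAX=$(max_temp_degrees "$TEMP_FILE" 44)
```

Whether the limit stored in temp1_max is a sensible reboot threshold depends on the board, so it is worth reading the value once by hand before relying on it.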
====== Final Words ======
* This was written by Leon Woestenberg, inspired by the [[pxe_voyage|PXE Wiki page by Assaf Gordon]]. I copied parts of his Wiki page formatting.
* If you find any mistakes, correct them (this is a Wiki, after all).
* This is free documentation. It is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.
  * Some of the commands in this guide must be executed as ''root'' - **this might damage your system!** \\ Use at your own risk.