======Watchdog on WRAP======

======Introduction======

A hardware watchdog is a highly reliable piece of hardware with a single function: it resets the system if it has not received a specific input within a specific amount of time. Giving this input to the watchdog is called 'petting the watchdog'. The Geode processor on the WRAP board has a built-in hardware watchdog.

A watchdog is typically used to keep hard-to-reach standalone systems from staying crashed (hung) and becoming useless. Such systems are usually very reliable and rarely looked after; a properly configured Linux system can run unmaintained for years without crashing. However, in the rare case that something does crash, a watchdog can bring the system back to life by rebooting it. Note that a watchdog does not help with systems whose hardware has become defective.

======Watchdog device driver======

Linux provides a device driver called wd1100:

  voyage:~# lsmod
  Module                  Size  Used by
  <...>
  wd1100                  4560  1

======Watchdog timeout======

The device driver sets the watchdog timeout. The timeout can be configured as follows:

  root@nehemiah:/etc# cat /etc/modules
  <...>
  wd1100 sysctl_wd_graceful=0 sysctl_wd_timeout=60

Software must pet the watchdog through the device driver (accessible as /dev/watchdog or /dev/wd) within the specified timeout, again and again.

======Watchdog Daemon======

Under (Debian) Linux, a watchdog daemon exists that pets the watchdog, but only if a set of conditions is met.

  root@nehemiah:/etc# ls -al /usr/sbin/watchdog
  -rwxr-xr-x 1 root root 76568 May 19 2005 /usr/sbin/watchdog

The conditions that must be met typically are:

  * temperature within bounds
  * system load within bounds
  * network reachable
  * memory available
  * ...

======Watchdog Configuration======

The Linux watchdog daemon can be configured to evaluate these conditions periodically and to pet the watchdog only when everything is OK.
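The idea of petting only when every condition holds can be sketched in a few lines of shell. This is not how the watchdog daemon is implemented; it is a minimal illustration that assumes the driver follows the usual Linux convention that any write to the watchdog device restarts the timeout. The names WATCHDOG_DEV, all_checks_pass, pet_watchdog and pet_if_healthy are made up for this example. The sketch also reopens the device for every write, and how a driver treats the close in between may differ from driver to driver.

```shell
#!/bin/bash
# Hypothetical sketch: pet the watchdog only while every check succeeds.
# All names below are illustrative and not part of the watchdog package.

# Device to write to; can be overridden from the environment.
WATCHDOG_DEV="${WATCHDOG_DEV:-/dev/watchdog}"

# Run every command given as an argument; succeed only if all of them do.
all_checks_pass() {
    local check
    for check in "$@"; do
        if ! "$check" >/dev/null 2>&1; then
            return 1
        fi
    done
    return 0
}

# Write a single character to the device (assumed to restart the timeout).
pet_watchdog() {
    printf '.' > "$WATCHDOG_DEV"
}

# One evaluation round: returns 1 without petting if any check fails.
pet_if_healthy() {
    all_checks_pass "$@" || return 1
    pet_watchdog
}
```

A loop such as ''while pet_if_healthy /etc/checktemp.sh; do sleep 10; done'' would then keep petting until a check fails, after which the hardware resets the board once the timeout expires. The sleep interval has to stay well below the configured timeout.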
The default configuration file is /etc/watchdog.conf. An example looks as follows:

  ping        = 66.249.93.99     # host that must be reachable at all times (www.google.nl)
  interface   = eth0             # interface that must receive network traffic
  file        = /etc/checktemp.sh # binary that must always exist
  min-memory  = 256              # minimum free memory in pages: 256 * 4 kByte page = 1 MByte
  test-binary = /etc/checktemp.sh # binary that must always return 0
  realtime    = yes              # schedule the watchdog in real-time
  priority    = 1                # high-priority

For more features, see man watchdog.

======Temperature Monitoring======

I have written a small script, checktemp.sh, that returns 0 if the WRAP board is not too hot:

  #!/bin/bash
  TEMP_MAX=44
  TEMP_FILE=/sys/bus/i2c/devices/0-0048/temp1_input
  # No sensor reading available: report success rather than forcing a reboot.
  if [ ! -f "$TEMP_FILE" ]; then
    echo "Cannot find temperature reading in $TEMP_FILE. Maybe modprobe lm77."
    exit 0
  fi
  # temp1_input is in millidegrees Celsius; round to whole degrees.
  TEMP_MC=$(cat "$TEMP_FILE")
  TEMP=$(( (TEMP_MC + 500) / 1000 ))
  if [ "$TEMP" -gt "$TEMP_MAX" ]; then
    # Double quotes so that ${0} and $TEMP are expanded in the message.
    echo "${0} reads a temperature of $TEMP degrees; too hot. Returning -4 (see man watchdog)"
    exit -4
  fi
  exit 0

Please check whether the first 'exit 0' (which reports success when no temperature reading is found) is what you want, and whether the maximum temperature suits your needs.

  * User note: I assume this has something to do with differences in WRAP hardware, but I found my temperature at /sys/bus/i2c/devices/1-0048/temp1_input.
  * Alix user note: I don't know how all this works in detail, but on my Alix there is a /sys/bus/i2c/devices/xxx/temp1_max, so maybe it would be better to use that value instead of a fixed TEMP_MAX=44. (By the way, there is a temp2 as well; check both?)
  * The Alix sensor is at /sys/bus/i2c/devices/0-004c/temp1_input

====== Final Words ======

  * This was written by Leon Woestenberg, inspired by the [[pxe_voyage|PXE Wiki page by Assaf Gordon]]. I copied parts of his Wiki page formatting.
  * If you find any mistakes, correct them (this is a Wiki, after all).
  * This is free documentation. It is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.
  * Some of the commands in this guide should be executed as ''root'' - **this might damage your system!** \\ Use at your own risk.