.TH WATCHDOG 8 "January 2005"
.UC 4
.SH NAME
watchdog \- a software watchdog daemon
.SH SYNOPSIS
.B watchdog
.RB [ \-f | \-\-force ]
.RB [ \-c " \fIfilename\fR|" \-\-config\-file " \fIfilename\fR]"
.RB [ \-v | \-\-verbose ]
.RB [ \-s | \-\-sync ]
.RB [ \-b | \-\-softboot ]
.RB [ \-q | \-\-no\-action ]
.SH DESCRIPTION
The Linux kernel can reset the system if serious problems are detected.
This can be implemented via special watchdog hardware, or via a slightly
less reliable software-only watchdog inside the kernel. Either way, there
needs to be a daemon that tells the kernel the system is working fine. If the
daemon stops doing that, the system is reset.
.PP
.B watchdog
is such a daemon. It opens
.IR /dev/watchdog ,
and keeps writing to it often enough to keep the kernel from resetting,
at least once per minute. Each write delays the reboot
time by another minute. After a minute of inactivity the watchdog hardware will
cause the reset. In the case of the software watchdog the ability to
reboot will depend on the state of the machine and its interrupts.
.PP
The watchdog daemon can be stopped without causing a reboot if the device
.I /dev/watchdog
is closed correctly, unless your kernel is compiled with the
.I CONFIG_WATCHDOG_NOWAYOUT
option enabled.
.SH TESTS
The watchdog daemon does several tests to check the system status:
.IP \(bu 3
Is the process table full?
.IP \(bu 3
Is there enough free memory?
.IP \(bu 3
Are some files accessible?
.IP \(bu 3
Have some files changed within a given interval?
.IP \(bu 3
Is the average work load too high?
.IP \(bu 3
Has a file table overflow occurred?
.IP \(bu 3
Is a process still running? The process is specified by a pid file.
.IP \(bu 3
Do some IP addresses answer to ping?
.IP \(bu 3
Do network interfaces receive traffic?
.IP \(bu 3
Is the temperature too high? (Temperature data is not always available.)
.IP \(bu 3
Execute a user defined command to do arbitrary tests.
.PP
If any of these checks fails,
.B watchdog
will cause a shutdown.
Should any of these tests except the user defined binary last longer than one
minute, the machine will be rebooted, too.
.SH OPTIONS
Available command line options are the following:
.TP
.BR \-v ", " \-\-verbose
Set verbose mode. Only implemented if compiled with the
.I SYSLOG
feature. In this mode several pieces of information are logged to
.I LOG_DAEMON
with priority
.IR LOG_INFO .
This is useful if you want to see exactly what happened until the watchdog rebooted
the system. Currently it logs the temperature (if available), the load
average, the change date of the files it checks and how often it went to sleep.
.TP
.BR \-s ", " \-\-sync
Try to synchronize the filesystem every time the process is awake. Note that
the system is rebooted if for any reason the synchronizing lasts longer
than a minute.
.TP
.BR \-b ", " \-\-softboot
Soft-boot the system if an error occurs during the main loop, e.g. if a
given file is not accessible via the
.BR stat (2)
call. Note that
this does not apply to the opening of
.I /dev/watchdog
and
.IR /proc/loadavg ,
which are opened before the main loop starts.
.TP
.BR \-f ", " \-\-force
Force the usage of the interval given or the maximal load average given
in the config file.
.TP
.BR \-c " \fIconfig-file\fR, " \-\-config\-file " \fIconfig-file\fR"
Use
.I config-file
as the configuration file instead of the default
.IR /etc/watchdog.conf .
.TP
.BR \-q ", " \-\-no\-action
Do not reboot or halt the machine. This is for testing purposes. All checks
are executed and the results are logged as usual, but no action is taken.
Also your hardware card or the kernel software watchdog driver is not
enabled. Temperature checking is also disabled since this triggers
the hardware watchdog on some cards.
.SH FUNCTION
After
.B watchdog
starts, it puts itself into the background and then tries all checks
specified in its configuration file in turn. Between every two tests it will write to
the kernel device to prevent a reset. After finishing all tests
.B watchdog
goes to sleep for some time.
The kernel driver expects a write to the watchdog device every minute.
Otherwise the system will be reset. By default
.B watchdog
will sleep for
only 10 seconds so it triggers the device early enough.
.PP
Under high system load
.B watchdog
might be swapped out of memory and may fail
to make it back in time. Under these circumstances the Linux kernel will
reset the machine. To avoid unnecessary reboots make
sure you have the variable
.I realtime
set to
.I yes
in the configuration file
.IR watchdog.conf .
This adds real time support to
.BR watchdog :
it will lock
itself into memory and there should be no problem even under the highest of
loads.
.PP
You can also specify a maximal allowed load average. Once this load average
is reached the system is rebooted. You may specify maximal load averages for
1 minute, 5 minutes or 15 minutes. The default is to disable this
test. Be careful not to set this parameter too low. To set a value less than
the predefined minimal value of 2, you have to use the
.B \-f
option.
.PP
You can also specify a minimal amount of virtual memory you want to have
available as free. As soon as more virtual memory is used action is taken by
.BR watchdog .
Note, however, that
.B watchdog
does not distinguish between
different types of memory usage. It just checks for free virtual memory.
.PP
If you have a watchdog card with a temperature sensor you can specify the
maximal allowed temperature. Once this temperature is reached the
system is halted. The default value is 120. There is no unit conversion so make
sure you use the same unit as your hardware.
.B watchdog
will issue warnings
once the temperature exceeds 90%, 95% and 98% of this value.
.PP
When using file mode
.B watchdog
will try to
.BR stat (2)
the given files. Errors returned
by stat will
.B not
cause a reboot. For a reboot the stat call has to last at least one minute.
This may happen if the file is located on an NFS mounted filesystem.
If your
system relies on an NFS mounted filesystem you might try this option.
However, in such a case the
.I sync
option may not work if the NFS server is
not answering.
.PP
.B watchdog
can read the pid from a pid file and see whether the process still exists.
If not, action is taken
by
.BR watchdog .
So you can for instance restart the server from your
.IR repair-binary .
.PP
.B watchdog
will periodically try to fork itself to see whether the process
table is full. This will leave a zombie process until
.B watchdog
wakes
up again and catches it; this is harmless, don't worry about it.
.PP
In ping mode
.B watchdog
tries to ping the given IP addresses. These addresses do
not have to be a single machine. It is possible to ping a broadcast
address instead to see if at least one machine in a subnet is still alive.
.PP
.B Do not use this broadcast ping unless your MIS person a) knows about it and
.B b) has given you explicit permission to use it!
.PP
.B watchdog
will send out three ping packets and wait up to <interval> seconds
for the reply, with <interval> being the time it goes to sleep between two
times triggering the watchdog device. Thus an unreachable network will not
cause a hard reset but a soft reboot.
.PP
You can also test passively for an unreachable network by just monitoring
a given interface for traffic. If no traffic arrives the network is
considered unreachable, causing a soft reboot or action from the repair binary.
.PP
.B watchdog
can run an external command for user-defined tests. A return code
not equal to 0 means an error occurred and
.B watchdog
should react. If the external
command is killed by an uncaught signal this is considered an error by
.B watchdog
too.
The command may take longer than the time slice defined for the kernel device
without a problem. However, error messages are
generated into the syslog facility.
If you have enabled softboot on error
the machine will be rebooted if the binary doesn't exit in half the time
.B watchdog
sleeps between two tries triggering the kernel device.
.PP
If you specify a repair binary it will be started instead of shutting down
the system. If this binary is not able to fix the problem
.B watchdog
will still cause a reboot afterwards.
.PP
If the machine is halted an email is sent to notify a human that
the machine is going down. Starting with version 4.4
.B watchdog
will also notify the human in charge if the machine is rebooted.
.SH "SOFT REBOOT"
A soft reboot (i.e. controlled shutdown and reboot) is initiated for every
error that is found. Since there might be no more processes available,
.B watchdog
does it all by itself. That means:
.IP 1. 4
Kill all processes with SIGTERM.
.IP 2. 4
After a short pause kill all remaining processes with SIGKILL.
.IP 3. 4
Record a shutdown entry in wtmp.
.IP 4. 4
Save the random seed from
.IR /dev/urandom .
If the device is nonexistent or
there is no filename for saving, this step is skipped.
.IP 5. 4
Turn off accounting.
.IP 6. 4
Turn off quota and swap.
.IP 7. 4
Unmount all partitions except the root partition.
.IP 8. 4
Remount the root partition read-only.
.IP 9. 4
Shut down all network interfaces.
.IP 10. 4
Finally reboot.
.SH "CHECK BINARY"
If the return code of the check binary is not zero
.B watchdog
will assume an
error and reboot the system. Be careful with this if you are using the
real-time properties of
.B watchdog
since it will wait for the return of
this binary before proceeding. A positive exit code is interpreted as a
system error code (see
.I errno.h
for details). Negative values are special to
.BR watchdog :
.TP
\-1
Reboot the system. This is not exactly an error message but a command to
.BR watchdog .
If the return code is \-1
.B watchdog
will not try to run a shutdown
script instead.
.TP
\-2
Reset the system. This is not exactly an error message but a command to
.BR watchdog .
If the return code is \-2
.B watchdog
will simply refuse to write the
kernel device again.
.TP
\-3
Maximum load average exceeded.
.TP
\-4
The temperature inside is too high.
.TP
\-5
.I /proc/loadavg
contains no (or not enough) data.
.TP
\-6
The given file was not changed in the given interval.
.TP
\-7
.I /proc/meminfo
contains invalid data.
.TP
\-8
Child process was killed by a signal.
.TP
\-9
Child process did not return in time.
.TP
\-10
Free for personal use.
.SH "REPAIR BINARY"
The repair binary is started with one parameter: the error number that
caused
.B watchdog
to initiate the reboot process. After trying to repair the
system the binary should exit with 0 if the system was successfully repaired
and thus there is no need to reboot anymore. A return value not equal to 0 tells
.B watchdog
to reboot. The return code of the repair binary should be the error
number of the error causing
.B watchdog
to reboot. Be careful with this if you
are using the real-time properties since
.B watchdog
will wait for
the return of this binary before proceeding.
.SH BUGS
None known so far.
.SH AUTHORS
The original code is an example written by Alan Cox
<alan@lxorguk.ukuu.org.uk>, the author of the kernel driver. All
additions were written by Michael Meskes <meskes@debian.org>. Johnie Ingram
<johnie@netgod.net> had the idea of testing the load average. He also took
over the Debian specific work. Dave Cinege <dcinege@psychosis.com> brought
up some hardware watchdog issues and helped with testing.
.SH FILES
.TP
.I /dev/watchdog
The watchdog device.
.TP
.I /var/run/watchdog.pid
The pid file of the running
.BR watchdog .
.SH "SEE ALSO"
.BR watchdog.conf (5)
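.SH EXAMPLES
The error numbers listed under CHECK BINARY and the contract described under
REPAIR BINARY can be sketched in Python as follows. This is a minimal
illustration, not code shipped with
.BR watchdog :
the names SPECIAL_ERRORS, describe_error and repair_exit_status are
hypothetical, and it assumes the error number reaches the repair binary as a
decimal string in its first argument.
.PP
.nf
```python
import os

# Special (negative) error numbers, copied from the CHECK BINARY table.
SPECIAL_ERRORS = {
    -1: "reboot requested",
    -2: "reset requested",
    -3: "maximum load average exceeded",
    -4: "temperature too high",
    -5: "/proc/loadavg contains no (or not enough) data",
    -6: "file was not changed in the given interval",
    -7: "/proc/meminfo contains invalid data",
    -8: "child process was killed by a signal",
    -9: "child process did not return in time",
    -10: "free for personal use",
}


def describe_error(code: int) -> str:
    """Return a human-readable description of a watchdog error number."""
    if code == 0:
        return "no error"
    if code < 0:
        # Negative values are special to watchdog.
        return SPECIAL_ERRORS.get(code, f"unknown watchdog error {code}")
    # Positive values are system error codes (see errno.h).
    return os.strerror(code)


def repair_exit_status(code: int, repaired: bool) -> int:
    """Return 0 after a successful repair, else the original error number."""
    return 0 if repaired else code
```
.fi
.PP
A repair script built on this sketch might read the error number with
int(sys.argv[1]), log describe_error(code), attempt its fix, and finish with
sys.exit(repair_exit_status(code, repaired)). Note that the operating system
keeps only the low 8 bits of an exit status, so a negative value is seen by
the parent process as a number between 246 and 255.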