INFODOC ID: 17152

SYNOPSIS: watchdog FAQ

DETAIL DESCRIPTION:

watchdog FAQ

Who is this document for?

This document is intended for customers whose systems are dropping to the ok prompt because of a watchdog reset or for some other reason.

What is the purpose of the document?

This document describes how to collect the information that Sun Customer Services needs in order to diagnose the reason for the system behavior. It mentions infodocs and srdbs, which are available to contract customers via the SunSolve web site: http://sunsolve.sun.com/ Technical Support Engineers at Sun can also send any of the documents you request.

What is a watchdog reset?

A watchdog reset is an unrecoverable situation that forces the CPU to reset. It occurs when the machine traps while it is already handling a trap, at a time when the "Enable Traps" bit in the Processor Status Register (PSR) is cleared. Traps are disabled at that point because no other trap should occur until the first trap has been handled. Because a second trap has occurred and the cpu cannot handle it, the machine resets.

Are there any other reasons that a system would drop to the ok prompt?

There are several other reasons. First, if the system receives a break via the console (because Stop-A was typed, because the keyboard was unplugged and replugged on a regular console, or because a break was sent from a tty console), it will halt and produce the ok prompt. We recommend that this be attempted on a hung (unresponsive) system. Second, a kernel feature known as a deadman timer can be enabled in an effort to diagnose a hung system. If it is enabled, the system will drop to the ok prompt when it hangs.

Is a watchdog reset the same as a system panic?

No.
On a system panic, the system saves the kernel context to the system's swap disk and then sets a flag indicating that there is a crash dump before it reboots. If savecore is enabled, the crash dump is recovered during the reboot; otherwise, savecore can be run manually shortly after the reboot. On a watchdog reset, minimal information is saved, and then the system simply halts.

What happens when a system gets a watchdog?

The behavior of the system after a watchdog reset is determined by the value of the watchdog-reboot? prom variable. To see the value on a running system, use the eeprom command. From the ok prompt, use the printenv command. The default value (shown here as output from the eeprom command) is false:

    watchdog-reboot?=false

With this value, the system stays at the ok prompt after a watchdog reset. If watchdog-reboot? is set to true, the system reboots automatically. If a system is rebooting for no discernible reason, we advise checking the value of this parameter and setting it to false if it is true. If the system has been experiencing watchdog resets, this will allow the collection of useful data the next time one happens.

Is the procedure for dealing with watchdogs the same for all Sun systems?

No. Some of the commands work on all systems, and others are only relevant to certain architectures and configurations. You can determine the architecture of a running system by using the command uname -a and observing the fifth field returned, which should be sun4, sun4d, sun4m, sun4u, etc.

You can determine whether a system is a multi-processor (MP) system by using the mpstat command. If it returns just one line, it is a single-cpu system. Otherwise, it is a multi-processor system, with each cpu represented by one line of output. On an MP system, the prom prompt includes an indication of which cpu experienced the halt, for example <#2>, which indicates cpu2. Please write down the number, as it can be helpful in identifying which cpu to replace if the cause is found to be a defective one.

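The checks described above (the watchdog-reboot? value, the architecture, and the number of cpus) can be sketched as small shell helpers that interpret output captured from a running system. This is only an illustration: the function names are hypothetical, the sample values in the comments are made up, and the cpu count assumes that mpstat prints a single header line whose first field is "CPU" followed by one line per cpu.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers for interpreting command output.

# Print the value part of an eeprom line such as "watchdog-reboot?=false".
watchdog_reboot_value() {
    printf '%s\n' "$1" | sed -n 's/^watchdog-reboot?=//p'
}

# Print the fifth whitespace-separated field of a `uname -a` line,
# which the text above identifies as the architecture.
arch_from_uname() {
    printf '%s\n' "$1" | awk '{ print $5 }'
}

# Count cpus in mpstat output read from stdin.
# Assumption: one header line starting with "CPU", then one line per cpu.
cpu_count_from_mpstat() {
    awk '$1 != "CPU" && NF > 0 { n++ } END { print n + 0 }'
}

# Possible use on a running system:
#   watchdog_reboot_value "$(eeprom 'watchdog-reboot?')"
#   arch_from_uname "$(uname -a)"
#   mpstat | cpu_count_from_mpstat
```

The helpers take text as input instead of calling the commands themselves, so they can be tried on saved output from another machine. If the first helper prints true, the system will reboot after a watchdog reset instead of staying at the ok prompt.
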
Is there any way to tell for certain if a watchdog reset has occurred?

Systems with a sun4d or sun4u architecture have a command called prtdiag. It is usually not in the default command path, so if you do not know where to find it, run man prtdiag to find the path. prtdiag -v displays configuration data, followed by the time of the last watchdog reset if one has occurred.

What should be done when the system has dropped to the ok prompt?

Some commands should be run to capture the state of the system, and then the sync command should be used to force a panic and create a crash dump on the reboot. Because the system is not running while it is at the ok prompt, the output of the commands will not be saved. You must write down the results, or capture them with a tip session connected to the serial console port. Infodoc 15085 explains how to configure a tip session.

What commands should be typed from the ok prompt?

The commands are described below. There is a feature called obpsym which, when enabled, allows some of the commands to provide symbolic information that makes interpretation by Sun Customer Services easier (and probably faster). If you do not know how to enable it, ask someone at Sun to send you internal infodoc 15876.

Commands that work on all systems:

(To load obpsym, the OpenBoot PROM symbols module, run modload /platform/$(uname -m)/kernel/misc/obpsym on the running system. This is needed if ctrace is not available. Also add this line to /etc/system: forceload: misc/obpsym)

.registers
    Displays the internal registers of the current cpu.

.locals
    Displays the registers in the current register window.

ctrace
    Displays the kernel stack. If obpsym is enabled, the output includes useful symbolic information. If not, it produces numbers which must be interpreted in conjunction with a crash dump. This is the single most useful command.

System-specific commands:

.psr
    Only available on systems supporting the SPARC V8 architecture. If you're not certain, try it.
    Prints the Processor Status Register in a readable format.

wd-dump
    Only available on the sun4d architecture. Displays watchdog data, including the program counter of the instruction that caused the crash.

What should be done after the commands are typed, and the results recorded?

Note (added by Matt Baker): Don't do a sync. It usually hangs the system, and you will then need to power cycle the system. Do a "boot" instead. When the system comes up, it will do a savecore, unless you have disabled savecore, in which case you can run savecore manually once the system is up (do this as soon as possible).

Type sync, which should cause a panic and a reboot. When the system has rebooted, check for a crash dump. If savecore was enabled, the path in /etc/init.d/sysetup should say where to find the corefiles. If savecore was not enabled, manually run savecore -v to produce the corefiles in that directory. This must be done before system activity causes the data on the swap disk to be overwritten. For details on savecore configuration, see infodoc 12031 (for Solaris 2.x only), infodoc 11827 (for SunOS 4.x only), and infodoc 14230 (for both operating systems).

After savecore has been done, contact Sun Customer Services to have someone collect the data and analyze it.

PRODUCT AREA: Kernel
PRODUCT: crash
SUNOS RELEASE: Solaris 2.x
HARDWARE: any

Copyright 1994-1998 Sun Microsystems, Inc., 901 San Antonio Road, Palo Alto, CA 94303 USA. All rights reserved.