################################################################################
#Configuring Kdump on Linux
#
#Covers: RHEL 5,6,7
#
#Version 2017.03.12.0001
################################################################################

################################################################################
#TOC
#Need/issue
#Caveats
#When a kdump is activated
#KDUMP config: READMEs
#KDUMP config: BEGIN
#Caveats
#Managing a crash/hang
#Crash commands
#Miscellaneous
#Bibliography
################################################################################

################################################################################
#Need/issue
################################################################################
The kdump procedure

The received warning means the kdump operation might fail and the crashkernel
parameter should be configured correctly.

Note: Not reserving enough memory for the kdump kernel can lead to the kdump
operation failing.

The kdump procedure works as follows:

1 - The normal kernel is booted with crashkernel=... as a kernel option,
    reserving some memory for the kdump kernel. The memory reserved by the
    crashkernel parameter is not available to the normal kernel during regular
    operation. It is reserved for later use by the kdump kernel.
2 - The system panics.
3 - The kdump kernel is booted using kexec; it uses the memory area that was
    reserved with the crashkernel parameter.
4 - The normal kernel's memory is captured into a vmcore.

################################################################################
#When a kdump is activated
################################################################################
There are several parameters that control under which circumstances kdump is
activated. kdump can be activated when

- a system hang is detected through the Non-Maskable Interrupt (NMI) Watchdog
  mechanism. This mechanism is enabled through the nmi_watchdog=1 kernel
  parameter.
  Refer to "What is NMI and what can I use it for?" for details.
- the hardware NMI button is pressed. This mechanism is enabled by setting the
  sysctl kernel.unknown_nmi_panic=1.
- the out-of-memory killer (oom-killer) would otherwise be triggered. This can
  be configured by setting the sysctl vm.panic_on_oom=1.
- an "unrecovered" NMI has occurred. This mechanism is enabled by setting the
  sysctl kernel.panic_on_unrecovered_nmi=1.
  The following kernel warning messages are associated with "unrecovered" NMIs:

    Uhhuh. NMI received for unknown reason *hexnumber* on CPU *CPUnumber*.
    Do you have a strange power saving mode enabled?
    Dazed and confused, but trying to continue

Under many circumstances it is advisable to enable multiple tunables from the
above list. As an example, for hang events it is advisable to enable
kernel.unknown_nmi_panic, kernel.softlockup_panic, and also nmi_watchdog=1.
This increases the likelihood that a vmcore will result from an event that an
administrator may not be directly monitoring at the time.

################################################################################
#Caveats
################################################################################
Update to latest patches
    There are many patches that affect whether kdump runs correctly

If clustered
    Fencing needs to allow enough time to get a vmcore

If on HP hardware
    ASR needs to be disabled on systems with large memory

Other items
    If you are dumping to local storage and use the hpsa storage module, you
    may run into difficulty capturing a core. In that event, please ensure you
    are on the latest kexec-tools package.

    To output a list of configured dump locations, run the following egrep
    command:

    egrep \
      "path|raw|nfs|ssh|ext4|ext3|ext2|minix|btrfs|xfs|auto" \
      /etc/kdump.conf \
      | grep -v ^#

    Console frame-buffers and X are not properly supported.
    On a system typically run with something like "vga=791" in the kernel
    config line or with X running, console video will be garbled when a kernel
    is booted via kexec. The kdump kernel should still be able to capture a
    dump, and when the system reboots, video should be restored to normal.

    debug_mem_level is a new parameter from RHEL6.3. It turns on debug/verbose
    output of kdump scripts regarding free/used memory at various points of
    execution. A higher level means more debugging output.

    If unable to obtain a kernel dump but the machine can be rebooted, consider
    checking the system's RAM.
        RPM - memtest86+
        CMD - memtest-setup

#----------------------------------------------------------------------
Issue
    Kdump fails/hangs on HP BL460c G7 using P220i/P410i controller with the
    following message on console:

    hpsa 0000:05:00.0: hpsa0: <0x233b> @ IRQ 105 using DAC
    INFO: task insmod:276 blocked for more than 120 seconds.

Environment
    Red Hat Enterprise Linux 6
    HP BL460c-G7, Controller P220i, Firmware 1.29
    HP BL460c-G7, Controller P220i, Firmware 3.04
    HP BL460c G7, Controller P410i, Firmware 3.52
    HP BL460c G7, Controller P410i, Firmware 5.06

Resolution
    Firmware updates
        For HP BL460c G7, Controller P410i, Firmware should be >= 5.70
        For HP BL460c-G7, Controller P220i, Firmware should be >= 3.04

    Make sure the kernel parameter "noapic" is not included or passed via the
    KDUMP_COMMANDLINE_APPEND directive

    File /etc/sysconfig/kdump
    KDUMP_COMMANDLINE_APPEND="irqpoll nr_cpus=1 reset_devices \
    cgroup_disable=memory mce=off"

    grep KDUMP_COMMANDLINE_APPEND /etc/sysconfig/kdump | grep noapic

    If noapic **IS** used in the kdump configuration line, remove it, save the
    file, and then rebuild kdump.

#----------------------------------------------------------------------
Issue
    Certain HP ProLiant servers may still be unable to generate a crash dump
    (vmcore file) even if the Kdump service is configured correctly.

Environment
    Red Hat Enterprise Linux 5
    Red Hat Enterprise Linux 6
    HP ProLiant server, certain models.
Resolution
    There are a number of issues which should be addressed in order to generate
    a crash dump (vmcore file) on certain models of HP ProLiant servers even
    after the Kdump service is configured correctly. Double-check whether each
    of the solutions below is applicable to your exact server model.

    - Intel-based servers have certain issues with crashdump when the
      intel_iommu kernel parameter is not set to off. See the related article
      "Cannot collect a vmcore with kdump while Intel IOMMU is turned enabled"
      and Red Hat Bugzilla 719237 for the details.
    - Kdump may fail to store the crashdump file on a local drive with a cciss
      RAID controller running older firmware versions. See the article "kdump
      fail to dump core file to cciss target running firmware versions lower
      than v5.06" for the details. Note the comment from HP Support stating
      that certain issues can be resolved with firmware version 5.70 or later.
    - Kdump may fail on HP ProLiant servers using the hpsa driver for local
      drives. See the article "Why does kdump fail on HP system using the
      'hpsa' driver for storage in Red Hat Enterprise Linux 6?" for the
      details.
    - HP ProLiant servers with a large amount of RAM, 384G for example, may be
      hitting the issue described in the following HP Advisory: "HP ProLiant
      Servers with a Large Memory Configuration - Linux kdump May Not Collect a
      Dump if the hp-asrd Service and hpwdt Are Enabled When a Panic Occurs"
    - An HP ProLiant server may be hitting the issue described in the following
      HP Advisory: "Red Hat Enterprise Linux 6.2 ... - Linux Kernel May Be
      Tainted on HP ProLiant Servers Configured With an Intel Xeon
      E5-2600-Series Processor"
    - An HP ProLiant server with a large amount of RAM may simply not have
      enough space on the Kdump target volume to save the crashdump. Configure
      the core_collector option in /etc/kdump.conf to compress the vmcore file
      so its size fits the space available on the partition that it is written
      to.
      See the following section of the solution for the details on the
      core_collector option: "Sizing Local Dump Targets"
    - Setting crashkernel=auto may not reserve enough memory for the crash
      kernel if a certain number of 3rd-party modules are used. This means that
      the OOM killer can wake up and kill processes while running the crash
      kernel. In this case more memory might have to be reserved with a
      crashkernel kernel parameter. So, if a test crashdump fails it is a good
      strategy to verify if it works with crashkernel=256M@0M or even with
      crashkernel=768M@0M. If it still fails, do further debugging of the
      memory requirements using the debug_mem_level option in /etc/kdump.conf.
      See more details on this in the articles "How should the crashkernel
      parameter be configured for using kdump on RHEL6?" and "kdump memory
      usage improvements included in Red Hat Enterprise Linux 6.2".
    - Kdump is properly configured and works the first time a crash happens,
      but fails to work on subsequent crashes. See "Why Kdump Validation Fails
      When Invoking Crash Dump Using HP Integrated Lights-Out (iLO)?"

    If Kdump still fails to generate a crashdump, a full console output from
    the moment the crashdump was initiated is needed. Configure kernel log
    redirection to the COM port and save the COM port data according to the
    article "How to setup virtual serial console for a HP system with iLo?".

#----------------------------------------------------------------------
Issue
    Kdump Sending Dump File to Root Filesystem Even Though It Should Be Sent to
    Another Server via SSH
    Kdump fails to dump to NFS server

Environment
    Red Hat Enterprise Linux 5
    kexec-tools-1.102pre-154

Resolution
    Update to kexec-tools-1.102pre-161.el5 (from RHBA-2013-0012) or later.
    This addresses an issue tracked through Red Hat private bugzilla #802928.
    Make sure the /etc/sysconfig/network-scripts/ifcfg-bond* files contain the
    line:
        BOOTPROTO=static
    Then force the system to build a new kdump initrd:
        # touch /etc/kdump.conf
        # service kdump restart

Root Cause
    A change in the way kdump handles bonding network devices prevents network
    devices from being configured correctly if they have static IP addresses,
    but are not marked as static devices.

#----------------------------------------------------------------------
Issue
    Item fence_kdump doesn't work when using a bonding device for heartbeat
    kdump can not save vmcore via interface bond1
    If a system has more than one bonding device, and the kdump target network
    file server is reached through bondX (which is not the first bond device,
    bond0), then kdump may fail to save the vmcore because the kdump kernel
    cannot bring up any interface other than bond0.

Environment
    Red Hat Enterprise Linux 6
    Red Hat Enterprise Linux 5
    bonding
    kdump

Resolution
    Fixed in Errata RHBA-2013:0281-1 (private bug 859824).
    Alternatively, as a workaround, add the line below to /etc/kdump.conf if
    there are 2 bonding devices in the system. The max_bonds parameter must
    always match the number of bonding devices on the system.

        options bonding max_bonds=2

Root Cause
    The max_bonds parameter specifies the number of bonding devices to create
    for this instance of the bonding driver. E.g., if max_bonds is 3, and the
    bonding driver is not already loaded, then bond0, bond1 and bond2 will be
    created. The default value is 1.

Diagnostic Steps
    Check to see whether your heartbeat network is on a bond device other than
    bond0

#----------------------------------------------------------------------
Issue
    When using kdump over NFS with a target specified as hostname, the
    resolving of the IP address does not work during startup.
    unable to mount NFS during a kdump

Environment
    Red Hat Enterprise Linux 5
    Red Hat Enterprise Linux 6
    kdump
    bnx2 driver

Resolution
    The bnx2 driver takes some time to initialize.
    Adding a delay to the kdump configuration will fix the issue:
        Append link_delay 60 to /etc/kdump.conf
        Rebuild the kdump initrd (service kdump restart)

Root Cause
    The network card (bnx2) needs some time to initialize.

Diagnostic Steps
    Ensure that all configuration is correct in /etc/kdump.conf
    Touch /nfs/location to ensure that it's writable
    View the current kdump in process: if kdump is started and on the server
    you don't see any files being transferred, the issue may be related to the
    network.

#----------------------------------------------------------------------
Issue
    On a system with BCM5718, the kdump to a remote host using ssh fails 3 out
    of 5 tries because no link is detected. The following messages are
    displayed when it fails:

    mapping eth0 to eth0
    Saving to remote location root@192.168.110.23
    lost connection
    Attempting to enter user-space to capture vmcore

    Please note that:
    The normal kernel always works.
    The problem doesn't happen when testing a NetXtreme BCM5709 NIC, which
    unfortunately uses the bnx2 driver instead of tg3.
    Using a small statically compiled application to set autoneg on after
    bringing the interface UP to force a PHY reset didn't help.
    Passing acpi=off didn't help.
    Adding "link_delay 120" still fails 1 out of 5 tries.

    eth0 Link Up. Waiting 120 Seconds
    Continuing
    Saving to remote location root@XXX.XXX.XXX.XXX
    lost connection
    ...
    Shutting down interface eth0:
    tg3: eth0: Link is up at 1000 Mbps, full duplex.
    tg3: eth0: Flow control is on for TX and on for RX

    If the interface used by kdump is NOT brought UP in the normal kernel, then
    it works 100% of the attempts.
    Looping over ifup eth0; ifdown eth0; in the normal kernel always gets a
    link established.

Environment
    Red Hat Enterprise Linux 5.6
    Red Hat Enterprise Linux 6.1

Resolution
    In RHEL5, update to kernel-2.6.18-348.el5 (from RHBA-2013-0006) or later.
    In RHEL6, update to kernel-2.6.32-279.el6 (from RHSA-2012-0862) or later.
Root Cause
    The kdump kernel maintains the configuration of MSI-X interrupts as created
    by the crashed kernel but enables only one CPU in the new environment.
    Previously, this caused the tg3 driver to abort MSI-X setup, which caused
    interrupt delivery to fail. Consequently, the link became unavailable and
    any attempt to dump a core file to a remote host failed. With this update,
    the tg3 driver has been modified to enforce single-vector MSI-X interrupt
    mode by disabling the multi-vector interrupt mode for tg3 in the kdump
    kernel. The NIC is now brought up as expected and kdump can successfully
    dump a core file to the remote host in this scenario.

Diagnostic Steps
    I confirmed this problem occurs with 2.6.18-238.5.1.el5 as the latest RHEL5
    kernel.
    Check if the interface works when it has not been initialized in the normal
    kernel.
    Check if the interface works when restarted a few times in the normal
    kernel.
    Provided a small program to issue a phyreset (turn autoneg on) to see if
    the link is negotiated.
    (The first test with 5 seconds seemed not enough, increasing to 20s.)

#----------------------------------------------------------------------
Issue
    Kdump fails on a KVM guest if balloon memory is involved
    no vmcore is dumped due to an OOM-killer storm within Kdump context

Environment
    Red Hat Enterprise Linux 6
    KVM guest
    kdump attempting to dump a vmcore

Resolution
    Update the kexec-tools package to the following errata: RHBA-2013-0281
    As a workaround, one can issue the following commands at a root shell
    session:
    i.  echo "blacklist virtio_balloon" >> /etc/kdump.conf
    ii. touch /etc/kdump.conf && /etc/init.d/kdump restart

################################################################################
#KDUMP config: READMEs
################################################################################

#######################################
#Time needed to get a dump
#######################################
Dumping time depends on the options that are used for its configuration.
Below are some of the factors which should be taken into consideration.
    Storage speed
    Memory speed
    Data compression used
    Dump filter level
    Network, if the dump target is remote storage

An estimate can be made by dumping the whole memory to disk and measuring the
time needed; that time can be taken as a probable value for dumping the vmcore.
However, it is only a generic value; no table gives specific time statistics.

#######################################
#Dump Level - Pages to filter
#######################################
#Uses a BIT MASK
#makedumpfile [-d DL]:
#Specify the type of unnecessary page for analysis.
#Pages of the specified type are not copied to DUMPFILE. The page type
#marked in the following table is excluded. A user can specify multiple
#page types by setting the sum of each page type for Dump_Level (DL).
#The maximum of Dump_Level is 31.
#Note that Dump_Level for Xen dump filtering is 0 or 1.
# DL of 0  - gets ALL pages
# DL of 1  - gets all pages EXCEPT zero pages
# DL of 17 - exclude Zero pages (1) and Free pages (16) = 17
# DL of 31 - gets ONLY active pages (excludes all other types)
#       |         cache    cache
#  Dump |  zero   without  with     user    free
# Level |  page   private  private  data    page
# ------+---------------------------------------
#     0 |
#     1 |   X
#     2 |           X
#     4 |           X        X
#     8 |                            X
#    16 |                                    X
#    31 |   X       X        X       X       X

RH commonly shows 31 (only active, nothing else)
Matt suggests 17,31 (no free and no zero :OR: get only active)
    if "not enough room" then it goes to level 31 and retries

#######################################
#Size crashkernel value
#######################################
Matt suggests: crashkernel=auto
Otherwise, if we ever grow/change memory on a system, we have to remember to
change this.
Use of auto is recommended in RHEL7.
Starting with RHEL6.2 kernels, crashkernel=auto should be used. But there are
caveats with memory size (see the notes below).
Before that, size values must be calculated.
Use:
    grubby --update-kernel=DEFAULT --args=crashkernel=$crashkernel_para

###################
#RHEL 5
###################
crashkernel=memory@offset

+---------------------------------------+
| RAM       | crashkernel | crashkernel |
| size      | memory      | offset      |
|-----------+-------------+-------------|
| 0 - 2G    | 128M        | 16M         |
| 2G - 6G   | 256M        | 24M         |
| 6G - 8G   | 512M        | 16M         |
| 8G - 24G  | 768M        | 32M         |
+---------------------------------------+

For RAM size greater than 24G:
    Try a crashkernel memory of 768M and a crashkernel offset of 32M, which
    looks like 768M@32M.
    If you get an Out-Of-Memory error message, then try increasing the
    crashkernel parameter to 896M.

Depending on your system BIOS memory layout, you may need to alter the offset.
A complete procedure to correctly determine the maximum possible size and
precise offset is located at: "How to properly calculate the crashkernel
setting?"

Additional Notes
    Always test to ensure that the kdump service starts correctly and that the
    system is able to correctly dump by initiating a test.
    The offset for the kdump memory reservation (crashkernel=X@Y) must be
    specified in RHEL5. Not specifying the offset (crashkernel=X) is not a
    valid configuration under RHEL5, although it is valid under RHEL6.
    kdump fails to initialise with crashkernel=1024M@16M on RHEL5 kernels
    earlier than 2.6.18-274.el5.
    RHEL6's kdump is more memory-efficient than RHEL5's. It is likely more
    memory will need to be assigned on RHEL5 than on the same system running
    RHEL6.
    For settings of kdump on other versions of Red Hat Enterprise Linux, please
    refer to: "How should the crashkernel parameter be configured for using
    kdump on Red Hat Enterprise Linux?"
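
The RHEL 5 table above can be expressed as a small lookup. This is a minimal
sketch, not part of kexec-tools: the function name rhel5_crashkernel is
hypothetical, RAM is passed in whole gigabytes, and a value on a range boundary
is assumed to belong to the lower row (exactly 2G maps to 128M@16M). Above 24G
it returns the 768M@32M starting point suggested above; whatever it prints
still has to be validated with a test dump.

```shell
# Hypothetical helper: map total RAM (whole GB) to the RHEL 5
# crashkernel=memory@offset value from the table above.
rhel5_crashkernel() {
    local ram_gb=$1
    if   [ "$ram_gb" -le 2 ]; then
        echo "128M@16M"
    elif [ "$ram_gb" -le 6 ]; then
        echo "256M@24M"
    elif [ "$ram_gb" -le 8 ]; then
        echo "512M@16M"
    else
        # 8G-24G row; also the suggested first try for systems above 24G
        echo "768M@32M"
    fi
}
```

For example, "rhel5_crashkernel 4" prints 256M@24M, which can then be used as
the crashkernel= value in the grubby command above.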
###################
#RHEL 6
###################
Configuring crashkernel on RHEL6.0 and RHEL6.1 kernels

The code for printing the warning:

    Your running kernel is using more than 70% of the amount of space you
    reserved for kdump, you should consider increasing your crashkernel
    reservation

is part of the script /etc/init.d/kdump. The involved code first reads the Slab
value from /proc/meminfo. Slab is the in-kernel data structures cache; this
value depends on the total amount of RAM present in the system as well as on
other factors. The value is not consistent and can change during operation of
the server. If the Slab value is bigger than 70% of the memory that was
reserved with the crashkernel parameter, then the warning is printed.

Some mappings of RAM and appropriate crashkernel values:

    ram size   crashkernel parameter   ram / crashkernel factor
    >0GB       128MB                   15
    >2GB       256MB                   23
    >6GB       512MB                   15
    >8GB       768MB                   31

The last column contains a ram/crashkernel factor. The table is covered by the
following crashkernel configuration:

    crashkernel=0M-2G:128M,2G-6G:256M,6G-8G:512M,8G-:768M

For servers with more RAM it is recommended to compute the crashkernel
parameter using the factors that have been observed so far: 15 to stay on the
safe side (maybe wasting memory); using a factor of 20 should also work.
Please also note that the maximum size of RAM that should be reserved here is
896M, as outlined in (private) bz580843.

Configuring crashkernel on RHEL6.2 (and later) kernels

Starting with RHEL6.2 kernels, crashkernel=auto should be used. The kernel will
automatically reserve an appropriate amount of memory for the kdump kernel.
Keep in mind that it is a best-effort memory reservation and might not meet the
needs of all systems (especially configurations with lots of IO cards and
loaded drivers). So always make sure that memory reserved by crashkernel=auto
is sufficient for the target machine by testing kdump.
If it is not, reserve more memory with the syntax crashkernel=XM (X is the
amount of memory to be reserved in megabytes).

Additionally, some improvements have been made in the RHEL6.2 kernel which have
reduced the overall memory requirements of kdump. For more details refer to the
article "kdump memory usage improvements included in Red Hat Enterprise Linux
6.2".

The amount of memory reserved for the kdump kernel can be estimated with the
following scheme:

    base memory to be reserved = 128MB
    an additional 64MB added for each TB of physical RAM present in the system.
    So for example if a system has 1TB of memory, 192MB (128MB + 64MB) will be
    reserved.

Note: It is recommended to verify that kdump is working on all systems after
installation of all applications. The memory reserved by crashkernel=auto takes
only typical RHEL configurations into account. If 3rd party modules are used,
more memory might have to be reserved. Thus, if a test dump fails it is a good
strategy to verify if it works with crashkernel=768M@0M and, if it does, do
further debugging of the memory requirements using the debug_mem_level option
in /etc/kdump.conf. Until a test dump works without failure, kdump should not
be considered properly configured.

Note: Prior to the 6.3GA release, crashkernel=auto will only reserve memory on
systems with 4GB or more physical memory. If the system has less than 4GB of
memory, the memory must be reserved by explicitly requesting the reservation
size, for example: crashkernel=128M. Since the 6.3GA release
(kernel-2.6.32-279.el6), this limit has been lowered to 2GB.

Note: Some environments still require manual configuration of the crashkernel
option, for example if dumps to very large local filesystems are performed.
Please refer to "kdump fails with large ext4 file system because fsck.ext4 gets
OOM-killed" for details.

Further information
    If you are experiencing problems with your crashkernel setting see "How to
    properly size and position the crashkernel?"
    For settings of kdump on other versions of Red Hat Enterprise Linux, please
    refer to: "How should the crashkernel parameter be configured for using
    kdump on Red Hat Enterprise Linux?"

Root Cause
    A number of improvements related to crashkernel=auto and memory
    requirements of kdump have been made in the RHEL6.2 kernel.

Diagnostic Steps
    The method used (pre-6.2) to calculate the approximate amount of RAM the
    normal kernel is using (from /etc/init.d/kdump):

    KMEMINUSE=`awk '/Slab:.*/ {print $2}' /proc/meminfo`

Question: Is it possible to find out how much memory was reserved for the
kdump kernel?
Answer: This is available when executing cat /proc/cmdline. Even when the
kernel was started with crashkernel=auto, /proc/cmdline will contain the
computed value that got reserved. To verify that crashkernel=auto was really
used, the contents of /var/log/dmesg can be used.

    cat /proc/cmdline
    cat /sys/kernel/kexec_crash_size

Question: I found out that 'sync; echo 3 > /proc/sys/vm/drop_caches' frees up
Slab, can I use this regularly and then use a lower value for 'crashkernel'?
Answer: This is not recommended. This command drops filesystem caches; when
processes request the data afterwards, it has to be read again from
disk/block devices, resulting in degraded system performance.

Question: On my system I did set up kdump. When triggering the kdump, kdump is
not loaded completely.
Answer: Are 3rd party drivers in use on the system, changing memory
requirements? Does the system successfully kdump when crashkernel=768M@0M is
used, or a different manual allocation that is bigger than the amount of memory
that crashkernel=auto did reserve for the crash kernel? If this is the case,
then with the debug_mem_level option in /etc/kdump.conf the required amount of
memory can be found out and the memory that has to be reserved for the
crashkernel can be cut down.

###################
#RHEL 7
###################
Starting with RHEL7 kernels, crashkernel=auto should be used.
The kernel will automatically reserve an appropriate amount of memory for the
kdump kernel. Keep in mind that it is a best-effort memory reservation and
might not meet the needs of all systems (especially configurations with lots of
IO cards and loaded drivers). So always make sure that memory reserved by
crashkernel=auto is sufficient for the target machine by testing kdump. If it
is not, reserve more memory with the syntax crashkernel=XM (X is the amount of
memory to be reserved in megabytes).

The amount of memory reserved for the kdump kernel can be estimated with the
following scheme:

    base memory to be reserved = 160MB
    an additional 2 bits added for every 4 KB of physical RAM present in the
    system.
    So for example if a system has 1TB of memory, 224 MB is the minimum
    (160 + 64 MB).

Note: It is recommended to verify that kdump is working on all systems after
installation of all applications. The memory reserved by crashkernel=auto takes
only typical RHEL configurations into account. If 3rd party modules are used,
more memory might have to be reserved. Thus, if a test dump fails it is a good
strategy to verify if it works with crashkernel=768M@0M and, if it does, do
further debugging of the memory requirements using the debug_mem_level option
in /etc/kdump.conf. Until a test dump works without failure, kdump should not
be considered properly configured.

Note: RHEL7 with crashkernel=auto will only reserve memory on systems with 2GB
or more physical memory. If the system has less than 2GB of memory, the memory
must be reserved by explicitly requesting the reservation size, for example:
crashkernel=128M.

Note: Some environments still require manual configuration of the crashkernel
option, for example if dumps to very large local filesystems are performed.
Please refer to "kdump fails with large ext4 file system because fsck.ext4 gets
OOM-killed" for details.
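
The RHEL 7 estimate above works out to 64 KB per GB of RAM (2 bits for each
4 KB page) on top of the 160 MB base. A minimal sketch of that arithmetic
follows. The function name rhel7_auto_estimate_mb is hypothetical, RAM is
passed in whole gigabytes, and the result is rounded down to whole megabytes.
It only reproduces the estimate described in these notes, not the kernel's
actual reservation logic.

```shell
# Hypothetical helper: estimate the crashkernel=auto reservation (in MB)
# on RHEL 7 from total RAM given in whole GB.
rhel7_auto_estimate_mb() {
    local ram_gb=$1
    # 2 bits per 4 KB page = 64 KB of extra reservation per GB of RAM
    local extra_kb=$(( ram_gb * 64 ))
    # 160 MB base plus the per-RAM part, rounded down to whole MB
    echo $(( 160 + extra_kb / 1024 ))
}
```

For example, "rhel7_auto_estimate_mb 1024" (1TB of RAM) prints 224, matching
the 160 + 64 MB figure above. The output of cat /sys/kernel/kexec_crash_size
on a running system can be compared against this estimate.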
Further information
    RHEL7 product documentation: Kernel Crash Dump Guide
    RHEL7 product documentation: Kernel Crash Dump Guide: kdump memory
    requirements
    "How should the crashkernel parameter be configured for using kdump on Red
    Hat Enterprise Linux?"

Root Cause
    A number of improvements related to crashkernel=auto and memory
    requirements of kdump have been made in RHEL7.

#######################################################
#What is the SysRq Facility and how do I use it?
#######################################################
#The SysRq facility is one of the best (and sometimes the only) way
#to determine what a machine is really doing. When triggered, SysRq
#will send a signal requesting some diagnostic information to the
#operating system kernel. This is most useful when a system appears
#to be "hung", and for diagnosing elusive, transient, kernel-related
#problems.

#What is the "Magic" SysRq key?
#According to the Linux kernel documentation:
#It is a 'magical' key combo you can hit to which the kernel will
#respond regardless of whatever else it is doing, even if the
#console is unresponsive.

#How do I enable and disable the SysRq key?
#For security reasons, Red Hat Enterprise Linux disables the SysRq
#key by default. To enable it, run:
echo 1 > /proc/sys/kernel/sysrq

#And to disable it again:
echo 0 > /proc/sys/kernel/sysrq

#To enable it permanently, set the kernel.sysrq value in
#/etc/sysctl.conf to 1. This will cause it to be enabled on reboot.
# grep sysrq /etc/sysctl.conf
kernel.sysrq = 1

#Since enabling SysRq gives someone with physical console access extra
#abilities, it is recommended to disable it when not troubleshooting a
#problem or to ensure that physical console access is properly secured.

#How do I trigger a SysRq event?
#There are several ways to trigger a SysRq event.
#On most architectures SysRq events can be triggered from the console with
#the following key combination:
Alt+PrintScreen+[CommandKey]

#For instance, to tell the kernel to dump memory info (command key
#"m"), you would hold down the Alt and Print Screen keys, and then
#hit the m key.

#Note that this will not work from an X Window System screen. You
#should first change to a text virtual terminal. Hit Ctrl+Alt+F1 to
#switch to the first virtual console prior to hitting the SysRq key
#combination.

#On a serial console, you can achieve the same effect by sending a
#Break signal to the console and then hitting the command key within
#5 seconds. This also works for virtual serial console access
#through an out-of-band service processor or remote console like HP
#iLO, Sun ILOM and IBM RSA. Refer to service processor specific
#documentation for details on how to send a Break signal; for
#example, How to trigger SysRq over an HP iLo Virtual Serial Port (VSP).

#If you have a root shell on the machine (and the system is
#responding enough for you to do so), you can also write the command
#key character to the /proc/sysrq-trigger file. This is useful for
#triggering this info when you are not on the system console or for
#triggering it from scripts.
echo 'm' > /proc/sysrq-trigger

#This method has the additional benefit of working even when
#kernel.sysrq is set to 0.

#When I trigger a SysRq event that generates output, where does it go?
#When a SysRq command is triggered, the kernel will print out the
#information to the kernel ring buffer and to the system console.
#This information is normally logged via syslog to /var/log/messages.

#Unfortunately, when dealing with machines that are extremely
#unresponsive, syslogd is often unable to log these events. In these
#situations, provisioning a serial console is often recommended for
#collecting the data.

#What sort of SysRq events can be triggered?
#There are several SysRq events that can be triggered once the SysRq
#facility is enabled. These vary somewhat between kernel versions,
#but there are a few that are commonly used:
-m - dump information about memory allocation
-t - dump thread state information
-p - dump current CPU registers and flags
-c - intentionally crash the system (useful for forcing a disk or netdump)
-s - immediately sync all mounted filesystems
-u - immediately remount all filesystems read-only
-b - immediately reboot the machine
-o - immediately power off the machine (if configured and supported)
-f - start the Out Of Memory Killer (OOM)
-w - dump tasks that are in uninterruptible (blocked) state
     [Introduced with kernel 2.6.32]

#Before using the SysRq facility, please consult with your vendors
#as third party applications may be impacted.

################################################################################
#KDUMP config: BEGIN
################################################################################

#######################################
# Some manpages
#######################################
# man makedumpfile
# man makedumpfile.conf
# man crash

#######################################
# Areas needing changes
#######################################
#need pkg: kexec-tools
#grub.conf: add kernel option
#kdump.conf: add/modify
#sysctl.conf: add entries
#turn on: chkconfig kdump on
#reboot server to enable changes to system

#######################################
# 1 - Local (raw or fs) or NFS or SSH
#######################################

#######################################
#PRE - all methods
#######################################
rpm -q kexec-tools || yum -y install kexec-tools
#If ppc64|s390x -> yum -y install kernel-dump

#######################################
#Local
#######################################

#######################################
#Local - Filesystem (preferred)
#######################################
#Disk(s) should be equal to memory, but, if you
use the right options
#it will only use somewhere from 3-10GB for the active pages
#We will be using /crash_dir for our area (shared with mondo ISOs)
# FS - should already be there
MYDISK=/dev/mapper/mpathdX
NEWSIZE=xG
pvcreate $MYDISK [$MYDISK ...]
vgcreate vgcrash $MYDISK [$MYDISK ...]
lvcreate -n lvcrash -L $NEWSIZE vgcrash
mkfs -t ext3 /dev/mapper/vgcrash-lvcrash
#get UUID
blkid /dev/mapper/vgcrash-lvcrash | tee /tmp/myuuid
vi /etc/kdump.conf
 #path /var/crash
 #path /
 path /crash_dir
 ext3 UUID=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
 #(in vi, ":r /tmp/myuuid" reads in the saved blkid line to copy the UUID from)
 core_collector makedumpfile -c --message-level 1 -d 17,31
 default reboot

#######################################
#Local - raw
#######################################
#Disk should be equal to memory, but, if you use the right options
#it will only use somewhere from 3-10GB for the active pages
#This extra disk can be temporary: add it while the issue is
#happening and remove it once the issue is solved.
# raw use, no FS
NEWSIZE=XXXG
MYDISK=/dev/mapper/mpathdX
pvcreate $MYDISK [$MYDISK ...]
vgcreate vgcrash $MYDISK [$MYDISK ...]
lvcreate -n lvrawcrash -L $NEWSIZE vgcrash
blkid /dev/mapper/vgcrash-lvrawcrash #get UUID
#NOTE: an LV with no filesystem has no UUID for blkid to report.
# kdump.conf also accepts a raw target, e.g.:
# raw /dev/mapper/vgcrash-lvrawcrash
vi /etc/kdump.conf
 path /
 #ext3 UUID=294d60b2-96a1-4135-aba1-ebe1d8af65f1
 ext3 UUID=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
 core_collector makedumpfile -c --message-level 1 -d 17,31
 default reboot

#######################################
#NFS
#######################################
#setup normal export rules for the client's access,
# root write needed
#
# old kdump.conf directive: net
# new kdump.conf directive: nfs, nfs4
#
vi /etc/kdump.conf
 nfs NFSSERVER:EXPORTED-MTPT
 #example: nfs sever.com:/crashcores
 #In the NFSMOUNT directory, kdump will create a subdir of:
 # ./var/crash/%HOSTIP-%DATE
 core_collector makedumpfile -c --message-level 1 -d 17,31
 default reboot
#
# Example output from a kdump on NFS mounts
#
#sever.com:/unixteam/ISO 108G 97G 5.8G 95% /nfs/matt
#ls -lRh /nfs/matt/var
#/nfs/matt/var:
#total 4.0K
#drwxr-xr-x 4 nobody nobody 4.0K Aug 28 15:47 crash
#
#/nfs/matt/var/crash:
# ... 10.131.164.40-2015-08-28-19:47:47
# ... 10.131.248.220-2015-08-28-14:57:14
#
#/nfs/matt/var/crash/10.131.164.40-2015-08-28-19:47:47:
#total 200M
#-rw------- 1 nobody nobody 199M Aug 28 15:51 vmcore
#-rw-r--r-- 1 nobody nobody 77K Aug 28 15:47 vmcore-dmesg.txt
#
#/nfs/matt/var/crash/10.131.248.220-2015-08-28-14:57:14:
#total 1.1G
#-rw------- 1 nobody nobody 1.1G Aug 28 15:17 vmcore
#-rw-r--r-- 1 nobody nobody 64K Aug 28 14:57 vmcore-dmesg.txt

#######################################
#SSH
#######################################
#have enough room on other server
#have account there
#have ssh keys
#
# old kdump.conf directive: net
# new kdump.conf directive: ssh
#
vi /etc/kdump.conf
 ssh USER@SERVER:/var/crash/%HOST-%DATE
 core_collector makedumpfile -c --message-level 1 -d 17,31
 default reboot

#########################################
# 2 - KDump-Helper ..or.. GUI ..or..
CLI
#########################################
#
# the example here uses a RAW local disk
# adjust those references to net if you use a different device
#

#######################################
# KDump-Helper
#######################################
# Get from internal location
# OR from access.redhat.com/labs/kdumphelper
# which asks you config questions and gives you an output of:
#  a script to run to change things,
#  or discrete files to put in place
kdumpconfig.sh

#######################################
# CLI
#######################################
#Edit kdump.conf, enable/set the following
#vi /etc/kdump.conf
#should have been done in previous step 1

#Update grub.conf
#Make sure you have "crashkernel=auto" at the end of the default
#kernel line
grep crashkernel /boot/grub/grub.conf
#We are using auto. Some older versions and some situations have
#issues with it, but it gives the best success rate and
#consistency
grubby --update-kernel=DEFAULT --args=crashkernel=auto
grep crashkernel /boot/grub/grub.conf

#Update /etc/sysctl.conf
SYSCTL_CONF=/etc/sysctl.conf
#comment out an existing line value
sed -i 's/^kernel.sysrq/#kernel.sysrq/g' $SYSCTL_CONF
echo 'kernel.sysrq=1' \
  >> $SYSCTL_CONF
sed -i 's/^kernel.unknown_nmi_panic/#kernel.unknown_nmi_panic/g' \
  $SYSCTL_CONF
echo 'kernel.unknown_nmi_panic=1' \
  >> $SYSCTL_CONF
sysctl -p

#######################################
# GUI
#######################################
#lots of pkgs - dependencies
yum -y install system-config-kdump.noarch xterm
#point DISPLAY at a host where you have an X server running
export DISPLAY=X.X.X.X:0
system-config-kdump

#######################################
#POST - all methods
#######################################
chkconfig boot.kdump on
service boot.kdump start
chkconfig kdump on
service kdump start
#will fail, but ignore, need to reboot
#IGNORE Error
# Warning: There might not be enough space to save a vmcore.
# The size of UUID=xxxxxxx should be greater than xxxxxx kilo bytes.
#UNLESS you need a FULL core of all pages, in which case your disk
# needs to be at least equal to the size of memory.
#NOTE: reboot the system IF it has not been rebooted since kdump was
# configured (the crashkernel memory is only reserved at boot);
# otherwise the restart will rebuild the initrd file and a reboot
# is then unneeded
service kdump status

#######################################
#KDUMP config: END
#######################################

################################################################################
#Managing a crash/hang
################################################################################

###################
#VM
###################
#You have an option for systems that hang, to use:
# /usr/bin/vmss2core
#You will need to create a snapshot of the VM guest
#See access.redhat.com/solutions/411653

###################
#FORCING A CRASH
###################
#
#if in a cluster, to ensure a "quick crash"
#
# echo "exit 1" > /tmp/kdump_pre; chmod 755 /tmp/kdump_pre
#
# vi /etc/kdump.conf
# kdump_pre /tmp/kdump_pre
# default halt
#
# service kdump restart
#
#######
####### crash panic crash panic #######
*** !!! WARNING - system will panic immediately !!! ***
#######
#######
#######
##### echo 'c' > /proc/sysrq-trigger
#######
#######
#######
*** !!! WARNING - system will panic immediately !!!
***
#######
####### crash panic crash panic #######

###################
#See if you have a crash file, needs to be non-zero
###################
ls -l /var/crash/*/vmcore

################################################################################
#Crash commands
################################################################################

###################
#Interactive
###################
#Need file: /usr/lib/debug/lib/modules//vmlinux
ls /usr/lib/debug/lib/modules/$(uname -r)/vmlinux
#ls /lib/modules/2.6.32-504.1.3.el6.x86_64.debug
#If not, install from the Red Hat site for your particular kernel:
#
yum -y install \
  kernel-debug \
  kernel-debug-devel
yum -y install crash crash-devel
debuginfo-install kernel

##############################
#Interactive on LIVE kernel
##############################
crash

##############################
#Work on vmcore from a panic
##############################
#CRASHPATH=/var/crash
CRASHPATH=/crash_dir/crash
#
#IF path is a local disk: 127.0.0.1-YYYY-MM-DD-HH:MM:SS
#IF nfs or ssh: %HOSTIP-YYYY-MM-DD-HH:MM:SS
#
CRASHDIR=$CRASHPATH/SOME_DIR_FILENAME_STRING

##############################
#Interactive on a vmcore
##############################
#CRASHDIR already holds the full path, so it is not prefixed with /var/crash
crash -x \
  /usr/lib/debug/lib/modules/$(uname -r)/vmlinux \
  $CRASHDIR/vmcore
#then, at the crash prompt:
 set hex
 sys > crash_data.log
 bt -a >> crash_data.log
 mod >> crash_data.log
 ps >> crash_data.log
 foreach UN bt >> crash_data.log
 kmem -i >> crash_data.log
 net -a >> crash_data.log
 dev -d >> crash_data.log
 log >> crash_data.log
 quit

##############################
#Non-interactive on a vmcore
#technique used: shell here-document
##############################
#crash /usr/lib/debug/lib/modules/`uname -r`/vmlinux \
crash -x \
  $CRASHDIR/vmcore << EOF
set hex
sys > crash_data.log
bt -a >> crash_data.log
mod >> crash_data.log
ps >> crash_data.log
foreach UN bt >> crash_data.log
kmem -i >> crash_data.log
net -a >> crash_data.log
dev -d >> crash_data.log
log >> crash_data.log
quit
EOF
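
##############################
#Sketch: find a usable vmcore and generate the crash command list
##############################
#A minimal sketch, not part of the original procedure. It repeats the
#"needs to be non-zero" vmcore check and the command list from the
#here-document above as two small helpers. The function names
#list_vmcores and write_crash_cmds are hypothetical, and the default
#path /var/crash is an assumption; adjust it to match the "path"
#setting in your kdump.conf.

```shell
#!/bin/bash
# Sketch only: list_vmcores and write_crash_cmds are hypothetical helper
# names, not part of kexec-tools or crash.

# Print every non-empty vmcore one directory level below the crash path.
# Assumption: the crash path defaults to /var/crash.
list_vmcores() {
    local crashpath=${1:-/var/crash}
    local f
    for f in "$crashpath"/*/vmcore; do
        # -s is true only for an existing file larger than zero bytes
        if [ -s "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}

# Print the crash command list used in this document, with all output
# redirected to the given log file (default: crash_data.log).
write_crash_cmds() {
    local log=${1:-crash_data.log}
    cat <<EOF
set hex
sys > $log
bt -a >> $log
mod >> $log
ps >> $log
foreach UN bt >> $log
kmem -i >> $log
net -a >> $log
dev -d >> $log
log >> $log
quit
EOF
}
```

#Possible use, assuming the debuginfo vmlinux is installed as described
#in the Interactive section:
# list_vmcores /var/crash
# write_crash_cmds crash_data.log | crash -x \
#   /usr/lib/debug/lib/modules/$(uname -r)/vmlinux PATH_TO/vmcore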
################################################################################
#Miscellaneous
################################################################################
#Filter out pages from an existing "more full" vmcore
#get only the active pages from a larger/fuller vmcore
# -c = compress
# -d = filter out pages, see other area in this doc
makedumpfile -c -d 31

################################################################################
#Bibliography
################################################################################
-Tool: https://access.redhat.com/labs/kerneloopsanalyzer/
-Main doc: https://access.redhat.com/solutions/6038
-Factors_that_can_affect_vmcore_generation_while_using_kdump._-_Red_Hat_Customer_Portal.pdf
-How_can_I_use_crash_to_send_Red_Hat_some_vmcore_pre-analysis_information_before_or_while_uploading_the_vmcore_image__-_Red_Hat_Customer_Portal.pdf
-How_should_the_crashkernel_parameter_be_configured_for_using_kdump_on_Red_Hat_Enterprise_Linux__-_Red_Hat_Customer_Portal.pdf
-How_should_the_crashkernel_parameter_be_configured_for_using_kdump_on_Red_Hat_Enterprise_Linux_5__-_Red_Hat_Customer_Portal.pdf
-How_should_the_crashkernel_parameter_be_configured_for_using_kdump_on_RHEL6__-_Red_Hat_Customer_Portal.pdf
-How_should_the_crashkernel_parameter_be_configured_for_using_kdump_on_RHEL7__-_Red_Hat_Customer_Portal.pdf
-How_to_capture_a_vmcore_of_hung_Red_Hat_Enterprise_Linux_VMware®_guest_system_using_VMware®__vmss2core__tool___-_Red_Hat_Customer_Portal.pdf
-How_to_troubleshoot_kernel_crashes_hangs_or_reboots_with_kdump_on_Red_Hat_Enterprise_Linux_-_Red_Hat_Customer_Portal.pdf
-What_is_the_SysRq_Facility_and_how_do_I_use_it__-_Red_Hat_Customer_Portal.pdf

################################################################################
#EOF
################################################################################