Kdump on CentOS 6 | linuxsysconfig

microee 2015-07-19

kdump is part of the kexec-tools package, which provides the kexec binary. kexec lets a running kernel boot a new kernel directly, either on a normal reboot or after a panic. With the help of kdump, kexec and a debug kernel, you have a much better chance of finding out why the kernel failed. When a kernel panic occurs, kexec boots a second kernel which collects the crash data and saves it in a dump file (vmcore) that helps in troubleshooting the failure.

Kernel crash

This guide shows you how to configure kdump for CentOS 6, but it should also apply to Red Hat Enterprise Linux and Fedora.

This CentOS installation is a guest OS under VirtualBox 4.2.8 and it has the latest kernel installed (as of today).

First some info about the machine

cat /etc/redhat-release
CentOS release 6.3 (Final)
uname -r
2.6.32-279.22.1.el6.x86_64
rpm -qa | grep `uname -r`
kernel-2.6.32-279.22.1.el6.x86_64
kernel-headers-2.6.32-279.22.1.el6.x86_64
kernel-devel-2.6.32-279.22.1.el6.x86_64

Install the required packages

yum --enablerepo=debug install kexec-tools crash kernel-debug kernel-debuginfo-`uname -r`

This will install all required packages and dependencies. Make sure you use `uname -r` or $(uname -r) when installing the debuginfo rpms, otherwise yum could install the latest packages available in the debug repository instead of those matching your kernel version. Also note that kernel-debuginfo is quite large (1.5-1.7GB installed), so check your free disk space before the installation.
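The disk space check can be scripted before running yum. This is only a sketch: `has_free_kb` is a hypothetical helper name, and the 2 GiB threshold is an assumed safety margin above the installed size quoted above, not a documented requirement.

```shell
# Hypothetical helper: succeed if the filesystem holding PATH ($1) has
# at least REQUIRED_KB ($2) kilobytes available.
has_free_kb() {
    # df -P prints a header line, then one line per filesystem;
    # the fourth column is the available space in 1024-byte blocks.
    avail=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ "$avail" -ge "$2" ]
}

# kernel-debuginfo unpacks under /usr/lib/debug, so check /usr.
# 2 GiB is an assumed margin, not a value taken from the package.
if has_free_kb /usr $((2 * 1024 * 1024)); then
    echo "Enough free space for kernel-debuginfo"
else
    echo "Less than 2 GiB free on the /usr filesystem"
fi

# After the installation, confirm the debuginfo matches the kernel:
#   rpm -q "kernel-debuginfo-$(uname -r)"
```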

Modify grub

A kernel argument must be added to /etc/grub.conf to enable kdump. The argument is called crashkernel and can be set either to auto or to a fixed size such as 128M, 256M or 512M. The value defines the amount of memory reserved for the capture kernel. I chose 128M for my testing.

title CentOS (2.6.32-279.22.1.el6.x86_64.debug)
root (hd0,0)
kernel /vmlinuz-2.6.32-279.22.1.el6.x86_64.debug ro root=/dev/mapper/vg_centos6-lv_root rd_NO_LUKS LANG=en_US.UTF-8 rd_LVM_LV=vg_centos6/lv_swap rd_NO_MD SYSFONT=latarcyrheb-sun16 rd_LVM_LV=vg_centos6/lv_root  KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM crashkernel=128M
initrd /initramfs-2.6.32-279.22.1.el6.x86_64.debug.img
title CentOS (2.6.32-279.22.1.el6.x86_64)
root (hd0,0)
kernel /vmlinuz-2.6.32-279.22.1.el6.x86_64 ro root=/dev/mapper/vg_centos6-lv_root rd_NO_LUKS LANG=en_US.UTF-8 rd_LVM_LV=vg_centos6/lv_swap rd_NO_MD SYSFONT=latarcyrheb-sun16 rd_LVM_LV=vg_centos6/lv_root  KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM crashkernel=128M
initrd /initramfs-2.6.32-279.22.1.el6.x86_64.img
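Instead of editing every kernel line by hand, the argument can usually be appended with grubby, which is part of a standard CentOS 6 installation. The sketch below is an illustration rather than part of the original setup: `crashkernel_value` is a hypothetical helper that extracts the crashkernel setting from a kernel command line, so that after the reboot you can confirm the argument was picked up.

```shell
# Hypothetical helper: print the value of crashkernel= found in a
# kernel command line string ($1); return non-zero if it is absent.
crashkernel_value() {
    for arg in $1; do
        case "$arg" in
            crashkernel=*)
                echo "${arg#crashkernel=}"
                return 0
                ;;
        esac
    done
    return 1
}

# Append the argument to every kernel entry (run as root; 128M is the
# value chosen in this guide):
#   grubby --update-kernel=ALL --args="crashkernel=128M"

# After the reboot, check what the running kernel was booted with.
if [ -r /proc/cmdline ]; then
    crashkernel_value "$(cat /proc/cmdline)" || echo "crashkernel not set"
fi
```

If the helper prints nothing but "crashkernel not set" after the reboot, the machine was most likely booted from a grub entry that was not updated.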

Enable kdump

chkconfig kdump on
service kdump start
No kdump initial ramdisk found.                            [WARNING]
Rebuilding /boot/initrd-2.6.32-279.22.1.el6.x86_64kdump.img
Starting kdump:                                            [  OK  ]

After this step a reboot is required in order to boot the kernel with the new argument.

shutdown -r now

Confirm kdump is active

service kdump status
Kdump is operational
cat /sys/kernel/kexec_crash_loaded
1
grep Crash /proc/iomem
03000000-12ffffff : Crash kernel
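The address range reported in /proc/iomem can be converted into a size, which makes it easier to compare the actual reservation with the crashkernel value requested in grub. A minimal sketch, assuming the line has the `start-end : Crash kernel` layout shown above; `crash_region_mib` is a hypothetical helper name.

```shell
# Hypothetical helper: given one "start-end : Crash kernel" line from
# /proc/iomem, print the size of that region in MiB.
crash_region_mib() {
    line=${1#"${1%%[![:space:]]*}"}   # drop any leading indentation
    range=${line%% *}                 # e.g. "03000000-12ffffff"
    start=${range%-*}
    end=${range#*-}
    # Both bounds are hexadecimal and the end address is inclusive.
    echo $(( (0x$end - 0x$start + 1) / 1024 / 1024 ))
}

# The sample line from this guide:
crash_region_mib "03000000-12ffffff : Crash kernel"

# On a live system (run as root so the addresses are not masked):
#   crash_region_mib "$(grep 'Crash kernel' /proc/iomem)"
```

For the sample line the range works out to 256 MiB.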

Test kdump i.e. trigger a kernel crash

### Clearly you shouldn’t do this on a production machine! ###

echo 1 > /proc/sys/kernel/sysrq
echo c > /proc/sysrq-trigger

The kernel panic should happen instantly. In theory kexec then boots the capture kernel, which gathers the crash data, and after that the machine reboots into the default kernel. In practice this doesn’t always happen. You may need to tweak the configuration files (/etc/kdump.conf and /etc/sysconfig/kdump) or try different crashkernel options in grub.

There could also be issues with the debug kernel and some existing kernel modules (e.g. megaraid) so you might need to explicitly add those to the extra_modules line in /etc/kdump.conf or prevent them from being added to initrd by using the mkdumprd utility (and its omit-raid-modules option).
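Before changing either configuration file, it helps to see which settings are currently active, since both files consist mostly of comments. The snippet below is a small sketch; `active_settings` is a hypothetical helper name and is not part of kexec-tools.

```shell
# Hypothetical helper: print the active (non-comment, non-blank) lines
# of a configuration file such as /etc/kdump.conf.
active_settings() {
    grep -Ev '^[[:space:]]*(#|$)' "$1"
}

# Show the effective settings of both kdump configuration files.
for f in /etc/kdump.conf /etc/sysconfig/kdump; do
    if [ -r "$f" ]; then
        echo "== $f"
        active_settings "$f"
    fi
done
```

After editing /etc/kdump.conf, restart the service (service kdump restart) so that the kdump initrd is rebuilt with the new settings.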

Analysing the crash dump

By default the crash dump is saved as a file named vmcore in a timestamped directory under /var/crash. With the help of the crash utility you can try to investigate what happened. Most of the data is pretty cryptic, but the built-in commands give you at least some idea of what went wrong.

crash /usr/lib/debug/lib/modules/2.6.32-279.22.1.el6.x86_64/vmlinux /var/crash/127.0.0.1-2013-03-03-20\:14\:21/vmcore
KERNEL: /usr/lib/debug/lib/modules/2.6.32-279.22.1.el6.x86_64/vmlinux
DUMPFILE: /var/crash/127.0.0.1-2013-03-03-20:14:21/vmcore  [PARTIAL DUMP]
CPUS: 2
DATE: Sun Mar  3 20:13:14 2013
UPTIME: 00:00:56
LOAD AVERAGE: 0.08, 0.03, 0.01
TASKS: 188
NODENAME: centos6.3
RELEASE: 2.6.32-279.22.1.el6.x86_64
VERSION: #1 SMP Wed Feb 6 03:10:46 UTC 2013
MACHINE: x86_64  (2467 Mhz)
MEMORY: 4 GB
PANIC: "Oops: 0002 [#1] SMP " (check log for details)
PID: 8473
COMMAND: "bash"
TASK: ffff88011b550040  [THREAD_INFO: ffff880119322000]
CPU: 0
STATE: TASK_RUNNING (PANIC)
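Because every crash gets its own timestamped directory, the path passed to crash changes each time. The following sketch picks the most recent dump automatically; `latest_vmcore` is a hypothetical helper name, and it assumes the default /var/crash layout shown above.

```shell
# Hypothetical helper: print the most recently modified vmcore found
# one level below the given crash directory (default: /var/crash).
latest_vmcore() {
    dir=${1:-/var/crash}
    # ls -t sorts newest first; errors are silenced if nothing matches.
    ls -1t "$dir"/*/vmcore 2>/dev/null | head -n 1
}

core=$(latest_vmcore)
if [ -n "$core" ]; then
    echo "Newest dump: $core"
    # The debuginfo vmlinux must match the kernel that crashed; this
    # assumes the machine rebooted into that same kernel version:
    #   crash "/usr/lib/debug/lib/modules/$(uname -r)/vmlinux" "$core"
else
    echo "No vmcore found under /var/crash"
fi
```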

In my case the issue was quite easy to spot, as the log command of the crash tool showed the SysRq-triggered crash:

SysRq : Trigger a crash

The bt command also revealed the same thing:

KERNEL-MODE EXCEPTION FRAME AT: ffff8801193238d8
[exception RIP: sysrq_handle_crash+22]
RIP: ffffffff81321d66  RSP: ffff880119323e18  RFLAGS: 00010096
RAX: 0000000000000010  RBX: 0000000000000063  RCX: 0000000000002388
RDX: 0000000000000000  RSI: 0000000000000000  RDI: 0000000000000063
RBP: ffff880119323e18   R8: 0000000000000000   R9: ffffffff8163ac60
R10: 0000000000000001  R11: 0000000000000000  R12: 0000000000000000
R13: ffffffff81afb7a0  R14: 0000000000000286  R15: 0000000000000004
ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018

There are other commands that you can run with the crash utility. Type help at the crash prompt to get the full list.

See also some screens while booting the crash kernel after the panic.