Kernel crash dump session write-up - April 2017 meetup
This is a write-up of the kernel crash dump session contributed by Nitin Bansal. The original email is here.
I attended the Pune Kernel Meetup on 8th April and presented a session on Kernel crash dump (vmcore) analysis. In this write-up I have tried to explain a few things about the vmcore analysis, hope it helps.
Before starting with the vmcore analysis, we need following three things:
- crash utility [1]
- vmlinux
- vmcore
vmlinux can be obtained from kernel-debuginfo (for Red Hat Enterprise Linux (RHEL) at least). One can either install kernel-debuginfo or simply extract vmlinux file using rpm2cpio:
# yum install kernel-debuginfo-2.6.32-642.13.1.el6.x86_64
Or
# rpm2cpio kernel-debuginfo-2.6.32-642.13.1.el6.x86_64.rpm | cpio -idv ./usr/lib/debug/lib/modules/2.6.32-642.13.1.el6.x86_64/vmlinux
vmcore was collected from kernel-2.6.32-642.13.1.el6 so vmlinux from debuginfo package of same version is required.
Now let's use "crash" utility to open this vmcore, which after printing a few details i.e. copyright information, GDB version information etc drops us to "crash>" prompt:
# crash /usr/lib/debug/lib/modules/2.6.32-642.13.1.el6.x86_64/vmlinux vmcore
crash 7.1.0-6.el6 .. GNU gdb (GDB) 7.6 Copyright (C) 2013 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-unknown-linux-gnu"... KERNEL: /usr/lib/debug/lib/modules/2.6.32-642.13.1.el6.x86_64/vmlinux DUMPFILE: vmcore [PARTIAL DUMP] CPUS: 2 DATE: Fri Mar 31 11:15:47 2017 UPTIME: 00:06:01 LOAD AVERAGE: 0.40, 0.30, 0.13 TASKS: 232 NODENAME: rhel6 RELEASE: 2.6.32-642.13.1.el6.x86_64 VERSION: #1 SMP Wed Nov 23 16:03:01 EST 2016 MACHINE: x86_64 (2808 Mhz) MEMORY: 2 GB PANIC: "Kernel panic - not syncing: hung_task: blocked tasks" PID: 46 COMMAND: "khungtaskd" TASK: ffff880078d16ab0 [THREAD_INFO: ffff880078d20000] CPU: 1 STATE: TASK_RUNNING (PANIC)
crash>
"sys" command can be used to print most of the information that crash prints
when invoked e.g.:
crash> sys KERNEL: /usr/lib/debug/lib/modules/2.6.32-642.13.1.el6.x86_64/vmlinux DUMPFILE: vmcore [PARTIAL DUMP] CPUS: 2 DATE: Fri Mar 31 11:15:47 2017 UPTIME: 00:06:01 LOAD AVERAGE: 0.40, 0.30, 0.13 TASKS: 232 NODENAME: rhel6 RELEASE: 2.6.32-642.13.1.el6.x86_64 VERSION: #1 SMP Wed Nov 23 16:03:01 EST 2016 MACHINE: x86_64 (2808 Mhz) MEMORY: 2 GB PANIC: "Kernel panic - not syncing: hung_task: blocked tasks"
To list all the available crash commands, type "?" and hit enter:
crash> ?
* files mach repeat timer alias foreach mod runq tree ascii fuser mount search union bt gdb net set vm btop help p sig vtop dev ipcs ps struct waitq dis irq pte swap whatis eval kmem ptob sym wr exit list ptov sys q extend log rd task
Use "help" to get details about a command and parameters it accepts e.g.:
crash> help sys
Now, let's examine kernel ring buffer (that's where we find panic message) with the help of "log" command.
And following is the panic message, which tells us that khungtaskd crashed the system:
Call Trace: <#DB[1]> <<EOE>> Pid: 46, comm: khungtaskd Not tainted 2.6.32-642.13.1.el6.x86_64 #1 Call Trace: <NMI> [<ffffffff81009c47>] ? show_regs+0x27/0x30 [<ffffffff8154d909>] ? arch_trigger_all_cpu_backtrace_handler+0x99/0xc0 [<ffffffff8154f1a5>] ? notifier_call_chain+0x55/0x80 [<ffffffff8154f20a>] ? atomic_notifier_call_chain+0x1a/0x20 [<ffffffff810acc8e>] ? notify_die+0x2e/0x30 [<ffffffff8154ce23>] ? do_nmi+0x1c3/0x350 [<ffffffff8154c6e3>] ? nmi+0x83/0x90 [<ffffffff8104615a>] ? native_write_msr_safe+0xa/0x10 <<EOE>> [<ffffffff8103e92c>] ? x2apic_send_IPI_mask+0x6c/0xb0 [<ffffffff81039b9a>] ? arch_trigger_all_cpu_backtrace+0x27a/0x2c0 [<ffffffff810f2740>] ? watchdog+0x250/0x2a0 [<ffffffff810f24f0>] ? watchdog+0x0/0x2a0 [<ffffffff810a640e>] ? kthread+0x9e/0xc0 [<ffffffff8100c28a>] ? child_rip+0xa/0x20 [<ffffffff810a6370>] ? kthread+0x0/0xc0 [<ffffffff8100c280>] ? child_rip+0x0/0x20 Kernel panic - not syncing: hung_task: blocked tasks Pid: 46, comm: khungtaskd Not tainted 2.6.32-642.13.1.el6.x86_64 #1 Call Trace: [<ffffffff81548301>] ? panic+0xa7/0x179 [<ffffffff81039aa5>] ? arch_trigger_all_cpu_backtrace+0x185/0x2c0 [<ffffffff810f274e>] ? watchdog+0x25e/0x2a0 [<ffffffff810f24f0>] ? watchdog+0x0/0x2a0 [<ffffffff810a640e>] ? kthread+0x9e/0xc0 [<ffffffff8100c28a>] ? child_rip+0xa/0x20 [<ffffffff810a6370>] ? kthread+0x0/0xc0 [<ffffffff8100c280>] ? child_rip+0x0/0x20
khungtaskd is a kernel thread which checks for uninterruptible tasks, few relevant snips from kernel/hung_task.c:
static int __init hung_task_init(void) { atomic_notifier_chain_register(&panic_notifier_list, &panic_block); watchdog_task = kthread_run(watchdog, NULL, "khungtaskd"); return 0; } .. static int watchdog(void *dummy) { set_user_nice(current, 0); for ( ; ; ) { unsigned long timeout = sysctl_hung_task_timeout_secs; while (schedule_timeout_interruptible(timeout_jiffies(timeout))) timeout = sysctl_hung_task_timeout_secs; check_hung_uninterruptible_tasks(timeout); } return 0; }
By scrolling back a bit in ring buffer a hung_task_timeout_secs warning can be seen along with the following backtrace:
1 > INFO: task dd:7224 blocked for more than 10 seconds. 2 > Not tainted 2.6.32-642.13.1.el6.x86_64 #1 3 > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 4> dd D 0000000000000001 0 7224 6702 0x00000080 ffff88007c473ce8 0000000000000082 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 ffff880037b71068 ffff88007c473fd8 5> Call Trace: [<ffffffff810a6bce>] ? prepare_to_wait+0x4e/0x80 [<ffffffff8119bdfc>] __sb_start_write+0xdc/0x120 [<ffffffff810a68a0>] ? autoremove_wake_function+0x0/0x40 [<ffffffff811bcc14>] mnt_want_write+0x24/0x50 [<ffffffff8119b348>] ? get_empty_filp+0xe8/0x190 [<ffffffff811aca99>] do_filp_open+0x2b9/0xd20 [<ffffffff812a885a>] ? strncpy_from_user+0x4a/0x90 [<ffffffff811ba072>] ? alloc_fd+0x92/0x160 [<ffffffff811969f7>] do_sys_open+0x67/0x130 [<ffffffff81196b00>] sys_open+0x20/0x30 [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
Let's look at it line by line:
- Line 1 shows that dd PID 7224 was blocked for more than 10 seconds.
- Line 2 tells us the kernel version and that it is not tainted (i.e. no third
party modules are loaded etc.).
- Line 3 is just telling us how to disable this message, not a great idea IMO.
- Line 4 contains command name, state, PID etc.
- Line 5 shows the associated call trace which suggests that dd became
uninterruptible while trying to write to a filesystem.
Process details can be checked using "ps":
crash> ps -m | grep UN [0 00:00:14.117] [UN] PID: 7224 TASK: ffff880037b70ab0 CPU: 1 COMMAND: "dd" ^^^^^^^^^^^^^^^^ |_____This is the duration of dd in uninterruptible state (14.117 seconds)
-m switch displays timestamp in days, hours, minutes, seconds and milliseconds.
Next step is to examine dd process, in order to do that, we first need to change the process context (currently it is set to khungtaskd and needs to be switched to dd), which can be done using "set":
crash> set 7224 PID: 7224 COMMAND: "dd" TASK: ffff880037b70ab0 [THREAD_INFO: ffff88007c470000] CPU: 1 STATE: TASK_UNINTERRUPTIBLE
Let's print backtrace along with kernel source line numbers (source line numbers probably wouldn't be displayed if kernel-debuginfo is not installed):
crash> bt -l PID: 7224 TASK: ffff880037b70ab0 CPU: 1 COMMAND: "dd" #0 [ffff88007c473c18] schedule at ffffffff815489d0 /usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/kernel/sched.c: 3119 #1 [ffff88007c473cf0] __sb_start_write at ffffffff8119bdfc /usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/fs/super.c: 677 #2 [ffff88007c473d80] mnt_want_write at ffffffff811bcc14 /usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/fs/namespace.c: 300 #3 [ffff88007c473db0] do_filp_open at ffffffff811aca99 /usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/fs/namei.c: 2267 #4 [ffff88007c473f20] do_sys_open at ffffffff811969f7 /usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/fs/open.c: 908 #5 [ffff88007c473f70] sys_open at ffffffff81196b00 /usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/fs/open.c: 932 #6 [ffff88007c473f80] system_call_fastpath at ffffffff8100b0d2 /usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/arch/x86/kernel/entry_64.S: 492 RIP: 000000358a8db480 RSP: 00007ffcd7d37ca8 RFLAGS: 00010246 RAX: 0000000000000002 RBX: ffffffff8100b0d2 RCX: 0000000000000000 RDX: 00000000000001b6 RSI: 0000000000000241 RDI: 00007ffcd7d3870b RBP: 0000000000000240 R8: 000000358ab8ee40 R9: 00007ffaa33c368c R10: 1999999999999999 R11: 0000000000000246 R12: ffffffff81196b00 R13: ffff88007c473f78 R14: ffffffffffffffff R15: 0000000000000064 ORIG_RAX: 0000000000000002 CS: 0033 SS: 002b
This backtrace needs to be read from bottom to top, following chain shows how these functions were called:
921 SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, int, mode) 922 { 923 long ret; 924 925 if (force_o_largefile()) 926 flags |= O_LARGEFILE; 927 928 ret = do_sys_open(AT_FDCWD, filename, flags, mode);-------------------- | ---------------------------------------------------------------- | 899 long do_sys_open(int dfd, const char __user *filename, int flags, int mode) 900 { 901 struct filename *tmp = getname(filename); 902 int fd = PTR_ERR(tmp); 903 904 if (!IS_ERR(tmp)) { 905 fd = get_unused_fd_flags(flags); 906 if (fd >= 0) { 907 struct file *f = do_filp_open(dfd, tmp, flags, mode, 0); ------ | --------------------------------------------------------- | 2166 struct file *do_filp_open(int dfd, struct filename *filename, 2167 int open_flag, int mode, int acc_mode) 2168 { 2169 struct file *filp; .. .. 2256 /* 2257 * This write is needed to ensure that a 2258 * ro->rw transition does not occur between 2259 * the time when the file is created and when 2260 * a permanent write count is taken through 2261 * the 'struct file' in nameidata_to_filp(). 2262 */ 2263 error = mnt_want_write(nd.path.mnt); ---------------------------------- | ----------------------------------------------------------------- | 295 int mnt_want_write(struct vfsmount *m) 296 { 297 int ret; 298 299 sb_start_write(m->mnt_sb); ------------------------------------------- | ------------------------------------------------- | 1569 static inline void sb_start_write(struct super_block *sb) 1570 { 1571 __sb_start_write(sb, SB_FREEZE_WRITE, true); -------------------------- | ---------------------------------------------------------------- | 668 int __sb_start_write(struct super_block *sb, int level, bool wait) 669 { 670 /* Out of tree modules don't use this mechanism */ 671 if (unlikely(!sb_has_new_freeze(sb))) 672 return 1; 673 retry: 674 if (unlikely(sb->s_writers.frozen >= level)) { 675 if (!wait) 676 return 0; 677 wait_event(sb->s_writers.wait_unfrozen, <<<<<<<<<<<<<<< dd was here 678 sb->s_writers.frozen < level); 679 }
In short, mnt_want_write() called sb_start_write() which then called
__sb_start_write(), and dd was waiting at line 677 in __sb_start_write() when
this vmcore was collected.
From this code it is clear that dd was trying to write to a filesystem but somehow it couldn't and became uninterruptible, now we need to find out which filesystem it was and why dd couldn't write to it. Can we do it? Yes, absolutely, if we can determine the first argument i.e. super_block that is passed to __sb_start_write(), it will tell us about the filesystem/mount-point.
If we go back to backtrace, we can see that mnt_want_write() called __sb_start_write() and passed it three parameters, struct super_block (sb), level (SB_FREEZE_WRITE) and a Boolean called wait (true):
#1 [ffff88007c473cf0] __sb_start_write at ffffffff8119bdfc #2 [ffff88007c473d80] mnt_want_write at ffffffff811bcc14
Note: "whatis command can be used to check the function prototype (simply entering the function name or px also does the job):
crash> whatis __sb_start_write int __sb_start_write(struct super_block *, int, bool);
Before we go any further, it is important to know how arguments are passed to the called function. First argument goes to %rdi, second to %rsi, third to %rdx, fourth to %rcx and fifth and sixth to %r8 and %r9 respectively. Armed with this knowledge now we can turn our attention to disassembly. Let's disassemble mnt_start_write to find out arguments passed to __sb_start_write:
crash> dis -lr ffffffff811bcc14 /usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/fs/namespace.c: 296 0xffffffff811bcbf0 <mnt_want_write>: push %rbp 0xffffffff811bcbf1 <mnt_want_write+1>: mov %rsp,%rbp 0xffffffff811bcbf4 <mnt_want_write+4>: push %rbx 0xffffffff811bcbf5 <mnt_want_write+5>: sub $0x18,%rsp 0xffffffff811bcbf9 <mnt_want_write+9>: nopl 0x0(%rax,%rax,1) /usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/include/linux/fs.h: 1571 0xffffffff811bcbfe <mnt_want_write+14>: mov $0x1,%edx <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< 3rd argument i.e. wait 0xffffffff811bcc03 <mnt_want_write+19>: mov $0x1,%esi <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< 2nd argument i.e. SB_FREEZE_WRITE /usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/fs/namespace.c: 296 0xffffffff811bcc08 <mnt_want_write+24>: mov %rdi,%rbx /usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/include/linux/fs.h: 1571 0xffffffff811bcc0b <mnt_want_write+27>: mov 0x28(%rdi),%rdi <<<<<<<<<<<<<<<<<<<<<<<<<<<<< first argument i.e. super_block (sb) 0xffffffff811bcc0f <mnt_want_write+31>: callq 0xffffffff8119bd20 <__sb_start_write> /usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/fs/namespace.c: 300 0xffffffff811bcc14 <mnt_want_write+36>: mov %rbx,%rdi
The values of second and third arguments i.e. SB_FREEZE_WRITE and wait are pretty straightforward, both are set to 1.
But to determine super_block which is %rdi we have to back-peddle a little bit, 28th offset of %rdi was copied to %rdi:
mov 0x28(%rdi),%rdi
we need to determine what's inside %rdi and one line above that tells us that %rdi was copied to %rbx.
mov %rdi,%rbx
If we can determine value inside %rbx, it will give us %rdi and then 28th offset of %rdi can be calculated.
To obtain %rdi let's disassemble __sb_start_write and see if %rbx is saved somewhere:
crash> dis -lr ffffffff8119bdfc | head /usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/fs/super.c: 669 0xffffffff8119bd20 <__sb_start_write>: push %rbp 0xffffffff8119bd21 <__sb_start_write+1>: mov %rsp,%rbp 0xffffffff8119bd24 <__sb_start_write+4>: add $0xffffffffffffff80,%rsp 0xffffffff8119bd28 <__sb_start_write+8>: mov %rbx,-0x28(%rbp) <<<<<<< rbx 0xffffffff8119bd2c <__sb_start_write+12>: mov %r12,-0x20(%rbp) 0xffffffff8119bd30 <__sb_start_write+16>: mov %r13,-0x18(%rbp) 0xffffffff8119bd34 <__sb_start_write+20>: mov %r14,-0x10(%rbp) 0xffffffff8119bd38 <__sb_start_write+24>: mov %r15,-0x8(%rbp) 0xffffffff8119bd3c <__sb_start_write+28>: nopl 0x0(%rax,%rax,1)
%rbx is copied to -28th offset of %rbp, and %rbp can be found either in the stack data of __sb_start_write or schedule, I am going to use stack data of schedule for now to make it simpler. Disassembly of schedule (next function after __sb_start_write) suggests that %rbp is pushed to the stack:
crash> dis -lr ffffffff815489d0 | head /usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/kernel/sched.c: 6133 0xffffffff81548630 <schedule>: push %rbp <<<<<<<<<<<<<<<<<<< 1st push 0xffffffff81548631 <schedule+1>: mov %rsp,%rbp 0xffffffff81548634 <schedule+4>: push %r15 0xffffffff81548636 <schedule+6>: push %r14 0xffffffff81548638 <schedule+8>: push %r13 0xffffffff8154863a <schedule+10>: push %r12 0xffffffff8154863c <schedule+12>: push %rbx 0xffffffff8154863d <schedule+13>: sub $0x98,%rsp 0xffffffff81548644 <schedule+20>: nopl 0x0(%rax,%rax,1)
first push at stack will give us %rbp, and bt -f can be used to print stack:
crash> bt -f | grep -A15 schedule #0 [ffff88007c473c18] schedule at ffffffff815489d0 ffff88007c473c20: 0000000000000082 0000000000000000 ffff88007c473c30: 0000000000000000 0000000000000000 ffff88007c473c40: 0000000000000000 0000000000000000 ffff88007c473c50: 0000000000000000 0000000000000000 ffff88007c473c60: 0000000000000000 ffff880037b71068 ffff88007c473c70: ffff88007c473fd8 0000000000010068 ffff88007c473c80: ffff880037b71068 ffff88007e527520 ffff88007c473c90: ffff880037b70ab0 0000000000000002 ffff88007c473ca0: 0000000000000246 ffff88007c473ce8 ffff88007c473cb0: ffffffff810a6bce ffff880037b70ab0 ffff88007c473cc0: ffff8800379e7400 0000000000000001 ffff88007c473cd0: ffff88007c473d18 ffff8800379e76d8 ffff88007c473ce0: ffff8800379e7770 ffff88007c473d78 <<<< ffff88007c473d78 is first push == %rbp ffff88007c473cf0: ffffffff8119bdfc
Now that we have %rbp, %rbx can be calculated i.e. -0x28th offset of %rbp:
crash> eval (0xffff88007c473d78 - 0x28) hexadecimal: ffff88007c473d50 decimal: 18446612134399262032 (-131939310289584) octal: 1777774200017421636520 binary: 1111111111111111100010000000000001111100010001110011110101010000 crash> rd ffff88007c473d50 ffff88007c473d50: ffff88007ef64380 .C.~....
%rbx == ffff88007ef64380
Since %rdi was copied to %rbx, %rdi also is ffff88007ef64380, now if we calculate 28th offset to %rdi we will get super_block:
crash> eval (0xffff88007ef64380 + 0x28) hexadecimal: ffff88007ef643a8 decimal: 18446612134444286888 (-131939265264728) octal: 1777774200017675441650 binary: 1111111111111111100010000000000001111110111101100100001110101000 crash> rd ffff88007ef643a8 ffff88007ef643a8: ffff8800379e7400 .t.7....
super_block == ffff8800379e7400.
super_block can be passed to the "mount" command to get the mount point:
crash> mount ffff8800379e7400 VFSMOUNT SUPERBLK TYPE DEVNAME DIRNAME ffff88007ef64380 ffff8800379e7400 ext4 /dev/vda1 /boot
This tells us that that dd was trying to write to /boot. Next part of our job is to find out why couldn't it write to /boot. In order to answer that, let's take a step back and revisit the kernel code at line 677 (where dd was):
677 wait_event(sb->s_writers.wait_unfrozen, 678 sb->s_writers.frozen < level);
Following wait_event() definition tells us that dd was put to sleep (uninterruptible) and will be woken up when "sb->s_writers.frozen < level" condition becomes true:
/** * wait_event - sleep until a condition gets true * @wq: the waitqueue to wait on * @condition: a C expression for the event to wait for * * The process is put to sleep (TASK_UNINTERRUPTIBLE) until the * @condition evaluates to true. The @condition is checked each time * the waitqueue @wq is woken up. * * wake_up() has to be called after changing any variable that could * change the result of the wait condition. */ #define wait_event(wq, condition) \ do { \ if (condition) \ break; \ __wait_event(wq, condition); \ } while (0)
We know that "level" is 1 (second argument SB_FREEZE_WRITE), and sb->s_writers.frozen needs to be less than 1 for this condition to become true. Let's see what exactly is the current value of sb->s_writers.frozen, since we already have sb, it is pretty straightforward to obtain this value:
crash> pd ((struct super_block *)0xffff8800379e7400)->s_writers.frozen $5 = 4
Or do this (I prefer the 1st one):
crash> super_block.s_writers ffff8800379e7400 | grep frozen frozen = 4,
sb->s_writers.frozen is 4 which obviously is lower than 1 and this is why dd is uninterruptible, but what does this value 4 mean? This is defined in include/linux/fs.h ("cscope" can be used to examine kernel source):
/* Possible states of 'frozen' field */ enum { SB_UNFROZEN = 0, /* FS is unfrozen */ SB_FREEZE_WRITE = 1, /* Writes, dir ops, ioctls frozen */ SB_FREEZE_TRANS = 2, SB_FREEZE_PAGEFAULT = 2, /* Page faults stopped as well */ SB_FREEZE_FS = 3, /* For internal FS use (e.g. to stop * internal threads if needed) */ SB_FREEZE_COMPLETE = 4, /* ->freeze_fs finished successfully */ <<<<<<< };
4 is SB_FREEZE_COMPLETE, which means that filesystem i.e. /boot is frozen. So, now we can say that dd became uninterruptible because /boot filesystem was frozen, this triggered hung_task_timeout_secs which was caught by hung_task_panic and this caused the crash.
Next step should be to check what exactly froze the filesystem, in this particular case I did it using fsfreeze to collect this vmcore so I can demonstrate a few things about vmcore analysis.
In order to collect this demo vmcore I used a sysctl parameter named hung_task_panic. This parameter when enabled will trigger a crash if a process becomes uninterruptible for more than 120 seconds. hung_task_timeout_secs sysctl parameter controls this duration and is set to 120 seconds by default in RHEL 6/7. After enabling hung_task_panic and reducing hung_task_timeout_secs to 10 seconds:
# echo 1 > /proc/sys/kernel/hung_task_panic # echo 10 > /proc/sys/kernel/hung_task_timeout_secs
I simply executed following dd command on /boot:
# dd if=/dev/zero of=/boot/test
And in the second terminal this:
# fsfreeze -f /boot
This made dd uninterruptible, hung_task_panic kicked in and kdump did its job by saving a vmcore under /var/crash.
We can print these values from vmcore:
crash> p sysctl_hung_task_timeout_secs sysctl_hung_task_timeout_secs = $1 = 10
crash> p sysctl_hung_task_panic sysctl_hung_task_panic = $2 = 1
Please send an email to the kernel meetup list for any queries and if this write-up needs any corrections.