Kernel crash dump session write-up - April 2017 meetup

This is a write-up of the kernel crash dump session contributed by Nitin Bansal. The original email is here.

I attended the Pune Kernel Meetup on 8th April and presented a session on Kernel crash dump (vmcore) analysis. In this write-up I have tried to explain a few things about the vmcore analysis, hope it helps.

Before starting with the vmcore analysis, we need following three things:

crash utility [1]
vmlinux
vmcore

vmlinux can be obtained from kernel-debuginfo (for Red Hat Enterprise Linux (RHEL) at least). One can either install kernel-debuginfo or simply extract vmlinux file using rpm2cpio:

# yum install kernel-debuginfo-2.6.32-642.13.1.el6.x86_64

Or

# rpm2cpio kernel-debuginfo-2.6.32-642.13.1.el6.x86_64.rpm | cpio -idv ./usr/lib/debug/lib/modules/2.6.32-642.13.1.el6.x86_64/vmlinux

vmcore was collected from kernel-2.6.32-642.13.1.el6 so vmlinux from debuginfo package of same version is required.

Now let's use "crash" utility to open this vmcore, which after printing a few details i.e. copyright information, GDB version information etc drops us to "crash>" prompt:

# crash /usr/lib/debug/lib/modules/2.6.32-642.13.1.el6.x86_64/vmlinux  vmcore

crash 7.1.0-6.el6
..
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

      KERNEL: /usr/lib/debug/lib/modules/2.6.32-642.13.1.el6.x86_64/vmlinux
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 2
        DATE: Fri Mar 31 11:15:47 2017
      UPTIME: 00:06:01
LOAD AVERAGE: 0.40, 0.30, 0.13
       TASKS: 232
    NODENAME: rhel6
     RELEASE: 2.6.32-642.13.1.el6.x86_64
     VERSION: #1 SMP Wed Nov 23 16:03:01 EST 2016
     MACHINE: x86_64  (2808 Mhz)
      MEMORY: 2 GB
       PANIC: "Kernel panic - not syncing: hung_task: blocked tasks"
         PID: 46
     COMMAND: "khungtaskd"
        TASK: ffff880078d16ab0  [THREAD_INFO: ffff880078d20000]
         CPU: 1
       STATE: TASK_RUNNING (PANIC)

crash>

"sys" command can be used to print most of the information that crash prints when invoked e.g.:

crash> sys
      KERNEL: /usr/lib/debug/lib/modules/2.6.32-642.13.1.el6.x86_64/vmlinux
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 2
        DATE: Fri Mar 31 11:15:47 2017
      UPTIME: 00:06:01
LOAD AVERAGE: 0.40, 0.30, 0.13
       TASKS: 232
    NODENAME: rhel6
     RELEASE: 2.6.32-642.13.1.el6.x86_64
     VERSION: #1 SMP Wed Nov 23 16:03:01 EST 2016
     MACHINE: x86_64  (2808 Mhz)
      MEMORY: 2 GB
       PANIC: "Kernel panic - not syncing: hung_task: blocked tasks"

To list all the available crash commands, type "?" and hit enter:

crash> ?

*              files          mach           repeat         timer          
alias          foreach        mod            runq           tree           
ascii          fuser          mount          search         union          
bt             gdb            net            set            vm             
btop           help           p              sig            vtop           
dev            ipcs           ps             struct         waitq          
dis            irq            pte            swap           whatis         
eval           kmem           ptob           sym            wr             
exit           list           ptov           sys            q              
extend         log            rd             task

Use "help" to get details about a command and parameters it accepts e.g.:

crash> help sys

Now, let's examine kernel ring buffer (that's where we find panic message) with the help of "log" command.

And following is the panic message, which tells us that khungtaskd crashed the system:

Call Trace:
 <#DB[1]>  <<EOE>> Pid: 46, comm: khungtaskd Not tainted 2.6.32-642.13.1.el6.x86_64 #1
Call Trace:
 <NMI>  [<ffffffff81009c47>] ? show_regs+0x27/0x30
 [<ffffffff8154d909>] ? arch_trigger_all_cpu_backtrace_handler+0x99/0xc0
 [<ffffffff8154f1a5>] ? notifier_call_chain+0x55/0x80
 [<ffffffff8154f20a>] ? atomic_notifier_call_chain+0x1a/0x20
 [<ffffffff810acc8e>] ? notify_die+0x2e/0x30
 [<ffffffff8154ce23>] ? do_nmi+0x1c3/0x350
 [<ffffffff8154c6e3>] ? nmi+0x83/0x90
 [<ffffffff8104615a>] ? native_write_msr_safe+0xa/0x10
 <<EOE>>  [<ffffffff8103e92c>] ? x2apic_send_IPI_mask+0x6c/0xb0
 [<ffffffff81039b9a>] ? arch_trigger_all_cpu_backtrace+0x27a/0x2c0
 [<ffffffff810f2740>] ? watchdog+0x250/0x2a0
 [<ffffffff810f24f0>] ? watchdog+0x0/0x2a0
 [<ffffffff810a640e>] ? kthread+0x9e/0xc0
 [<ffffffff8100c28a>] ? child_rip+0xa/0x20
 [<ffffffff810a6370>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20
Kernel panic - not syncing: hung_task: blocked tasks
Pid: 46, comm: khungtaskd Not tainted 2.6.32-642.13.1.el6.x86_64 #1
Call Trace:
 [<ffffffff81548301>] ? panic+0xa7/0x179
 [<ffffffff81039aa5>] ? arch_trigger_all_cpu_backtrace+0x185/0x2c0
 [<ffffffff810f274e>] ? watchdog+0x25e/0x2a0
 [<ffffffff810f24f0>] ? watchdog+0x0/0x2a0
 [<ffffffff810a640e>] ? kthread+0x9e/0xc0
 [<ffffffff8100c28a>] ? child_rip+0xa/0x20
 [<ffffffff810a6370>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20

khungtaskd is a kernel thread which checks for uninterruptible tasks, few relevant snips from kernel/hung_task.c:

static int __init hung_task_init(void)
{
        atomic_notifier_chain_register(&panic_notifier_list, &panic_block);
        watchdog_task = kthread_run(watchdog, NULL, "khungtaskd");

        return 0;
}

..

static int watchdog(void *dummy)
{
        set_user_nice(current, 0); 
 
        for ( ; ; ) {
                unsigned long timeout = sysctl_hung_task_timeout_secs;      

                while (schedule_timeout_interruptible(timeout_jiffies(timeout)))
                        timeout = sysctl_hung_task_timeout_secs; 

                check_hung_uninterruptible_tasks(timeout);
        } 

        return 0;
}

By scrolling back a bit in ring buffer a hung_task_timeout_secs warning can be seen along with the following backtrace:

1 > INFO: task dd:7224 blocked for more than 10 seconds.
2 >      Not tainted 2.6.32-642.13.1.el6.x86_64 #1
3 > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
4> dd            D 0000000000000001     0  7224   6702 0x00000080
 ffff88007c473ce8 0000000000000082 0000000000000000 0000000000000000
 0000000000000000 0000000000000000 0000000000000000 0000000000000000
 0000000000000000 0000000000000000 ffff880037b71068 ffff88007c473fd8
5> Call Trace:
 [<ffffffff810a6bce>] ? prepare_to_wait+0x4e/0x80
 [<ffffffff8119bdfc>] __sb_start_write+0xdc/0x120
 [<ffffffff810a68a0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff811bcc14>] mnt_want_write+0x24/0x50
 [<ffffffff8119b348>] ? get_empty_filp+0xe8/0x190
 [<ffffffff811aca99>] do_filp_open+0x2b9/0xd20
 [<ffffffff812a885a>] ? strncpy_from_user+0x4a/0x90
 [<ffffffff811ba072>] ? alloc_fd+0x92/0x160
 [<ffffffff811969f7>] do_sys_open+0x67/0x130
 [<ffffffff81196b00>] sys_open+0x20/0x30
 [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b

Let's look at it line by line:

Line 1 shows that dd PID 7224 was blocked for more than 10 seconds.
Line 2 tells us the kernel version and that it is not tainted (i.e. no third

party modules are loaded etc.).

Line 3 is just telling us how to disable this message, not a great idea IMO.
Line 4 contains command name, state, PID etc.
Line 5 shows the associated call trace which suggests that dd became

uninterruptible while trying to write to a filesystem.

Process details can be checked using "ps":

crash> ps -m | grep UN
[0 00:00:14.117] [UN]  PID: 7224   TASK: ffff880037b70ab0  CPU: 1   COMMAND: "dd"
^^^^^^^^^^^^^^^^
     |_____This is the duration of dd in uninterruptible state (14.117 seconds)

-m switch displays timestamp in days, hours, minutes, seconds and milliseconds.

Next step is to examine dd process, in order to do that, we first need to change the process context (currently it is set to khungtaskd and needs to be switched to dd), which can be done using "set":

crash> set 7224
    PID: 7224
COMMAND: "dd"
   TASK: ffff880037b70ab0  [THREAD_INFO: ffff88007c470000]
    CPU: 1
  STATE: TASK_UNINTERRUPTIBLE

Let's print backtrace along with kernel source line numbers (source line numbers probably wouldn't be displayed if kernel-debuginfo is not installed):

crash> bt -l
PID: 7224   TASK: ffff880037b70ab0  CPU: 1   COMMAND: "dd"
 #0 [ffff88007c473c18] schedule at ffffffff815489d0
    /usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/kernel/sched.c: 3119
 #1 [ffff88007c473cf0] __sb_start_write at ffffffff8119bdfc
    /usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/fs/super.c: 677
 #2 [ffff88007c473d80] mnt_want_write at ffffffff811bcc14
    /usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/fs/namespace.c: 300
 #3 [ffff88007c473db0] do_filp_open at ffffffff811aca99
    /usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/fs/namei.c: 2267
 #4 [ffff88007c473f20] do_sys_open at ffffffff811969f7
    /usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/fs/open.c: 908
 #5 [ffff88007c473f70] sys_open at ffffffff81196b00
    /usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/fs/open.c: 932
 #6 [ffff88007c473f80] system_call_fastpath at ffffffff8100b0d2
    /usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/arch/x86/kernel/entry_64.S: 492
    RIP: 000000358a8db480  RSP: 00007ffcd7d37ca8  RFLAGS: 00010246
    RAX: 0000000000000002  RBX: ffffffff8100b0d2  RCX: 0000000000000000
    RDX: 00000000000001b6  RSI: 0000000000000241  RDI: 00007ffcd7d3870b
    RBP: 0000000000000240   R8: 000000358ab8ee40   R9: 00007ffaa33c368c
    R10: 1999999999999999  R11: 0000000000000246  R12: ffffffff81196b00
    R13: ffff88007c473f78  R14: ffffffffffffffff  R15: 0000000000000064
    ORIG_RAX: 0000000000000002  CS: 0033  SS: 002b

This backtrace needs to be read from bottom to top, following chain shows how these functions were called:

921 SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, int, mode)
922 {
923     long ret;
924 
925     if (force_o_largefile())
926         flags |= O_LARGEFILE;
927 
928     ret = do_sys_open(AT_FDCWD, filename, flags, mode);--------------------
                                                                               | 
               ----------------------------------------------------------------
               |                                         
899 long do_sys_open(int dfd, const char __user *filename, int flags, int mode)
900 {
901     struct filename *tmp = getname(filename);
902     int fd = PTR_ERR(tmp);
903 
904     if (!IS_ERR(tmp)) {
905         fd = get_unused_fd_flags(flags);
906         if (fd >= 0) {
907             struct file *f = do_filp_open(dfd, tmp, flags, mode, 0); ------
                                                                               |
                      ---------------------------------------------------------
                     |
2166 struct file *do_filp_open(int dfd, struct filename *filename,
2167         int open_flag, int mode, int acc_mode)
2168 {
2169     struct file *filp;
..
..
2256     /*
2257      * This write is needed to ensure that a
2258      * ro->rw transition does not occur between
2259      * the time when the file is created and when
2260      * a permanent write count is taken through
2261      * the 'struct file' in nameidata_to_filp().
2262      */
2263     error = mnt_want_write(nd.path.mnt); ----------------------------------
                                                                               |
              -----------------------------------------------------------------
             |
295 int mnt_want_write(struct vfsmount *m)
296 {
297     int ret;
298 
299     sb_start_write(m->mnt_sb);  -------------------------------------------
                                                                               |
                              -------------------------------------------------
                             |
1569 static inline void sb_start_write(struct super_block *sb)
1570 {
1571     __sb_start_write(sb, SB_FREEZE_WRITE, true); --------------------------
                                                                               |
               ----------------------------------------------------------------
              |
668 int __sb_start_write(struct super_block *sb, int level, bool wait)
669 {   
670     /* Out of tree modules don't use this mechanism */
671     if (unlikely(!sb_has_new_freeze(sb)))
672         return 1;
673 retry:
674     if (unlikely(sb->s_writers.frozen >= level)) {
675         if (!wait)
676             return 0;
677         wait_event(sb->s_writers.wait_unfrozen,  <<<<<<<<<<<<<<< dd was here
678                sb->s_writers.frozen < level);
679     }

In short, mnt_want_write() called sb_start_write() which then called __sb_start_write(), and dd was waiting at line 677 in __sb_start_write() when this vmcore was collected.

From this code it is clear that dd was trying to write to a filesystem but somehow it couldn't and became uninterruptible, now we need to find out which filesystem it was and why dd couldn't write to it. Can we do it? Yes, absolutely, if we can determine the first argument i.e. super_block that is passed to __sb_start_write(), it will tell us about the filesystem/mount-point.

If we go back to backtrace, we can see that mnt_want_write() called __sb_start_write() and passed it three parameters, struct super_block (sb), level (SB_FREEZE_WRITE) and a Boolean called wait (true):

#1 [ffff88007c473cf0] __sb_start_write at ffffffff8119bdfc
#2 [ffff88007c473d80] mnt_want_write at ffffffff811bcc14

Note: "whatis command can be used to check the function prototype (simply entering the function name or px also does the job):

crash> whatis __sb_start_write
int __sb_start_write(struct super_block *, int, bool);

Before we go any further, it is important to know how arguments are passed to the called function. First argument goes to %rdi, second to %rsi, third to %rdx, fourth to %rcx and fifth and sixth to %r8 and %r9 respectively. Armed with this knowledge now we can turn our attention to disassembly. Let's disassemble mnt_start_write to find out arguments passed to __sb_start_write:

crash> dis -lr ffffffff811bcc14 
/usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/fs/namespace.c: 296
0xffffffff811bcbf0 <mnt_want_write>:    push   %rbp
0xffffffff811bcbf1 <mnt_want_write+1>:  mov    %rsp,%rbp
0xffffffff811bcbf4 <mnt_want_write+4>:  push   %rbx
0xffffffff811bcbf5 <mnt_want_write+5>:  sub    $0x18,%rsp
0xffffffff811bcbf9 <mnt_want_write+9>:  nopl   0x0(%rax,%rax,1)
/usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/include/linux/fs.h: 1571
0xffffffff811bcbfe <mnt_want_write+14>: mov    $0x1,%edx  <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< 3rd argument i.e. wait
0xffffffff811bcc03 <mnt_want_write+19>: mov    $0x1,%esi  <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< 2nd argument i.e. SB_FREEZE_WRITE
/usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/fs/namespace.c: 296
0xffffffff811bcc08 <mnt_want_write+24>: mov    %rdi,%rbx 
/usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/include/linux/fs.h: 1571
0xffffffff811bcc0b <mnt_want_write+27>: mov    0x28(%rdi),%rdi <<<<<<<<<<<<<<<<<<<<<<<<<<<<< first argument i.e. super_block (sb)
0xffffffff811bcc0f <mnt_want_write+31>: callq  0xffffffff8119bd20 <__sb_start_write>
/usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/fs/namespace.c: 300
0xffffffff811bcc14 <mnt_want_write+36>: mov    %rbx,%rdi

The values of second and third arguments i.e. SB_FREEZE_WRITE and wait are pretty straightforward, both are set to 1.

But to determine super_block which is %rdi we have to back-peddle a little bit, 28th offset of %rdi was copied to %rdi:

mov    0x28(%rdi),%rdi

we need to determine what's inside %rdi and one line above that tells us that %rdi was copied to %rbx.

mov    %rdi,%rbx

If we can determine value inside %rbx, it will give us %rdi and then 28th offset of %rdi can be calculated.

To obtain %rdi let's disassemble __sb_start_write and see if %rbx is saved somewhere:

crash> dis -lr ffffffff8119bdfc | head
/usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/fs/super.c: 669
0xffffffff8119bd20 <__sb_start_write>:	push   %rbp
0xffffffff8119bd21 <__sb_start_write+1>:	mov    %rsp,%rbp
0xffffffff8119bd24 <__sb_start_write+4>:	add    $0xffffffffffffff80,%rsp
0xffffffff8119bd28 <__sb_start_write+8>:	mov    %rbx,-0x28(%rbp)   <<<<<<< rbx
0xffffffff8119bd2c <__sb_start_write+12>:	mov    %r12,-0x20(%rbp)
0xffffffff8119bd30 <__sb_start_write+16>:	mov    %r13,-0x18(%rbp)
0xffffffff8119bd34 <__sb_start_write+20>:	mov    %r14,-0x10(%rbp)
0xffffffff8119bd38 <__sb_start_write+24>:	mov    %r15,-0x8(%rbp)
0xffffffff8119bd3c <__sb_start_write+28>:	nopl   0x0(%rax,%rax,1)

%rbx is copied to -28th offset of %rbp, and %rbp can be found either in the stack data of __sb_start_write or schedule, I am going to use stack data of schedule for now to make it simpler. Disassembly of schedule (next function after __sb_start_write) suggests that %rbp is pushed to the stack:

crash> dis -lr ffffffff815489d0 | head
/usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/kernel/sched.c: 6133
0xffffffff81548630 <schedule>:	push   %rbp  <<<<<<<<<<<<<<<<<<< 1st push
0xffffffff81548631 <schedule+1>:	mov    %rsp,%rbp
0xffffffff81548634 <schedule+4>:	push   %r15
0xffffffff81548636 <schedule+6>:	push   %r14
0xffffffff81548638 <schedule+8>:	push   %r13
0xffffffff8154863a <schedule+10>:	push   %r12
0xffffffff8154863c <schedule+12>:	push   %rbx
0xffffffff8154863d <schedule+13>:	sub    $0x98,%rsp
0xffffffff81548644 <schedule+20>:	nopl   0x0(%rax,%rax,1)

first push at stack will give us %rbp, and bt -f can be used to print stack:

crash> bt -f | grep -A15 schedule
 #0 [ffff88007c473c18] schedule at ffffffff815489d0
    ffff88007c473c20: 0000000000000082 0000000000000000 
    ffff88007c473c30: 0000000000000000 0000000000000000 
    ffff88007c473c40: 0000000000000000 0000000000000000 
    ffff88007c473c50: 0000000000000000 0000000000000000 
    ffff88007c473c60: 0000000000000000 ffff880037b71068 
    ffff88007c473c70: ffff88007c473fd8 0000000000010068 
    ffff88007c473c80: ffff880037b71068 ffff88007e527520 
    ffff88007c473c90: ffff880037b70ab0 0000000000000002 
    ffff88007c473ca0: 0000000000000246 ffff88007c473ce8 
    ffff88007c473cb0: ffffffff810a6bce ffff880037b70ab0 
    ffff88007c473cc0: ffff8800379e7400 0000000000000001 
    ffff88007c473cd0: ffff88007c473d18 ffff8800379e76d8 
    ffff88007c473ce0: ffff8800379e7770 ffff88007c473d78 <<<< ffff88007c473d78 is first push == %rbp
    ffff88007c473cf0: ffffffff8119bdfc

Now that we have %rbp, %rbx can be calculated i.e. -0x28th offset of %rbp:

crash> eval (0xffff88007c473d78 - 0x28)
hexadecimal: ffff88007c473d50  
    decimal: 18446612134399262032  (-131939310289584)
      octal: 1777774200017421636520
     binary: 1111111111111111100010000000000001111100010001110011110101010000
crash> rd ffff88007c473d50
ffff88007c473d50:  ffff88007ef64380                    .C.~....

%rbx == ffff88007ef64380

Since %rdi was copied to %rbx, %rdi also is ffff88007ef64380, now if we calculate 28th offset to %rdi we will get super_block:

crash> eval (0xffff88007ef64380 + 0x28)
hexadecimal: ffff88007ef643a8  
    decimal: 18446612134444286888  (-131939265264728)
      octal: 1777774200017675441650
     binary: 1111111111111111100010000000000001111110111101100100001110101000
crash> rd ffff88007ef643a8
ffff88007ef643a8:  ffff8800379e7400                    .t.7....

super_block == ffff8800379e7400.

super_block can be passed to the "mount" command to get the mount point:

crash> mount ffff8800379e7400
    VFSMOUNT         SUPERBLK     TYPE   DEVNAME   DIRNAME
ffff88007ef64380 ffff8800379e7400 ext4   /dev/vda1 /boot

This tells us that that dd was trying to write to /boot. Next part of our job is to find out why couldn't it write to /boot. In order to answer that, let's take a step back and revisit the kernel code at line 677 (where dd was):

677         wait_event(sb->s_writers.wait_unfrozen, 
678                sb->s_writers.frozen < level);

Following wait_event() definition tells us that dd was put to sleep (uninterruptible) and will be woken up when "sb->s_writers.frozen < level" condition becomes true:

/**
 * wait_event - sleep until a condition gets true
 * @wq: the waitqueue to wait on
 * @condition: a C expression for the event to wait for
 *
 * The process is put to sleep (TASK_UNINTERRUPTIBLE) until the
 * @condition evaluates to true. The @condition is checked each time
 * the waitqueue @wq is woken up.
 *
 * wake_up() has to be called after changing any variable that could
 * change the result of the wait condition.
 */
#define wait_event(wq, condition)                   \
do {                                    \
    if (condition)                          \
        break;                          \
    __wait_event(wq, condition);                    \
} while (0)

We know that "level" is 1 (second argument SB_FREEZE_WRITE), and sb->s_writers.frozen needs to be less than 1 for this condition to become true. Let's see what exactly is the current value of sb->s_writers.frozen, since we already have sb, it is pretty straightforward to obtain this value:

crash> pd ((struct super_block *)0xffff8800379e7400)->s_writers.frozen
$5 = 4

Or do this (I prefer the 1st one):

crash> super_block.s_writers ffff8800379e7400 | grep frozen
    frozen = 4,

sb->s_writers.frozen is 4 which obviously is lower than 1 and this is why dd is uninterruptible, but what does this value 4 mean? This is defined in include/linux/fs.h ("cscope" can be used to examine kernel source):

/* Possible states of 'frozen' field */
enum {
    SB_UNFROZEN = 0,        /* FS is unfrozen */
    SB_FREEZE_WRITE = 1,        /* Writes, dir ops, ioctls frozen */
    SB_FREEZE_TRANS = 2,
    SB_FREEZE_PAGEFAULT = 2,    /* Page faults stopped as well */
    SB_FREEZE_FS = 3,       /* For internal FS use (e.g. to stop
                     * internal threads if needed) */
    SB_FREEZE_COMPLETE = 4,     /* ->freeze_fs finished successfully */  <<<<<<<
};

4 is SB_FREEZE_COMPLETE, which means that filesystem i.e. /boot is frozen. So, now we can say that dd became uninterruptible because /boot filesystem was frozen, this triggered hung_task_timeout_secs which was caught by hung_task_panic and this caused the crash.

Next step should be to check what exactly froze the filesystem, in this particular case I did it using fsfreeze to collect this vmcore so I can demonstrate a few things about vmcore analysis.

In order to collect this demo vmcore I used a sysctl parameter named hung_task_panic. This parameter when enabled will trigger a crash if a process becomes uninterruptible for more than 120 seconds. hung_task_timeout_secs sysctl parameter controls this duration and is set to 120 seconds by default in RHEL 6/7. After enabling hung_task_panic and reducing hung_task_timeout_secs to 10 seconds:

# echo 1 > /proc/sys/kernel/hung_task_panic
# echo 10 > /proc/sys/kernel/hung_task_timeout_secs

I simply executed following dd command on /boot:

# dd if=/dev/zero of=/boot/test

And in the second terminal this:

# fsfreeze -f /boot

This made dd uninterruptible, hung_task_panic kicked in and kdump did its job by saving a vmcore under /var/crash.

We can print these values from vmcore:

crash> p sysctl_hung_task_timeout_secs
sysctl_hung_task_timeout_secs = $1 = 10

crash> p sysctl_hung_task_panic
sysctl_hung_task_panic = $2 = 1

Please send an email to the kernel meetup list for any queries and if this write-up needs any corrections.

Kernel crash dump session write-up - April 2017 meetup

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Tools