Bug #8323
closed[API] Puma hangs forever on a futex, requiring restart
0%
Description
puma process is behaving like the following:
top - 20:22:26 up 52 days, 7:39, 1 user, load average: 2.47, 2.36, 2.42 Threads: 7 total, 1 running, 6 sleeping, 0 stopped, 0 zombie %Cpu(s): 31.0 us, 2.2 sy, 0.0 ni, 65.4 id, 1.3 wa, 0.0 hi, 0.1 si, 0.0 st KiB Mem: 28811768 total, 27909092 used, 902676 free, 155872 buffers KiB Swap: 0 total, 0 used, 0 free. 21708188 cached Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ nTH WCHAN COMMAND 2395 www-data 20 0 3310868 856216 11496 R 99.2 3.0 2525:32 7 - ruby 2279 www-data 20 0 3310868 856216 11496 S 0.0 3.0 0:03.06 7 futex_wai+ ruby 2360 www-data 20 0 3310868 856216 11496 S 0.0 3.0 0:23.95 7 poll_sche+ ruby-timer-thr 2404 www-data 20 0 3310868 856216 11496 S 0.0 3.0 0:02.41 7 poll_sche+ ruby 2405 www-data 20 0 3310868 856216 11496 S 0.0 3.0 0:09.23 7 futex_wai+ ruby 2406 www-data 20 0 3310868 856216 11496 S 0.0 3.0 0:00.10 7 poll_sche+ ruby 2534 www-data 20 0 3310868 856216 11496 S 0.0 3.0 1:49.18 7 futex_wai+ ruby
Look at WCHAN
that has nothing in a process that has almost 100% of CPU
I'm comparing this to 2 healthy puma processes in other API server:
top - 20:24:08 up 57 days, 22:03, 1 user, load average: 1.79, 1.64, 1.45 Threads: 7 total, 0 running, 7 sleeping, 0 stopped, 0 zombie %Cpu(s): 18.2 us, 1.4 sy, 0.0 ni, 80.1 id, 0.3 wa, 0.0 hi, 0.0 si, 0.0 st KiB Mem: 28811768 total, 28386492 used, 425276 free, 126632 buffers KiB Swap: 0 total, 0 used, 0 free. 22583144 cached Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ nTH WCHAN COMMAND 1880 www-data 20 0 725132 144572 10940 S 11.0 0.5 825:43.18 7 poll_sche+ ruby 1404 www-data 20 0 725132 144572 10940 S 0.0 0.5 0:03.18 7 futex_wai+ ruby 1619 www-data 20 0 725132 144572 10940 S 0.0 0.5 0:04.50 7 poll_sche+ ruby-timer-thr 1886 www-data 20 0 725132 144572 10940 S 0.0 0.5 0:10.08 7 poll_sche+ ruby 1887 www-data 20 0 725132 144572 10940 S 0.0 0.5 1:27.25 7 futex_wai+ ruby 1888 www-data 20 0 725132 144572 10940 S 0.0 0.5 0:00.33 7 poll_sche+ ruby 1935 www-data 20 0 725132 144572 10940 S 0.0 0.5 0:33.05 7 poll_sche+ ruby
top - 20:24:45 up 52 days, 23:37, 1 user, load average: 0.56, 0.70, 0.67 Threads: 7 total, 0 running, 7 sleeping, 0 stopped, 0 zombie %Cpu(s): 2.5 us, 0.8 sy, 0.0 ni, 96.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st KiB Mem: 28811768 total, 13967572 used, 14844196 free, 162308 buffers KiB Swap: 0 total, 0 used, 0 free. 10219944 cached Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ nTH WCHAN COMMAND 14792 www-data 20 0 928992 263040 11348 S 0.0 0.9 0:02.88 7 futex_wai+ ruby 14876 www-data 20 0 928992 263040 11348 S 0.0 0.9 0:02.19 7 poll_sche+ ruby-timer-thr 14907 www-data 20 0 928992 263040 11348 S 0.0 0.9 413:08.24 7 poll_sche+ ruby 14918 www-data 20 0 928992 263040 11348 S 0.0 0.9 0:05.22 7 poll_sche+ ruby 14919 www-data 20 0 928992 263040 11348 S 0.0 0.9 1:28.96 7 futex_wai+ ruby 14920 www-data 20 0 928992 263040 11348 S 0.0 0.9 0:00.44 7 poll_sche+ ruby 14952 www-data 20 0 928992 263040 11348 S 0.0 0.9 0:25.55 7 poll_sche+ ruby
my feeling is that the master puma thread is doing something that will never ask the kernel to wait for a resources ( something like while true: pass
will cause this)
making 1 core hot and thus delaying any other processes in that box.
There could be the possibility that the "worker" threads are working and if everything goes well the functionality isn't affected at all. (maybe only in the shutdown process).
just another hint that the process isn't calling anything in kernelspace:
wx7k5:~# cat /proc/2395/stack [<ffffffffffffffff>] 0xffffffffffffffff
while the others:
su92l:~# cat /proc/14792/stack [<ffffffff810ede0e>] futex_wait_queue_me+0xde/0x140 [<ffffffff810ee9a2>] futex_wait+0x182/0x290 [<ffffffff810f0a8e>] do_futex+0xde/0x5d0 [<ffffffff810f0ff1>] SyS_futex+0x71/0x150 [<ffffffff817b6dcd>] system_call_fastpath+0x16/0x1b [<ffffffffffffffff>] 0xffffffffffffffff
(all libraries are stripped so we have no symbols to do a ptrace. :/ )
How can we debug further this? (and avoid the 100%CPU usage)