Re: Need help to debug freeze on kernel side (somehow related to lxc) [message #41906]
Fri, 05 November 2010 12:41
pva
Thank you, Matt, for your help!
I've changed the subject a bit to make it clearer that the lxc freezer
itself is not involved (I ran the checks you described, just to be
sure). The server has now frozen again, and this time I had a chance to
gather some information. I'm still unsure what to do about this freeze.
By "freeze" I mean that `ps aux` output stops at some point and I'm
unable to kill it with Ctrl+C. strace showed that it hangs while
reading /proc/3780/cmdline (the environ file is unreadable too). The exe
symlink points to /usr/sbin/sshd, and this time I was unable to ssh in,
while previously it was possible (so different processes end up in this
state from time to time). This process does not belong to a
cgroup (it's in the / cgroup). kill and kill -9 3780 did nothing. I've
gathered more information from /proc/3780 (attached), along with
kern.log containing some sysrq output (memory info, kernel dump and
similar). Could you help me work out what other information would be of
interest here? How can I find out where sshd hung and why? I
thought /proc/3780/syscall could help, but I couldn't find out what
the file contains, and the numbers in it are not addresses of functions
in System.map (or grep was unable to find them). Any suggestions, please?
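
On the /proc/<pid>/syscall question: as far as proc(5) describes it, the file holds the number of the system call the task is blocked in, followed by its six arguments in hex, then the stack pointer and program counter. It reads `-1` followed by the stack pointer and program counter when the task is blocked outside a system call, and `running` when the task is on a CPU. That would explain why the values do not match function addresses in System.map. Below is a minimal sketch of decoding such a line, assuming that format. The function name `parse_proc_syscall` and the sample line are made up for illustration; the sample is not the content of the attached 3780-syscall.

```python
def parse_proc_syscall(text):
    """Decode one line in the /proc/<pid>/syscall format described in proc(5)."""
    fields = text.split()
    if not fields:
        raise ValueError("empty syscall line")
    if fields[0] == "running":
        # Task is currently on a CPU; nothing else is reported.
        return {"state": "running"}
    nr = int(fields[0])
    if nr == -1:
        # Blocked, but not inside a system call: only sp and pc follow.
        return {"state": "blocked",
                "sp": int(fields[1], 16),
                "pc": int(fields[2], 16)}
    return {
        "state": "in_syscall",
        "nr": nr,                                   # system call number
        "args": [int(f, 16) for f in fields[1:7]],  # six hex arguments
        "sp": int(fields[7], 16),
        "pc": int(fields[8], 16),
    }


# Hypothetical sample line, NOT taken from the attachment.
sample = "0 0x3 0x7fff0000 0x400 0x0 0x0 0x0 0x7fff1000 0x7f0000001234"
info = parse_proc_syscall(sample)
print(info["state"], info["nr"], [hex(a) for a in info["args"]])
```

The syscall number then has to be looked up in the kernel's unistd header for the architecture in question; on x86_64, for instance, number 0 is read().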
With best regards,
--
Peter.
On Sat, 30/10/2010 at 17:36 -0700, Matt Helsley wrote:
> On Fri, Oct 29, 2010 at 04:27:40PM +0400, Пётр Волков wrote:
> > Hi. We are using lxc to separate different services into containers: for
> > this discussion we have apache+php, mysql and nginx containers serving
> > our web application. After an upgrade (I think from kernel 2.6.32 to
> > something newer; we are now using 2.6.35, but tried 2.6.34 too) we've
> > experienced the following issue: at some point nginx starts to show a
> > "504 Gateway Time-out" error, and while it is possible to ssh to the
> > server, `ps aux` hangs (with no way to stop it), it is impossible to
> > restart the apache container (it hangs on stop), and the only way to fix
> > this is to restart the server using sysrq or the power button. At the
> > same time there is nothing in the logs. I suspect apache starts to eat
> > lots of memory and the oom killer somehow freezes the container, but I
> > don't have any proof. What
>
> The OOM killer does not freeze tasks. Now if the tasks were already
> frozen and if the OOM killer selected them then I can see how that
> would be a problem. However, again I doubt that's what's happening here
> for several reasons.
>
> 1. lxc doesn't arbitrarily freeze tasks -- unless you were checkpointing
> or freezing the task yourself (or using a custom script to do
> so), the tasks in the container's cgroup should not be frozen.
>
> 2. If the task(s) are frozen then by definition they are not allocating
> memory. At best they're pinning the memory they've already
> allocated before being frozen. [ The tasks will respond to
> kill signals when thawed. ]
>
> > could you suggest for debugging this issue? What sysrq information could be
> > useful here?
>
> [ Cc'ing lxc-users@lists.sf.net for lxc-specific debugging ideas/advice. ]
>
> Here's some info on collecting and diagnosing the state of the freezer
> so that hopefully we can eliminate your concerns about it being involved
> and confirm what I've said above:
>
> If you want to figure out if the cgroup freezer is involved at all,
> debugging it requires that you be in the "host". Find out which
> process ids are your apache/nginx/etc processes. Then look at their
> cgroups in /proc/<pid>/cgroup. Keep in mind that the "/" in those
> paths isn't the same as "/" -- it's the directory the cgroup
> subsystems are mounted at (see /proc/mounts to figure out where).
> You want the line that says "freezer".
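
The lookup of the "freezer" line could be scripted roughly as follows. This is only a sketch, assuming the `hierarchy-id:subsystems:path` layout that /proc/<pid>/cgroup uses on this kernel; the function name `freezer_cgroup_path` is made up for illustration, and the sample line is the content of the 3780-cgroup attachment.

```python
def freezer_cgroup_path(cgroup_text):
    """Return the cgroup path of the hierarchy that carries the freezer
    subsystem, or None if no line mentions it."""
    for line in cgroup_text.splitlines():
        # Each line looks like "hierarchy-id:subsys[,subsys...]:path".
        parts = line.strip().split(":", 2)
        if len(parts) != 3:
            continue
        _hierarchy_id, subsystems, path = parts
        if "freezer" in subsystems.split(","):
            return path
    return None


# Content of /proc/3780/cgroup as attached to this message.
sample = "1:blkio,freezer,devices,memory,cpuacct,cpu,ns,debug,cpuset:/"
print(freezer_cgroup_path(sample))
```

On a live system the text would come from reading /proc/<pid>/cgroup; the returned path is relative to the cgroup mount point, not to the filesystem root.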
>
> Look at the cgroups mount point with the freezer subsystem in the
> cgroup(s) of these processes (it'll say "freezer" in the mount options).
> Confirm that your pids are listed in the cgroup by looking at the tasks
> file.
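
Finding the mount point and the tasks file could be sketched like this, assuming the usual six-column /proc/mounts layout (device, mount point, filesystem type, options, dump, pass). The function names and the sample mount line, including the /cgroup mount point, are hypothetical and not taken from this server.

```python
import os


def freezer_mount_point(mounts_text):
    """Return the mount point of the cgroup filesystem whose mount options
    include "freezer", or None if there is none."""
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        _device, mount_point, fstype, options = fields[:4]
        if fstype == "cgroup" and "freezer" in options.split(","):
            return mount_point
    return None


def tasks_file(mount_point, cgroup_path):
    """Join the mount point with a path taken from /proc/<pid>/cgroup."""
    # The cgroup path is relative to the mount point, so drop its leading "/".
    return os.path.join(mount_point, cgroup_path.lstrip("/"), "tasks")


# Hypothetical /proc/mounts excerpt.
sample_mounts = (
    "proc /proc proc rw,relatime 0 0\n"
    "none /cgroup cgroup rw,relatime,cpuset,freezer,memory 0 0\n"
)
mount_point = freezer_mount_point(sample_mounts)
print(mount_point, tasks_file(mount_point, "/"))
```

The pids listed in the resulting tasks file can then be compared with the pids of the apache/nginx processes.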
>
> If the freezer.state file of those cgroups contains the word "THAWED"
> then the problem lies elsewhere. If the freezer.state says "FREEZING"
> or "FROZEN" however then you'll want to look at the state of the
> processes. Some or all should be in the "D" state while "FREEZING".
> All should be in "D" state while "FROZEN".
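
Checking the process states could be done by reading the state field of /proc/<pid>/stat for each pid in the cgroup. Here is a minimal sketch, assuming the stat layout from proc(5) in which the state letter follows the parenthesised command name; `task_state` is a made-up helper name and the sample lines are invented.

```python
def task_state(stat_text):
    """Return the one-letter state (R, S, D, ...) from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ")", so split after the LAST closing parenthesis.
    after_comm = stat_text.rsplit(")", 1)[1]
    return after_comm.split()[0]


# Invented sample lines, truncated after a few fields.
frozen_like = "3780 (sshd) D 1 3780 3780 0 -1 4202752"
odd_name = "42 (my (odd) proc) S 1 42 42 0 -1 0"
print(task_state(frozen_like), task_state(odd_name))
```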
>
> "FREEZING" is an intermediate state however so it's not possible to
> determine if there's a bug based purely on the info collected so far.
> The best you can do with "FREEZING" is try and write "FROZEN" into
> freezer.state one or more times and see if it 'eventually' succeeds
> -- say within 10 seconds or 20 attempts, whichever takes longer.
> If it doesn't then you need to strace the processes and see if any
> are stuck in a syscall -- vfork perhaps. You can also try writing
> "THAWED". If it doesn't thaw on the first try then there's a bug.
>
> Whenever you write a new state to freezer.state you should read the
> file again to find out whether the state change took place. Some
> transitions are handled lazily and only take place when you ask for
> the state by reading it.
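
The write-then-read-back loop could look roughly like this. It is a sketch only: the name `request_state`, the 20 attempts and the half-second delay (about 10 seconds in total) are my choices, and treating a failed write as "not there yet" is an assumption about how the kernel reports an incomplete freeze. The demo runs against a plain temporary file standing in for freezer.state, so it exercises only the loop logic, not the freezer.

```python
import os
import tempfile
import time


def request_state(state_file, wanted, attempts=20, delay=0.5):
    """Write `wanted` to a freezer.state-like file and re-read it until the
    file reports that state. Returns True on success, False otherwise."""
    for _ in range(attempts):
        try:
            with open(state_file, "w") as f:
                f.write(wanted)
        except OSError:
            # Assumption: the write can fail while the transition is
            # incomplete; fall through and check the state anyway.
            pass
        # Reading the file back is what reveals the current state.
        with open(state_file) as f:
            current = f.read().strip()
        if current == wanted:
            return True
        time.sleep(delay)
    return False


# Demo with a stand-in file; a real path would be <mount>/<cgroup>/freezer.state.
with tempfile.TemporaryDirectory() as tmp:
    fake_state = os.path.join(tmp, "freezer.state")
    with open(fake_state, "w") as f:
        f.write("THAWED\n")
    ok = request_state(fake_state, "FROZEN", attempts=2, delay=0)
    print(ok)
```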
>
> That's the way to figure out if the freezer is involved and, if so,
> where it's stuck.
>
> Cheers,
> -Matt Helsley
1:blkio,freezer,devices,memory,cpuacct,cpu,ns,debug,cpuset:/
7d0bc56000-7d0bcca000 r-xp 00000000 fe:00 221 /usr/sbin/sshd
7d0bec9000-7d0becb000 r--p 00073000 fe:00 221 /usr/sbin/sshd
7d0becb000-7d0becc000 rw-p 00075000 fe:00 221 /usr/sbin/sshd
7d0becc000-7d0befd000 rw-p 00000000 00:00 0 [heap]
312a1beb000-312a1bf6000 r-xp 00000000 08:03 7686 /lib64/libnss_files-2.12.1.so (deleted)
312a1bf6000-312a1df6000 ---p 0000b000 08:03 7686 /lib64/libnss_files-2.12.1.so (deleted)
312a1df6000-312a1df7000 r--p 0000b000 08:03 7686 /lib64/libnss_files-2.12.1.so (deleted)
312a1df7000-312a1df8000 rw-p 0000c000 08:03 7686 /lib64/libnss_files-2.12.1.so (deleted)
312a1df8000-312a1e02000 r-xp 00000000 08:03 7684 /lib64/libnss_nis-2.12.1.so (deleted)
312a1e02000-312a2001000 ---p 0000a000 08:03 7684 /lib64/libnss_nis-2.12.1.so (deleted)
312a2001000-312a2002000 r--p 00009000 08:03 7684 /lib64/libnss_nis-2.12.1.so (deleted)
312a2002000-312a2003000 rw-p 0000a000 08:03 7684 /lib64/libnss_nis-2.12.1.so (deleted)
312a2003000-312a2018000 r-xp 00000000 08:03 7687 /lib64/libnsl-2.12.1.so (deleted)
312a2018000-312a2217000 ---p 00015000 08:03 7687 /lib64/libnsl-2.12.1.so (deleted)
312a2217000-312a2218000 r--p 00014000 08:03 7687 /lib64/libnsl-2.12.1.so (deleted)
312a2218000-312a2219000 rw-p 00015000 08:03 7687 /lib64/libnsl-2.12.1.so (deleted)
312a2219000-312a221b000 rw-p 00000000 00:00 0
312a221b000-312a2222000 r-xp 00000000 08:03 7591 /lib64/libnss_compat-2.12.1.so (deleted)
312a2222000-312a2421000 ---p 00007000 08:03 7591 /lib64/libnss_compat-2.12.1.so (deleted)
312a2421000-312a2422000 r--p 00006000 08:03 7591 /lib64/libnss_compat-2.12.1.so (deleted)
312a2422000-312a2423000 rw-p 00007000 08:03 7591 /lib64/libnss_compat-2.12.1.so (deleted)
312a2423000-312a2425000 r-xp 00000000 08:03 7682 /lib64/libdl-2.12.1.so (deleted)
312a2425000-312a2625000 ---p 00002000 08:03 7682 /lib64/libdl-2.12.1.so (deleted)
312a2625000-312a2626000 r--p 00002000 08:03 7682 /lib64/libdl-2.12.1.so (deleted)
312a2626000-312a2627000 rw-p 00003000 08:03 7682 /lib64/libdl-2.12.1.so (deleted)
312a2627000-312a2784000 r-xp 00000000 08:03 7689 /lib64/libc-2.12.1.so (deleted)
312a2784000-312a2983000 ---p 0015d000 08:03 7689 /lib64/libc-2.12.1.so (deleted)
312a2983000-312a2987000 r--p 0015c000 08:03 7689 /lib64/libc-2.12.1.so (deleted)
312a2987000-312a2988000 rw-p 00160000 08:03 7689 /lib64/libc-2.12.1.so (deleted)
312a2988000-312a298d000 rw-p 00000000 00:00 0
312a298d000-312a2995000 r-xp 00000000 08:03 7473 /lib64/libcrypt-2.12.1.so (deleted)
312a2995000-312a2b94000 ---p 00008000 08:03 7473 /lib64/libcrypt-2.12.1.so (deleted)
312a2b94000-312a2b95000 r--p 00007000 08:03 7473 /lib64/libcrypt-2.12.1.so (deleted)
312a2b95000-312a2b96000 rw-p 00008000 08:03 7473 /lib64/libcrypt-2.12.1.so (deleted)
312a2b96000-312a2bc4000 rw-p 00000000 00:00 0
312a2bc4000-312a2bc6000 r-xp 00000000 08:03 7483 /lib64/libutil-2.12.1.so (deleted)
312a2bc6000-312a2dc5000 ---p 00002000 08:03 7483 /lib64/libutil-2.12.1.so (deleted)
312a2dc5000-312a2dc6000 r--p 00001000 08:03 7483 /lib64/libutil-2.12.1.so (deleted)
312a2dc6000-312a2dc7000 rw-p 00002000 08:03 7483 /lib64/libutil-2.12.1.so (deleted)
312a2dc7000-312a2ddf000 r-xp 00000000 08:03 86 /lib64/libz.so.1.2.5
312a2ddf000-312a2fde000 ---p 00018000 08:03 86 /lib64/libz.so.1.2.5
312a2fde000-312a2fdf000 r--p 00017000 08:03 86 /lib64/libz.so.1.2.5
312a2fdf000-312a2fe0000 r
...
Attachment: 3780-cgroup (Size: 0.06KB)
Attachment: 3780-maps (Size: 6.14KB)
Attachment: 3780-sched (Size: 2.37KB)
Attachment: 3780-schedstat (Size: 0.01KB)
Attachment: 3780-stat (Size: 0.21KB)
Attachment: 3780-statm (Size: 0.02KB)
Attachment: 3780-status (Size: 0.72KB)
Attachment: 3780-syscall (Size: 0.03KB)
Attachment: kern.log (Size: 129.49KB)