OpenVZ Forum


IBM RAID controller re-mount lock (Using a RAID controller causes a system lock when doing a restore after a mount)
IBM RAID controller re-mount lock [message #36976] Wed, 05 August 2009 14:35
maximilian.csuk
Messages: 2
Registered: August 2009
Junior Member
Dear OpenVZ community,

We have an IBM system (System x3400) with an IBM ServeRAID 8K-I controller, on which we are running Debian Lenny 4.1.2-25. The kernel is the OpenVZ kernel 2.6.26-2-openvz-686.

The containers are located on a separate ext3 partition, which is mounted like this:
/dev/drbd0 on /mnt/hq-drbd-stor01 type ext3 (rw)

Specifically, we experience a lockup when we unmount and remount the containers' partition and then restore a checkpointed container.

The following steps work:
vzctl chkpnt 77
vzctl stop 77
/etc/init.d/openvz stop
/etc/init.d/openvz start
vzctl restore 77


But these don't work:
vzctl chkpnt 77
vzctl stop 77
/etc/init.d/openvz stop
umount /dev/drbd0
mount /dev/drbd0
/etc/init.d/openvz start
vzctl restore 77

The restore produces the lockup.

The output until it hangs:
Restoring VE ...
Starting VE ...
VE is mounted
        undump...
Adding IP address(es): 10.20.60.20
Setting CPU units: 1000
Configure meminfo: 65536

Nothing more is displayed after this.
The host shows 100% system CPU usage, and the load climbs to about 10. As we understand it, a container's processes run as kernel-managed processes on the host; the container's currently active process (as far as we can tell, it is a different one on each attempt) consumes all of the host's CPU.

OpenVZ-version is 036test001.

We already tried changing the RAID-controller driver in use from aacraid-1.1.5.2449 to aacraid-1.1.5.2467, but without any luck.
Then we tried using USB disks instead of /dev/drbd0; with those, the restore worked even after re-mounting, so the issue appears to lie with the RAID controller.
We also tried running a DRBD device on top of two USB disks instead of ServeRAID-attached disks; this constellation works as well, so we can assume that DRBD itself is working correctly.

Is this a known issue? If yes, (how) can it be solved?
Is there a way to debug this and provide you and us with more information?

Thank you for your time.
Maximilian Csuk
ipcenter.at
Re: IBM RAID controller re-mount lock [message #36984 is a reply to message #36976] Thu, 06 August 2009 09:00
maratrus
Messages: 1495
Registered: August 2007
Location: Moscow
Senior Member
Hello,

- Let's increase the verbosity of the vzctl utility. In /etc/vz/vz.conf, please set the LOG_LEVEL and VERBOSE variables both to 10.
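For reference, a minimal sketch of the corresponding lines in /etc/vz/vz.conf (the file uses shell-style VAR=value assignments):

LOG_LEVEL=10
VERBOSE=10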

-
Quote:

OpenVZ-version is 036test001.


Sorry, I didn't catch this statement.

-
Quote:

openvz-kernel 2.6.26-2-openvz-686.


Do you have a chance to check the stable rhel5-2.6.18 kernel, or at least a 2.6.18 kernel? I'm not sure that you do, but nevertheless...
http://download.openvz.org/kernel/

- Are there any suspicious messages in dmesg or in logs?

-
Quote:

The host shows 100% system CPU usage, and the load climbs to about 10. As we understand it, a container's processes run as kernel-managed processes on the host; the container's currently active process (as far as we can tell, it is a different one on each attempt) consumes all of the host's CPU.


Do you have a chance to get the PID of the process that consumes 100% CPU? What is the output of /proc/<pid>/wchan?
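A minimal sketch of how to collect this, assuming the spinning process shows up in top (<pid> below is a placeholder for the actual process ID):

top -b -n 1 | head -n 20     # note the PID of the process at ~100% CPU
cat /proc/<pid>/wchan; echo  # prints the kernel function the process waits in, or "0" if it is not blocked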

-
It may be useful to gather alt-sysrq-* outputs (a sketch of how to trigger them follows below):
1) "m" - memory info dump
2) "p" - CPU registers; several times please, twice the number of CPUs
3) "a" - scheduler statistics; 3 times
4) "w" - additional scheduler info; 3 times
5) "t" - call traces of all processes. Warning: this is a resource-consuming operation. At least twice.



Re: IBM RAID controller re-mount lock [message #36990 is a reply to message #36984] Fri, 07 August 2009 09:07
maximilian.csuk
Messages: 2
Registered: August 2009
Junior Member
Quote:
Let's increase the verbosity of the vzctl utility. In /etc/vz/vz.conf, please set the LOG_LEVEL and VERBOSE variables both to 10.

Done.

Quote:
Do you have a chance to get the PID of the process that consumes 100% CPU? What is the output of /proc/<pid>/wchan?

Yes. The output is "0".

Quote:
Are there any suspicious messages in dmesg or in logs?

I am not sure, but I guess not. Please have a look:
/var/log/vzctl:
2009-08-07T09:40:01+0200 vzctl : VE 77 : VE start in progress...
2009-08-07T09:40:01+0200 vzctl : VE 77 : Restoring completed succesfully
2009-08-07T09:41:28+0200 vzctl : VE 77 : Setting up checkpoint...
2009-08-07T11:41:28+0400 vzctl : VE 77 :        suspend...
2009-08-07T11:41:28+0400 vzctl : VE 77 :        dump...
2009-08-07T11:41:29+0400 vzctl : VE 77 :        kill...
2009-08-07T09:41:29+0200 vzctl : VE 77 : Running: /usr/lib/vzctl/scripts/vps-net_del
2009-08-07T09:41:29+0200 vzctl : VE 77 : Running: /usr/sbin/vzquota stat 77 -f
2009-08-07T09:41:29+0200 vzctl : VE 77 : Running: /usr/sbin/vzquota off 77
2009-08-07T09:41:29+0200 vzctl : VE 77 : VE is unmounted
2009-08-07T09:41:29+0200 vzctl : VE 77 : Checkpointing completed succesfully
2009-08-07T09:41:36+0200 vzctl : VE 77 : Restoring VE ...
2009-08-07T09:41:36+0200 vzctl : VE 77 : Starting VE ...
2009-08-07T09:41:36+0200 vzctl : VE 77 : Running: /usr/sbin/vzquota show 77
2009-08-07T09:41:36+0200 vzctl : VE 77 : Running: /usr/sbin/vzquota on 77 -r 0 -b 1048676 -B 1153124 -i 200100 -I 220100 -e 0 -n 0 -s 0
2009-08-07T09:41:36+0200 vzctl : VE 77 : Mounting root: /var/lib/vz/root/77 /var/lib/vz/private/77
2009-08-07T09:41:36+0200 vzctl : VE 77 : VE is mounted
2009-08-07T09:41:36+0200 vzctl : VE 77 : Adding IP address(es): 10.20.60.20
2009-08-07T09:41:36+0200 vzctl : VE 77 : Running: /usr/lib/vzctl/scripts/vps-net_add
2009-08-07T09:41:38+0200 vzctl : VE 77 : Setting CPU units: 1000
2009-08-07T09:41:38+0200 vzctl : VE 77 : Configure meminfo: 65536
2009-08-07T09:41:38+0200 vzctl : VE 77 : Running: /usr/sbin/vzquota stat 77 -f
2009-08-07T09:41:38+0200 vzctl : VE 77 : Running: vzquota setlimit 77 -b 1048576 -B 1153024 -i 200000 -I 220000 -e 0 -n 0
2009-08-07T09:41:38+0200 vzctl : VE 77 : VE start in progress...
2009-08-07T09:41:38+0200 vzctl : VE 77 : Restoring completed succesfully
2009-08-07T09:42:11+0200 vzctl : VE 77 : Setting up checkpoint...
2009-08-07T11:42:11+0400 vzctl : VE 77 :        suspend...
2009-08-07T11:42:11+0400 vzctl : VE 77 :        dump...
2009-08-07T11:42:11+0400 vzctl : VE 77 :        kill...
2009-08-07T09:42:11+0200 vzctl : VE 77 : Running: /usr/lib/vzctl/scripts/vps-net_del
2009-08-07T09:42:11+0200 vzctl : VE 77 : Running: /usr/sbin/vzquota stat 77 -f
2009-08-07T09:42:11+0200 vzctl : VE 77 : Running: /usr/sbin/vzquota off 77
2009-08-07T09:42:11+0200 vzctl : VE 77 : VE is unmounted
2009-08-07T09:42:11+0200 vzctl : VE 77 : Checkpointing completed succesfully
2009-08-07T09:42:12+0200 vzctl : VE 77 : Restoring VE ...
2009-08-07T09:42:12+0200 vzctl : VE 77 : Starting VE ...
2009-08-07T09:42:12+0200 vzctl : VE 77 : Running: /usr/sbin/vzquota show 77
2009-08-07T09:42:12+0200 vzctl : VE 77 : Running: /usr/sbin/vzquota on 77 -r 0 -b 1048676 -B 1153124 -i 200100 -I 220100 -e 0 -n 0 -s 0
2009-08-07T09:42:12+0200 vzctl : VE 77 : Mounting root: /var/lib/vz/root/77 /var/lib/vz/private/77
2009-08-07T09:42:12+0200 vzctl : VE 77 : VE is mounted
2009-08-07T09:42:12+0200 vzctl : VE 77 : Adding IP address(es): 10.20.60.20
2009-08-07T09:42:12+0200 vzctl : VE 77 : Running: /usr/lib/vzctl/scripts/vps-net_add
2009-08-07T09:42:14+0200 vzctl : VE 77 : Setting CPU units: 1000
2009-08-07T09:42:14+0200 vzctl : VE 77 : Configure meminfo: 65536
2009-08-07T09:42:14+0200 vzctl : VE 77 : Running: /usr/sbin/vzquota stat 77 -f
2009-08-07T09:42:14+0200 vzctl : VE 77 : Running: vzquota setlimit 77 -b 1048576 -B 1153024 -i 200000 -I 220000 -e 0 -n 0

Nothing more after this...

In dmesg, shortly before the lockup, it says:
 [74677.049342] CT: 77: started

But then more and more messages appear, showing that a particular PID is hanging. On this try it was sendmail-mta, but other processes (such as apache) have been the culprit too:
[75007.309371] BUG: soft lockup - CPU#3 stuck for 61s! [sendmail-mta:13227]
[75007.309371] Modules linked in: vzethdev vznetdev simfs vzrst vzcpt tun vzmon xt_tcpudp xt_length ipt_ttl xt_tcpmss xt_TCPMSS iptable_mangle iptable_filter xt_multiport xt_limit xt_dscp ipt_REJECT ip_tables x_tables vzdquota vzdev drbd cn bridge ipv6 bonding loop iTCO_wdt i2c_i801 rng_core pcspkr shpchp parport_pc parport i2c_core i5000_edac edac_core pci_hotplug container button joydev evdev ext3 jbd mbcache usbhid hid ff_memless usb_storage sg sr_mod cdrom 8139too ata_piix ata_generic libata dock ses enclosure sd_mod 8139cp mii ide_pci_generic ide_core ehci_hcd uhci_hcd usbcore tg3 e1000e aacraid scsi_mod thermal processor fan thermal_sys [last unloaded: simfs]
[75007.309371]
[75007.309371] Pid: 13227, comm: sendmail-mta Not tainted (2.6.26-2-openvz-686 #1 036test001)
[75007.309371] EIP: 0060:[<c013d31f>] EFLAGS: 00000257 CPU: 3
[75007.309371] EIP is at up_read+0x7/0xf
[75007.309371] EAX: f7996bf8 EBX: f156121c ECX: 00000000 EDX: 00000001
[75007.309371] ESI: f7996bf8 EDI: f7996bc0 EBP: 00000090 ESP: f148fc2c
[75007.309371]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[75007.309371] CR0: 8005003b CR2: b7856000 CR3: 37c3e000 CR4: 000006f0
[75007.309371] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[75007.309371] DR6: ffff0ff0 DR7: 00000400
[75007.309371]  [<c011b3f4>] ? do_page_fault+0x50e/0x8c6
[75007.309371]  [<c0131599>] ? run_timer_softirq+0x30/0x1d1
[75007.309371]  [<c012dff2>] ? __do_softirq+0x9b/0x14e
[75007.309371]  [<c011aee6>] ? do_page_fault+0x0/0x8c6
[75007.309371]  [<c02c5672>] ? error_code+0x72/0x78
[75007.309371]  [<c02c00d8>] ? init_cyrix+0x168/0x46b
[75007.309371]  [<c0161b69>] ? file_read_actor+0x2b/0xcc
[75007.309371]  [<c016398d>] ? generic_file_aio_read+0x2e2/0x4da
[75007.309371]  [<c0184346>] ? do_sync_read+0xbf/0xfe
[75007.309371]  [<c013a2c4>] ? autoremove_wake_function+0x0/0x2d
[75007.309371]  [<f9191224>] ? file_pread+0x3c/0x66 [vzrst]
[75007.309371]  [<f9190f5d>] ? _rst_get_object+0x85/0x98 [vzrst]
[75007.309371]  [<f9195d1d>] ? do_rst_mm+0x1642/0x1b40 [vzrst]
[75007.309371]  [<f919634b>] ? rst_mm_complete+0x130/0x1f5 [vzrst]
[75007.309371]  [<f918fe58>] ? hook+0xa13/0x19db [vzrst]
[75007.309371]  [<c01db7ed>] ? cic_free_func+0x4b/0x53
[75007.309371]  [<c012cd42>] ? do_exit+0x642/0x646
[75007.309371]  [<c0197941>] ? mntput_no_expire+0x13/0xd2
[75007.309371]  [<c01079c7>] ? __switch_to+0xa5/0x125
[75007.309371]  [<c0125314>] ? schedule_tail+0x17/0x44
[75007.309371]  [<c01087fa>] ? ret_from_fork+0x6/0xe
[75007.309371]  [<f918f445>] ? hook+0x0/0x19db [vzrst]
[75007.309371]  [<f918f445>] ? hook+0x0/0x19db [vzrst]
[75007.309371]  [<c0109597>] ? kernel_thread_helper+0x7/0x10
[75007.309371]  =======================


Sorry for the long output, but this is what the sysrq calls gave me:

Output of sysrq-m:
[  739.006053] Show Memory
[  739.006053] Mem-info:
[  739.006053] DMA per-cpu:
[  739.006053] CPU    0: hi:    0, btch:   1 usd:   0
[  739.006053] CPU    1: hi:    0, btch:   1 usd:   0
[  739.006053] CPU    2: hi:    0, btch:   1 usd:   0
[  739.006053] CPU    3: hi:    0, btch:   1 usd:   0
[  739.006053] Normal per-cpu:
[  739.006053] CPU    0: hi:  186, btch:  31 usd: 132
[  739.006053] CPU    1: hi:  186, btch:  31 usd:  55
[  739.006053] CPU    2: hi:  186, btch:  31 usd: 156
[  739.006053] CPU    3: hi:  186, btch:  31 usd:  88
[  739.006053] HighMem per-cpu:
[  739.006053] CPU    0: hi:  186, btch:  31 usd: 156
[  739.006053] CPU    1: hi:  186, btch:  31 usd:  64
[  739.006053] CPU    2: hi:  186, btch:  31 usd: 112
[  739.006053] CPU    3: hi:  186, btch:  31 usd:  41
[  739.006053] Active:7699 inactive:4989 dirty:53 writeback:0 unstable:0
[  739.006053]  free:1535325 slab:2934 mapped:1615 pagetables:136 bounce:0
[  739.006053] DMA free:12852kB min:68kB low:84kB high:100kB active:0kB inactive:0kB present:16240kB pages_scanned:0 all_unreclaimable? no
[  739.006053] lowmem_reserve[]: 0 872 7089 7089
[  739.006053] Normal free:798948kB min:3740kB low:4672kB high:5608kB active:12220kB inactive:5760kB present:893200kB pages_scanned:0 all_unreclaimable? no
[  739.006053] lowmem_reserve[]: 0 0 49735 49735
[  739.006053] HighMem free:5329500kB min:512kB low:7180kB high:13852kB active:18576kB inactive:14196kB present:6366080kB pages_scanned:0 all_unreclaimable? no
[  739.006053] lowmem_reserve[]: 0 0 0 0
[  739.006053] DMA: 3*4kB 3*8kB 5*16kB 2*32kB 4*64kB 3*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 2*4096kB = 12852kB
[  739.006053] Normal: 107*4kB 315*8kB 236*16kB 170*32kB 91*64kB 47*128kB 25*256kB 7*512kB 5*1024kB 7*2048kB 182*4096kB = 798916kB
[  739.006053] HighMem: 3*4kB 19*8kB 41*16kB 112*32kB 83*64kB 44*128kB 12*256kB 5*512kB 2*1024kB 3*2048kB 1294*4096kB = 5329396kB
[  739.006053] 10767 total pagecache pages
[  739.006053] Swap cache: add 0, delete 0, find 0/0
[  739.006053] Free swap  = 979956kB
[  739.006053] Total swap = 979956kB
[  739.024690] 1835008 pages of RAM
[  739.024690] 1605632 pages of HIGHMEM
[  739.024690] 279573 reserved pages
[  739.024690] 10533 pages shared
[  739.024690] 0 pages swap cached
[  739.024690] 53 pages dirty
[  739.024690] 0 pages writeback
[  739.024690] 1615 pages mapped
[  739.024690] 2934 pages slab
[  739.024690] 136 pages pagetables
[  739.024690] Top 10 caches:
[  739.024690] ext3_inode_cache     : size     461760 objsize        512
[  739.024690] sysfs_dir_cache      : size     434176 objsize         44
[  739.024690] radix_tree_node      : size     413696 objsize        288
...

Re: IBM RAID controller re-mount lock [message #37032 is a reply to message #36990] Thu, 13 August 2009 07:55
maratrus
Messages: 1495
Registered: August 2007
Location: Moscow
Senior Member
Thank you very much for the provided information, and sorry for the delay.
I'd recommend filing a bug in Bugzilla:
http://bugzilla.openvz.org/

[Updated on: Thu, 13 August 2009 07:56]
