OpenVZ 7 containers crashing with ext4 errors [message #53655]
Fri, 03 July 2020 15:06
allan.talver
We have been running OpenVZ 6 in our environments for several years. The platform has been stable and predictable. Recently we started evaluating OpenVZ 7 as the replacement. In most aspects, OpenVZ 7 has proven to be good and suitable for our purpose. However, recently we began to experience seemingly random crashes with symptoms pointing to the ext4 filesystem and ploop. When the crash happens, the virtual container is left with its disk in a read-only state. A restart is not successful due to errors present in the filesystem. After running fsck manually, the container is able to start, and we have not experienced any data loss. However, even without data loss, such events reduce our confidence in running production workloads with critical data on these servers.
In total we have now had 4 such events. The first 3 were Ubuntu 16.04 containers that were migrated from OpenVZ 6 to OpenVZ 7. Before migration with ovztransfer.sh, the server disks were converted from simfs to ploop with vzctl convert. Initially we thought that the issue might be something in our migration procedure or something specific to the Ubuntu 16.04 operating system, because no Ubuntu 18.04 server (created fresh on OpenVZ 7) had crashed. However, two days ago the container affected by the latest crash was an Ubuntu 18.04 one that was never migrated from OpenVZ 6 (although it has been migrated between OpenVZ 7 nodes).
Another thing we have noticed is that the crashes seem to happen at roughly the same time pcompact is running on the hardware node. Also, 2 out of the 3 containers that had been migrated from OpenVZ 6 crashed the very next night after the migration.
Are these errors something that the community has seen before and could help us explain?
In all cases, the log output in the hardware node's dmesg has been similar, as follows:
[2020-05-29 02:02:20] WARNING: CPU: 12 PID: 317821 at fs/ext4/ext4_jbd2.c:266 __ext4_handle_dirty_metadata+0x1c2/0x220 [ext4]
[2020-05-29 02:02:20] Modules linked in: nfsv3 nfs_acl ip6table_mangle nf_log_ipv4 nf_log_common xt_LOG nfsv4 dns_resolver nfs lockd grace fscache xt_multiport nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack xt_comment binfmt_misc xt_CHECKSUM iptable_mangle ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 tun 8021q garp mrp devlink ip6table_filter ip6_tables iptable_filter bonding ebtable_filter ebt_among ebtables sunrpc iTCO_wdt iTCO_vendor_support sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr i2c_i801 joydev mei_me lpc_ich mei sg ioatdma wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad pcc_cpufreq ip_vs nf_conntrack libcrc32c br_netfilter veth overlay ip6_vzprivnet
[2020-05-29 02:02:20] ip6_vznetstat ip_vznetstat ip_vzprivnet vziolimit vzevent vzlist vzstat vznetstat vznetdev vzmon vzdev bridge stp llc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic ast i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ixgbe ahci drm libahci libata megaraid_sas crct10dif_pclmul crct10dif_common crc32c_intel mdio ptp pps_core drm_panel_orientation_quirks dca dm_mirror dm_region_hash dm_log dm_mod pio_kaio pio_nfs pio_direct pfmt_raw pfmt_ploop1 ploop
[2020-05-29 02:02:20] CPU: 12 PID: 317821 Comm: e4defrag2 ve: 0 Kdump: loaded Not tainted 3.10.0-1062.12.1.vz7.131.10 #1 131.10
[2020-05-29 02:02:20] Hardware name: Supermicro SYS-1028R-WC1RT/X10DRW-iT, BIOS 2.0a 07/26/2016
[2020-05-29 02:02:20] Call Trace:
[2020-05-29 02:02:20] [<ffffffff81baebc7>] dump_stack+0x19/0x1b
[2020-05-29 02:02:20] [<ffffffff8149bdc8>] __warn+0xd8/0x100
[2020-05-29 02:02:20] [<ffffffff8149bf0d>] warn_slowpath_null+0x1d/0x20
[2020-05-29 02:02:20] [<ffffffffc047cd82>] __ext4_handle_dirty_metadata+0x1c2/0x220 [ext4]
[2020-05-29 02:02:20] [<ffffffffc0474f44>] ext4_ext_split+0x304/0x9a0 [ext4]
[2020-05-29 02:02:20] [<ffffffffc0476dfd>] ext4_ext_insert_extent+0x7bd/0x8d0 [ext4]
[2020-05-29 02:02:20] [<ffffffffc0479a7f>] ext4_ext_map_blocks+0x5cf/0xf60 [ext4]
[2020-05-29 02:02:20] [<ffffffffc0446676>] ext4_map_blocks+0x136/0x6b0 [ext4]
[2020-05-29 02:02:20] [<ffffffffc047496c>] ? ext4_alloc_file_blocks.isra.36+0xbc/0x2f0 [ext4]
[2020-05-29 02:02:20] [<ffffffffc047498f>] ext4_alloc_file_blocks.isra.36+0xdf/0x2f0 [ext4]
[2020-05-29 02:02:20] [<ffffffffc047b6bd>] ext4_fallocate+0x15d/0x990 [ext4]
[2020-05-29 02:02:20] [<ffffffff8166cb78>] ? __sb_start_write+0x58/0x120
[2020-05-29 02:02:20] [<ffffffff81666c72>] vfs_fallocate+0x142/0x1e0
[2020-05-29 02:02:20] [<ffffffff81667cdb>] SyS_fallocate+0x5b/0xa0
[2020-05-29 02:02:20] [<ffffffff81bc1fde>] system_call_fastpath+0x25/0x2a
[2020-05-29 02:02:20] ---[ end trace cf8fe0ecbf57efcc ]---
[2020-05-29 02:02:20] EXT4-fs: ext4_ext_split:1139: aborting transaction: error 28 in __ext4_handle_dirty_metadata
[2020-05-29 02:02:20] EXT4-fs error (device ploop52327p1): ext4_ext_split:1139: inode #325519: block 6002577: comm e4defrag2: journal_dirty_metadata failed: handle type 3 started at line 4741, credits 8/0, errcode -28
[2020-05-29 02:02:20] Aborting journal on device ploop52327p1-8.
[2020-05-29 02:02:20] EXT4-fs (ploop52327p1): Remounting filesystem read-only
[2020-05-29 02:02:20] EXT4-fs error (device ploop52327p1) in ext4_free_blocks:4915: Journal has aborted
[2020-05-29 02:02:20] EXT4-fs error (device ploop52327p1) in ext4_free_blocks:4915: Journal has aborted
[2020-05-29 02:02:20] EXT4-fs error (device ploop52327p1) in ext4_reserve_inode_write:5360: Journal has aborted
[2020-05-29 02:02:20] EXT4-fs error (device ploop52327p1) in ext4_alloc_file_blocks:4753: error 28
Thanks!
Re: OpenVZ 7 containers crashing with ext4 errors [message #53656 is a reply to message #53655]
Fri, 03 July 2020 16:07
khorenko
Hi,
1) Update: vz7 Update 14 has quite a number of ploop-related fixes (kernel vz7.151.14).
2) If you face this issue again on vz7 u14, please file a bug at bugs.openvz.org
(with full logs, not just a snippet).
3)
Quote:
EXT4-fs: ext4_ext_split:1139: aborting transaction: error 28 in __ext4_handle_dirty_metadata
"error 28" means space shortage
Quote:
#define ENOSPC 28 /* No space left on device */
So please check if ploop usage is close to its size.
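For example, something like this on the hardware node (a minimal sketch; CTID 4001 is just a placeholder and the default /vz layout is assumed):

# Filesystem usage as the container sees it
vzctl exec 4001 df -h /
# Size the ploop image actually occupies on the host
du -sh /vz/private/4001/root.hdd/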
4) I don't know if you have messages like
Quote:
[2040002.704309] Purging lru entry from extent tree for inode 356516155 (map_size=50243 ratio=12612%)
[2040002.704318] max_extent_map_pages=16384 is too low for ploop_io_images_size=7914092756992 bytes
if you do, increase the following parameter, say by 10 times
/sys/module/pio_direct/parameters/max_extent_map_pages
and check if it helps.
The increase can be done on the fly, no reboot required.
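For example (a minimal sketch; the new value assumes the default of 16384 shown in the message above, and the setting does not persist across reboots):

# Current value
cat /sys/module/pio_direct/parameters/max_extent_map_pages
# Raise it roughly 10x on the fly; takes effect immediately
echo 163840 > /sys/module/pio_direct/parameters/max_extent_map_pages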
But first of all - update!
If your problem is solved - please, report it!
It's even more important than reporting the problem itself...
Re: OpenVZ 7 containers crashing with ext4 errors [message #53664 is a reply to message #53656]
Wed, 22 July 2020 11:45
allan.talver
Hello
Thank you for the reply! Following your suggestion, we updated all our existing OpenVZ 7 nodes to Update 14 over the past two weeks. We have not yet experienced any crashes on the updated nodes (one crash did happen on 11 July on a node that had not yet been updated).
As was pointed out, it seems that disk space could be a contributing factor to the issue. But I can assure you that the disks of these failing containers are definitely not full. Some have very low utilisation (around 30%). We have noticed another behaviour which we think is possibly related to the crashes we have experienced, and which also points to issues with not enough disk space being available. While pcompact is running, some virtual containers show extreme changes in disk utilisation. Usually the disk suddenly shows as full and goes back down to normal several times while pcompact runs. One example of pcompact.log output while one such container is being compacted:
2020-07-22T02:00:12+0200 pcompact : Inspect 7a81d5ef-9a70-4a20-bb57-cf38f45b2926
2020-07-22T02:00:12+0200 pcompact : Inspect /vz/private/4001/root.hdd/DiskDescriptor.xml
2020-07-22T02:00:12+0200 pcompact : ploop=107520MB image=39805MB data=20760MB balloon=0MB
2020-07-22T02:00:12+0200 pcompact : Rate: 17.7 (threshold=10)
2020-07-22T02:00:12+0200 pcompact : Start compacting (to free 13669MB)
2020-07-22T02:00:12+0200 : Start defrag dev=/dev/ploop12981p1 mnt=/vz/root/4001 blocksize=2048
2020-07-22T02:09:48+0200 : Trying to find free extents bigger than 0 bytes granularity=1048576
2020-07-22T02:09:49+0200 pcompact : ploop=107520MB image=29687MB data=20767MB balloon=0MB
2020-07-22T02:09:49+0200 pcompact : Stats: uuid=7a81d5ef-9a70-4a20-bb57-cf38f45b2926 ploop_size=107520MB image_size_before=39805MB image_size_after=29687MB compaction_time=577.227s type=online
2020-07-22T02:09:49+0200 pcompact : End compacting
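To capture this fluctuation, a minimal monitoring sketch that can be run on the host while pcompact is active (CTID 4001 and the log path are placeholders; assumes the container root is mounted under the default /vz/root):

# Sample the container's disk usage every 5 seconds during the pcompact window
while true; do
    date '+%F %T'
    df -h /vz/root/4001 | tail -1
    sleep 5
done >> /root/ct4001-df-during-pcompact.log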
And then we have one container where disk usage stays near 100% (but fluctuating) for 2 hours until pcompact times out (I tried attaching a screenshot, but got an error that "Attachment is too big", even though the file was quite small). pcompact.log shows:
2020-07-22T02:00:01+0200 pcompact : Inspect 240e4613-e12c-46b1-bc06-d001b12463c8
2020-07-22T02:00:01+0200 pcompact : Inspect /vz/private/4116/root.hdd/DiskDescriptor.xml
2020-07-22T02:00:01+0200 pcompact : ploop=261120MB image=116695MB data=87568MB balloon=0MB
2020-07-22T02:00:01+0200 pcompact : Rate: 11.2 (threshold=10)
2020-07-22T02:00:01+0200 pcompact : Start compacting (to free 16070MB)
2020-07-22T02:00:01+0200 : Start defrag dev=/dev/ploop48627p1 mnt=/vz/root/4116 blocksize=2048
2020-07-22T04:00:21+0200 : Error in wait_pid (balloon.c:967): The /usr/sbin/e4defrag2 process killed by signal 15
2020-07-22T04:00:21+0200 : /usr/sbin/e4defrag2 exited with error
2020-07-22T04:00:21+0200 : Trying to find free extents bigger than 0 bytes granularity=1048576
2020-07-22T04:00:23+0200 pcompact : ploop=261120MB image=100782MB data=87487MB balloon=0MB
2020-07-22T04:00:23+0200 pcompact : Stats: uuid=240e4613-e12c-46b1-bc06-d001b12463c8 ploop_size=261120MB image_size_before=116695MB image_size_after=100782MB compaction_time=7221.741s type=online
2020-07-22T04:00:23+0200 pcompact : End compacting
This node is running a MySQL server which shows errors during these 2 hours (different errors pointing to the disk being full). Eventually MySQL crashes.
We'll continue to monitor and report back how it goes. But has anyone experienced such fluctuations in disk utilisation while pcompact is running? Is it somehow expected? How do we get around applications failing due to the disk showing as full (even when it actually is not)? Worth mentioning that all the issues described in this post happen on newly created Ubuntu 18.04 containers (as opposed to my initial post, where the issues were mostly related to 16.04 containers migrated from OpenVZ 6).
Thanks!
[Updated on: Wed, 22 July 2020 11:50]
Re: OpenVZ 7 containers crashing with ext4 errors [message #53682 is a reply to message #53655]
Mon, 24 August 2020 16:48
nathan.brownrice
Hello All, we're having this same issue as well.
The issue is that overnight, a VPS's filesystem will go read-only. We've seen this on no fewer than 5-10 different VPSs since switching to ovz7 in the last year. In some cases the filesystem can be repaired using the normal recovery methods (i.e. https://virtuozzosupport.force.com/s/article/000014682 and https://inertz.org/container-corruption-easy-repair-using-fsck/), but sometimes things are irrecoverable and we have to restore the entire container from backup. This is a pretty big deal.
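For reference, the repair flow from those articles looks roughly like this (a hedged sketch, not exact commands from the articles; the CTID and ploop device name are placeholders, and the device actually in use is printed by ploop mount):

# Stop the container and attach its ploop image without mounting the filesystem
vzctl stop $CTID
ploop mount /vz/private/$CTID/root.hdd/DiskDescriptor.xml
# Use the device reported by the command above (e.g. /dev/ploopNNNNN), then check the inner fs
e2fsck -fy /dev/ploopNNNNNp1
# Detach the image and start the container again
ploop umount /vz/private/$CTID/root.hdd/DiskDescriptor.xml
vzctl start $CTID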
The first thing we noticed, in /var/log/messages on the host machine, is that this happens right when the VPS goes read-only (note the timestamps, similar to the OP's issue):
Aug 21 02:00:04 ovz7-3-taos pcompact[29446]: {"operation":"pcompactStart", "uuid":"fa79f45c-5f32-4b6c-8ce5-9d4a012e43c8", "disk_id":0, "task_id":"290cde3e-e52f-4e29-9f51-a9db197887eb", "ploop_size":98078, "image_size":93758, "data_size":34888, "balloon_size":280802, "rate":60.0, "config_dry":0, "config_threhshold":10}
Aug 21 02:00:11 ovz7-3-taos pcompact[29446]: {"operation":"pcompactFinish", "uuid":"fa79f45c-5f32-4b6c-8ce5-9d4a012e43c8", "disk_id":0, "task_id":"290cde3e-e52f-4e29-9f51-a9db197887eb", "was_compacted":1, "ploop_size":98078, "stats_before": {"image_size":93758, "data_size":34888, "balloon_size":280802}, "stats_after": {"image_size":93758, "data_size":34888, "balloon_size":280802},"time_spent":"7.016s", "result":-1}
The next thing we noticed, after seeing the above error, is that the pcompact.log has the following:
2020-08-21T02:00:04-0600 pcompact : Inspect fa79f45c-5f32-4b6c-8ce5-9d4a012e43c8
2020-08-21T02:00:04-0600 pcompact : Inspect /vz/private/fa79f45c-5f32-4b6c-8ce5-9d4a012e43c8/root.hdd/DiskDescriptor.xml
2020-08-21T02:00:04-0600 pcompact : ploop=98078MB image=93758MB data=34888MB balloon=280802MB
2020-08-21T02:00:04-0600 pcompact : Rate: 60.0 (threshold=10)
2020-08-21T02:00:04-0600 pcompact : Start compacting (to free 53965MB)
2020-08-21T02:00:04-0600 : Start defrag dev=/dev/ploop43779p1 mnt=/vz/root/fa79f45c-5f32-4b6c-8ce5-9d4a012e43c8 blocksize=2048
2020-08-21T02:00:11-0600 : Error in wait_pid (balloon.c:962): The /usr/sbin/e4defrag2 process failed with code 1
2020-08-21T02:00:11-0600 : /usr/sbin/e4defrag2 exited with error
2020-08-21T02:00:11-0600 : Trying to find free extents bigger than 0 bytes granularity=1048576
2020-08-21T02:00:11-0600 : Error in ploop_trim (balloon.c:892): Can't trim file system: Input/output error
2020-08-21T02:00:11-0600 pcompact : ploop=98078MB image=93758MB data=34888MB balloon=280802MB
2020-08-21T02:00:11-0600 pcompact : Stats: uuid=fa79f45c-5f32-4b6c-8ce5-9d4a012e43c8 ploop_size=98078MB image_size_before=93758MB image_size_after=93758MB compaction_time=7.016s type=online
2020-08-21T02:00:11-0600 pcompact : End compacting
This is basically identical to what's being discussed in this thread. We've just spun up a new host machine with a fresh OS install, and it looks like the newest ISO still has the old kernel version (vz7.151.14), so we've applied the patch as discussed here.
What I'd like to discuss:
1) Others that have had this same issue, and have applied the kernel update, did this fix your issues?
2) We have several other production host machines, and it's going to take a lot of moving things around before we can safely update their kernels. We're working on this, but in the meantime, is there a way to ensure this doesn't happen?
It looks like the initial error is happening during the pcompact defrag, which we see can be disabled as per https://docs.openvz.org/openvz_command_line_reference.webhelp/_pcompact_conf.html . Perhaps temporarily disabling it until we can get the kernels updated would prevent the issue from happening. Or perhaps we could disable pcompact altogether. Any thoughts or suggestions on this?
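In case it is useful to others, a rough sketch of pausing the scheduled runs (hedged: we have not confirmed the exact cron filename, so locate it first):

# Find where pcompact is scheduled (the exact filename may differ between installs)
grep -rl pcompact /etc/cron* 2>/dev/null
# Then comment out the pcompact line in the file it reports, or move the file aside, e.g.:
# mv /etc/cron.d/pcompact /root/pcompact.cron.disabled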
Thanks for the wonderful software and the great community behind it!
Re: OpenVZ 7 containers crashing with ext4 errors [message #53685 is a reply to message #53681]
Wed, 26 August 2020 11:46
khorenko
noc.r wrote on Mon, 24 August 2020 17:11: I have a few questions, if you don't mind and know the answer:
* Why is the kpatch-kmod package not available for my kernel? is there some kind of issue on my end?
* Is it supposed to be available for my kernel?
* Finally, any way to test that the patch works other than wait for the issues to appear?
Well, Zhenya has covered the first 2 questions, AFAIS,
so only the last question is left here.
And - no, I do not see any simple way for verification, because:
1) the kpatch patch contains 2 mainstream commits:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=812c0cab2c0dfad977605dbadf9148490ca5d93f
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4134f5c88dcd
And verification of those patches is not so easy.
But the most important reason is 2):
2) Even if we verify that those patches fix the problem they are intended to fix,
this does not mean it will 100% fix your exact problem.
We think so, but it is not 100% proven.
So, please let us know if you experience issues when the kpatch patch is loaded.
Thank you.
If your problem is solved - please, report it!
It's even more important than reporting the problem itself...
Re: OpenVZ 7 containers crashing with ext4 errors [message #53686 is a reply to message #53682]
Wed, 26 August 2020 12:02
khorenko
nathan.brownrice wrote on Mon, 24 August 2020 19:48: What I'd like to discuss:
1) Others that have had this same issue, and have applied the kernel update, did this fix your issues?
We would also be very interested in feedback on the patch.
nathan.brownrice wrote on Mon, 24 August 2020 19:48: 2) We have several other production host machines, and it's going to take a lot of moving things around before we can safely update their kernels.
You can install the kpatch patch from this thread on one node and check if it helps - it does not require a Node reboot.
nathan.brownrice wrote on Mon, 24 August 2020 19:48: It looks like the initial error is happening during the pcompact defrag, which we see can be disabled as per https://docs.openvz.org/openvz_command_line_reference.webhelp/_pcompact_conf.html . Perhaps temporarily disabling it until we can get the kernels updated would prevent the issue from happening. Or perhaps we could disable pcompact altogether. Any thoughts or suggestions on this?
Maybe you are right and pcompact significantly increases the chances of the issue triggering.
But the truth is - until you (or someone else, if you are 100% sure you have exactly the same issue) test the patch,
you cannot be sure the issue is ever fixed, so you could wait for "updated kernels" forever.
I mean - something will be fixed in newer kernels, but it could easily not be a fix for your particular issue.
So I really suggest installing the kpatch patch and checking if it helps.
After that you will know that you (and we) are on the right track, at least.
BTW, if someone prefers a full kernel instead of installing the kpatch patch,
here you are:
http://fe.virtuozzo.com/a3001136c32272a6889092b16af03f64/
This kernel contains a dozen patches compared to the vz7.151.14 kernel, but all of them are important and most of them are already released as ReadyKernel patches,
so this kernel can be considered a stable one.
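Either way, a quick sketch for confirming what a node is actually running (assuming the kpatch tooling mentioned in this thread is installed; output formats may vary):

# Running kernel version
uname -r
# Live patches currently loaded, if any
kpatch list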
If your problem is solved - please, report it!
It's even more important than reporting the problem itself...
Re: OpenVZ 7 containers crashing with ext4 errors [message #53694 is a reply to message #53655]
Fri, 11 September 2020 14:08
allan.talver
Hello,
Wanted to give a short update and ask a couple of questions.
First of all, last night one of the virtual containers on a non-patched host went read-only. However, unlike the previous cases, pcompact was not involved this time. We actually have the pcompact cron disabled and we trigger it manually on the nodes that are in our test sample. The errors in the messages log were:
Sep 11 00:34:00 server-n697 kernel: bash (639778): drop_caches: 3
Sep 11 00:34:36 server-n697 systemd: Started Session c185489 of user root.
Sep 11 00:34:39 server-n697 kernel: EXT4-fs error (device ploop35478p1) in ext4_free_blocks:4933: Out of memory
Sep 11 00:34:39 server-n697 kernel: Aborting journal on device ploop35478p1-8.
Sep 11 00:34:39 server-n697 kernel: EXT4-fs (ploop35478p1): Remounting filesystem read-only
Sep 11 00:34:39 server-n697 kernel: EXT4-fs error (device ploop35478p1) in ext4_ext_remove_space:3073: IO failure
Sep 11 00:34:39 server-n697 kernel: EXT4-fs error (device ploop35478p1) in ext4_ext_truncate:4692: Journal has aborted
Sep 11 00:34:39 server-n697 kernel: EXT4-fs error (device ploop35478p1) in ext4_reserve_inode_write:5358: Journal has aborted
Sep 11 00:34:39 server-n697 kernel: EXT4-fs error (device ploop35478p1) in ext4_truncate:4145: Journal has aborted
Sep 11 00:34:39 server-n697 kernel: EXT4-fs error (device ploop35478p1) in ext4_reserve_inode_write:5358: Journal has aborted
Sep 11 00:34:39 server-n697 kernel: EXT4-fs error (device ploop35478p1) in ext4_orphan_del:2731: Journal has aborted
Sep 11 00:34:39 server-n697 kernel: EXT4-fs error (device ploop35478p1) in ext4_reserve_inode_write:5358: Journal has aborted
Sep 11 00:34:39 server-n697 kernel: EXT4-fs (ploop35478p1): ext4_writepages: jbd2_start: 0 pages, ino 661650; err -30
Sep 11 00:34:41 server-n697 kernel: dd invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
Sep 11 00:34:41 server-n697 kernel: dd cpuset=3988 mems_allowed=0
Sep 11 00:34:41 server-n697 kernel: CPU: 6 PID: 273822 Comm: dd ve: 3988 Kdump: loaded Not tainted 3.10.0-1127.8.2.vz7.151.14 #1 151.14
Sep 11 00:34:41 server-n697 kernel: Hardware name: Supermicro X9DRW/X9DRW, BIOS 3.0c 03/24/2014
Sep 11 00:34:41 server-n697 kernel: Call Trace:
Sep 11 00:34:41 server-n697 kernel: [<ffffffffb95b67f1>] dump_stack+0x19/0x1b
Sep 11 00:34:41 server-n697 kernel: [<ffffffffb95b0fc6>] dump_header+0x90/0x229
Sep 11 00:34:41 server-n697 kernel: [<ffffffffb8fd7076>] ? find_lock_task_mm+0x56/0xc0
Sep 11 00:34:41 server-n697 kernel: [<ffffffffb8fd7dad>] oom_kill_process+0x47d/0x640
Sep 11 00:34:41 server-n697 kernel: [<ffffffffb90040fe>] ? get_task_oom_score_adj+0xee/0x100
Sep 11 00:34:41 server-n697 kernel: [<ffffffffb8fd7213>] ? oom_badness+0x133/0x1e0
Sep 11 00:34:41 server-n697 kernel: [<ffffffffb905f509>] mem_cgroup_oom_synchronize+0x4b9/0x510
Sep 11 00:34:41 server-n697 kernel: [<ffffffffb8fd84c3>] pagefault_out_of_memory+0x13/0x50
Sep 11 00:34:41 server-n697 kernel: [<ffffffffb95af06d>] mm_fault_error+0x6a/0x157
Sep 11 00:34:41 server-n697 kernel: [<ffffffffb95c49a1>] __do_page_fault+0x491/0x500
Sep 11 00:34:41 server-n697 kernel: [<ffffffffb95c4a45>] do_page_fault+0x35/0x90
Sep 11 00:34:41 server-n697 kernel: [<ffffffffb95c0778>] page_fault+0x28/0x30
Sep 11 00:34:41 server-n697 kernel: Task in /machine.slice/3988 killed as a result of limit of /machine.slice/3988
Sep 11 00:34:41 server-n697 kernel: memory: usage 4192232kB, limit 4194304kB, failcnt 13429301
Sep 11 00:34:41 server-n697 kernel: memory+swap: usage 4325376kB, limit 4325376kB, failcnt 31733768989
Sep 11 00:34:41 server-n697 kernel: kmem: usage 36356kB, limit 9007199254740988kB, failcnt 0
Sep 11 00:34:41 server-n697 kernel: Memory cgroup stats for /machine.slice/3988: rss_huge:479232KB mapped_file:33780KB shmem:84KB slab_unreclaimable:9776KB swap:133144KB cache:3459360KB rss:696520KB slab_reclaimable:10664KB inactive_anon:199280KB active_anon:497324KB inactive_file:597760KB active_file:2861360KB unevictable:0KB
Sep 11 00:34:41 server-n697 kernel: Memory cgroup out of memory: Kill process 671560 (mysqld) score 87 or sacrifice child
Sep 11 00:34:41 server-n697 kernel: Killed process 581188 (mysqld) in VE "3988", UID 116, total-vm:2820740kB, anon-rss:376664kB, file-rss:0kB, shmem-rss:0kB
Sep 11 00:34:53 server-n697 systemd: Started Session c185490 of user root.
Sep 11 00:35:01 server-n697 kernel: bash (639778): drop_caches: 3
Please note that the container was running a script that creates and deletes random files; we use it to generate read-write activity for testing purposes. Of course, the script also failed after the filesystem was switched to read-only mode.
Do you think that the cause of this error is related to the same bug that the kernel patch should fix? Meaning that pcompact is not part of the problem, but just happens to amplify the issue?
Secondly, we see that while pcompact is running, the virtual container's disk utilisation fluctuates rapidly and the disk also becomes 100% full several times. Is that expected behaviour during a pcompact run? As a result, we have seen some applications (like MySQL) throwing errors saying they are unable to write to disk. It feels like this could cause data corruption.
Thirdly, it is worth noting that we have not yet seen a container switch to read-only on a node that has the kernel patch applied. But even on that node, the disk showed as full while pcompact ran, so the concern regarding potential data corruption still applies.
Re: OpenVZ 7 containers crashing with ext4 errors [message #53704 is a reply to message #53703]
Mon, 05 October 2020 10:22
eshatokhin
allan.talver wrote on Mon, 05 October 2020 05:56: Is it reasonable to expect that the fix will be part of any future kernels after 3.10.0-1127.8.2.vz7.151.14?
The patches were included in kernel 3.10.0-1127.18.2.vz7.163.2. If they actually fix your issue, updating the kernel to that or a newer version (when it is available) could help.
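A minimal sketch of picking up such a kernel once it reaches the repositories (the vzkernel package name is assumed here; a reboot into the new kernel is required):

# Check whether a newer kernel is available, install it, and reboot into it
yum check-update vzkernel
yum update vzkernel
reboot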
[Updated on: Mon, 05 October 2020 10:24]
Re: OpenVZ 7 containers crashing with ext4 errors [message #53707 is a reply to message #53655]
Wed, 14 October 2020 08:02
allan.talver
Hi all,
A short update and a few questions:
* We now have roughly 50 nodes running the vz7.151.14 kernel with the patch, and we have not had any containers turning read-only. I think this is good.
* However, we still have another issue related to pcompact and, as we have increased the number of nodes on vz7, it is also causing us more trouble. The issue is that while pcompact and defrag are running, the disk space utilisation of the container starts fluctuating rapidly, and during that period it is also shown as 100% full several times. In turn, when the disk shows as full, applications are unable to write to it, which causes unexpected errors and application crashes (for example, MySQL and Redis instances configured with persistence). Are we the only ones experiencing this behaviour? Is there a way to avoid it? For now we have stopped pcompact again because it has caused several production incidents.
* A question related to the previous point, but one I would like to address separately: what are the downsides if we don't run pcompact at all? I think one obvious outcome is that we won't be able to release unused space in the container disk images back to the host operating system. But we don't see that as a big issue because we normally do not overcommit hardware node disk space anyway (meaning that the total size of the container disks is below the size of the disk space on the host node). In this scenario, could we just leave pcompact disabled?
* We noticed that there is now a new kernel version, vz7.158.8, which is getting installed on the newer nodes. Of course, the patch that has been shared in this thread is not usable for this version. But I believe that the kernel fix is not yet in that version of the kernel. Is it possible to have another patch for vz7.158.8?
Thanks and best regards,
Allan
Re: OpenVZ 7 containers crashing with ext4 errors [message #53708 is a reply to message #53707]
Wed, 14 October 2020 08:27
eshatokhin
allan.talver wrote on Wed, 14 October 2020 08:02
* We noticed that now there is new kernel version vz7.158.8 which is getting installed on the newer nodes. Of course the patch that has been shared in this thread here is not usable for this version. But I believe that the kernel fix is not yet in that version of the kernel. Is it possible to have another patch for vz7.158.8?
I have checked - kernel vz7.158.8 also has the needed fixes for ext4, the same as the in-development vz7.163.x series. Live patches with these fixes are not needed there.