Re: Linux: balloon_process() causing workqueue lockups?
On 27.08.21 11:01, Jan Beulich wrote:
> Hello,
>
> ballooning down Dom0 by about 16G in one go once in a while causes:
>
> BUG: workqueue lockup - pool cpus=6 node=0 flags=0x0 nice=0 stuck for 64s!
> Showing busy workqueues and worker pools:
> workqueue events: flags=0x0
>   pwq 12: cpus=6 node=0 flags=0x0 nice=0 active=2/256 refcnt=3
>     in-flight: 229:balloon_process
>     pending: cache_reap
> workqueue events_freezable_power_: flags=0x84
>   pwq 12: cpus=6 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
>     pending: disk_events_workfn
> workqueue mm_percpu_wq: flags=0x8
>   pwq 12: cpus=6 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
>     pending: vmstat_update
> pool 12: cpus=6 node=0 flags=0x0 nice=0 hung=64s workers=3 idle: 2222 43
>
> I've tried to double check that this isn't related to my IOMMU work
> in the hypervisor, and I'm pretty sure it isn't. Looking at the function
> I see it has a cond_resched(), but aiui this won't help with further
> items in the same workqueue.
>
> Thoughts?

I'm seeing two possible solutions here:

1. After some time (1 second?) in balloon_process() setup a new
   workqueue activity and return (similar to EAGAIN, but without
   increasing the delay).

2. Don't use a workqueue for the ballooning activity, use a kernel
   thread instead.

I have a slight preference for 2, even if the resulting patch will be
larger. 1 is only working around the issue and it is hard to find a
really good timeout value.

I'd be fine to write a patch, but would prefer some feedback which way
to go.

Juergen
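To make option 1 a bit more concrete, here is a small userspace model in
plain C. It is not kernel code and not a proposed patch: the queue is a
simple single-worker FIFO, and every name in it (sim_queue, sim_work,
units_before_short_work, ...) is made up for illustration. It also uses a
per-invocation unit budget where the real thing would use a time limit.
The only point is to show that a long-running item which stops after a
budget and requeues itself lets the pending items behind it run much
earlier.

```c
#include <assert.h>
#include <stddef.h>

#define SIM_QUEUE_CAP 8

struct sim_queue;

/* Hypothetical work item; not the kernel's struct work_struct. */
struct sim_work {
    void (*fn)(struct sim_queue *q, struct sim_work *w);
    int remaining;   /* units left to process (long-running item only) */
    int budget;      /* max units per invocation (long-running item only) */
    int started_at;  /* q->units_done when the item first ran, -1 before */
};

/* Single-worker FIFO standing in for one worker pool. */
struct sim_queue {
    struct sim_work *ring[SIM_QUEUE_CAP];
    size_t head, count;
    int units_done;  /* total units processed so far */
};

static void sim_queue_work(struct sim_queue *q, struct sim_work *w)
{
    assert(q->count < SIM_QUEUE_CAP);
    q->ring[(q->head + q->count) % SIM_QUEUE_CAP] = w;
    q->count++;
}

/* Run items in FIFO order until the queue is empty. */
static void sim_run(struct sim_queue *q)
{
    while (q->count > 0) {
        struct sim_work *w = q->ring[q->head];

        q->head = (q->head + 1) % SIM_QUEUE_CAP;
        q->count--;
        if (w->started_at < 0)
            w->started_at = q->units_done;
        w->fn(q, w);
    }
}

/* Models the long-running item: do at most 'budget' units, then requeue. */
static void long_work_fn(struct sim_queue *q, struct sim_work *w)
{
    int n;

    assert(w->budget > 0);
    n = w->remaining < w->budget ? w->remaining : w->budget;
    w->remaining -= n;
    q->units_done += n;
    if (w->remaining > 0)
        sim_queue_work(q, w);   /* go to the tail, behind pending items */
}

/* Models a short pending item such as a periodic housekeeping job. */
static void short_work_fn(struct sim_queue *q, struct sim_work *w)
{
    (void)q;
    (void)w;
}

/* Returns how many units the long item processed before the short one ran. */
static int units_before_short_work(int total, int budget)
{
    struct sim_queue q = { .head = 0, .count = 0, .units_done = 0 };
    struct sim_work big = { long_work_fn, total, budget, -1 };
    struct sim_work small = { short_work_fn, 0, 0, -1 };

    sim_queue_work(&q, &big);
    sim_queue_work(&q, &small);
    sim_run(&q);
    assert(big.remaining == 0);
    return small.started_at;
}
```

With a budget equal to the total (i.e. no slicing), the short item only
runs after all the work is done; with a smaller budget it runs after the
first slice. Picking that budget is the same problem as picking the
timeout value mentioned above, which is the main weakness of option 1.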