
[Xen-devel] [PATCH v2] Fix scheduler crash after s3 resume

Hi, I had to make another version of this patch, which also fixes the additional intermittent crash in vcpu_wake(). I would be grateful for comments and, if possible, an ack again.

This crash happened when vcpu_wake() was called between disable_nonboot_cpus() and enable_nonboot_cpus() on the S3 path (this sometimes happens, for example, from inside rcu_barrier()). It turns out the cause was vcpu_schedule_lock() accessing the per-cpu area, which has already been freed at this point by the cpu down callback in percpu.c.
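To make the hazard concrete, here is a minimal toy model in C, not the Xen code. All names (toy_percpu, toy_cpu_up, toy_cpu_down, toy_try_schedule_lock) are hypothetical. The model sets the pointer to NULL on cpu down so that the stale access can be detected; per the description above, the real path has no such check and simply touches freed memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define NR_CPUS 4u

/* Hypothetical stand-in for the per-cpu data that holds the schedule lock. */
struct toy_percpu {
    int schedule_lock;
};

static struct toy_percpu *percpu_area[NR_CPUS];

/* Model of the cpu up callback: allocate the per-cpu area if missing. */
int toy_cpu_up(unsigned int cpu)
{
    if (cpu >= NR_CPUS)
        return -1;
    if (percpu_area[cpu] == NULL)
        percpu_area[cpu] = calloc(1, sizeof(*percpu_area[cpu]));
    return percpu_area[cpu] != NULL ? 0 : -1;
}

/* Model of the cpu down callback: the per-cpu area is freed here. */
void toy_cpu_down(unsigned int cpu)
{
    assert(cpu < NR_CPUS);
    free(percpu_area[cpu]);
    percpu_area[cpu] = NULL; /* toy only: makes the stale access visible */
}

/*
 * Model of taking the schedule lock for a vcpu on 'cpu'.
 * Returns false when the per-cpu area is gone, which is the window
 * between disable_nonboot_cpus() and enable_nonboot_cpus() in the
 * report above.
 */
bool toy_try_schedule_lock(unsigned int cpu)
{
    if (cpu >= NR_CPUS || percpu_area[cpu] == NULL)
        return false;
    percpu_area[cpu]->schedule_lock = 1;
    return true;
}
```

In this model, a wake-up that arrives after toy_cpu_down(1) and before the next toy_cpu_up(1) finds no lock to take, which is the situation the patch avoids by migrating vcpus off the cpu first.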

I tried preserving the per-cpu area across cpu down/up (during S3), as well as testing whether the cpu is online in vcpu_wake() before acquiring this lock. Both helped a bit, but neither was fully successful. So I concluded it's probably not correct to let the scheduler run rampant during the disable_nonboot_cpus() / enable_nonboot_cpus() window on the S3 path, and made a new patch version. I tested it across many S3 iterations (on a Lenovo T520) with no problems. It should be fairly non-invasive, as it only touches the S3 path.

Changes from v1:
- modified cpu_disable_scheduler() (schedule.c) to run most of the stop-scheduler logic again (i.e. the vcpu migration). This is a partial revert of c-s 25079:d5ccb2d1dbd1. However, breaking domain vcpu affinities seems to be avoidable on this path, so I added a condition to avoid that during suspend

- instead of skipping the clearing of the cpu's bit in the cpupool0->cpu_valid cpumask on the suspend path, added a restore of that bit on the resume path. Moving this was needed because otherwise cpu_disable_scheduler() fails to migrate the vcpu (the cpu is still in the cpu_valid mask when the migration is attempted) and returns EAGAIN, and BUG() then fires in __cpu_disable()
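
The two changes above can be sketched with a small self-contained model. This is only an illustration under stated assumptions, not the attached patch: cpumasks are plain bitmasks, every toy_* name is hypothetical, and the model assumes that during suspend a vcpu may be parked temporarily on a cpu outside its affinity while the affinity itself is left untouched.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS  4u
#define ALL_CPUS ((1u << NR_CPUS) - 1u)

struct toy_vcpu {
    unsigned int processor; /* cpu the vcpu currently sits on */
    unsigned int affinity;  /* bitmask of cpus the vcpu may run on */
};

struct toy_pool {
    unsigned int cpu_valid; /* bitmask of cpus the scheduler may use */
};

/* Lowest set bit as a cpu number, or -1 for an empty mask. */
static int lowest_cpu(unsigned int mask)
{
    for (unsigned int cpu = 0; cpu < NR_CPUS; cpu++)
        if (mask & (1u << cpu))
            return (int)cpu;
    return -1;
}

/*
 * Toy version of the disable step: clear the cpu from cpu_valid first,
 * then move every vcpu off it. Returns 0 on success and -1 (standing in
 * for EAGAIN) if some vcpu could not be moved.
 */
int toy_cpu_disable_scheduler(struct toy_pool *pool, struct toy_vcpu *vcpus,
                              size_t n, unsigned int cpu, bool suspending)
{
    int ret = 0;

    assert(cpu < NR_CPUS);
    pool->cpu_valid &= ~(1u << cpu);

    for (size_t i = 0; i < n; i++) {
        struct toy_vcpu *v = &vcpus[i];
        unsigned int candidates;
        int target;

        if (v->processor != cpu)
            continue;

        candidates = v->affinity & pool->cpu_valid;
        if (candidates == 0) {
            /* Outside suspend the affinity is broken for good;
             * on the suspend path it is preserved. */
            if (!suspending)
                v->affinity = ALL_CPUS;
            candidates = pool->cpu_valid;
        }

        target = lowest_cpu(candidates);
        if (target < 0) {
            ret = -1; /* no valid cpu left to migrate to */
            continue;
        }
        v->processor = (unsigned int)target;
    }
    return ret;
}

/* Toy version of the resume step: make the cpu usable again. */
void toy_cpu_resume(struct toy_pool *pool, unsigned int cpu)
{
    assert(cpu < NR_CPUS);
    pool->cpu_valid |= 1u << cpu;
}
```

In this model a vcpu pinned to cpu 1 is moved to cpu 0 when cpu 1 is disabled for suspend, keeps its original affinity mask, and cpu 1 becomes valid again once toy_cpu_resume() sets its bit back.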

Signed-off-by: Tomasz Wroblewski <tomasz.wroblewski@xxxxxxxxxx>

Commit message:
Fix S3 resume regression after c-s 25079:d5ccb2d1dbd1. The regression causes either an intermittent crash in vcpu_wake() (an attempt to take vcpu_schedule_lock, which lives in the per-cpu area that has been freed at this point), or, in debug Xen, a more frequent ASSERT(!cpumask_empty(&cpus)...) firing in _csched_cpu_pick.

Fix this by reverting the hunk which turned off disabling the cpu scheduler on the suspend path. Additionally, avoid breaking domain vcpu affinities on the suspend path. On resume, restore the frozen cpus in the cpupool's cpu_valid mask, so they can once again be used by the scheduler.

Attachment: fix-suspend-scheduler-v2
Description: Text document

Xen-devel mailing list


