Xen project Mailing List

Re: [RFC PATCH] x86/p2m-pt: do type recalculations with p2m read lock

From: Roger Pau Monné <roger.pau@xxxxxxxxxx>

Date: Tue, 4 Apr 2023 11:49:14 +0200

Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=citrix.com; dmarc=pass action=none header.from=citrix.com; dkim=pass header.d=citrix.com; arc=none

Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=Sg2Fb3jl6e1BDtSDZUlN864Ch3WiFTsjN/pXDxcysgg=; b=hYEET9wrBbHsF0a/r/NhJmHJFBg7Bqjej7m2ZSaE/4vE2zujEFe2LxNPPG/dwdWbJAn/zYwPOos2jB8czjLnw3GNWZYep5tu4jsY+4ZX86piv4WwSpUONVLRo/ocn/eGMYomZFvvnKvhWG1eueh/ZEqdUq+kHpB7UOFbnhyqQWLSI88M0uL4gwEF0jl3ypRevLf5p/p5UT/OsbbhHSWPzYd855oWuqvUeMhvK98bX4aHmVnG4Pok9cibT6MH46gTw6Ia8HRz7gAP1cTk7zMnvgzKZgbmqy26uo2VCa006S9uxdtFvO2QtmAZObcjCH0f6H8Fl2OaWyN1yIAJ0y6XOg==

Arc-seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=A9p2QuxIUIdjPEf8QDeg7FWnVBRGJQ6aGWQakcXLDPEp333o/1J+isxpWifpMwyF/j2BnBvxC0ecSOFQJ6tGEZWQFwjRxgqkYVU8QzffK1k3uoqTYoG/FJNj8GMuou8VCwiNRxeQblWxZCu7DVtzXaArIsVjCJYTQvcLRMJiGFiqQL0HFgVphvfKI3GP+lT4jYR2immaXrABKFFqNkH5GCI1PEK6KkbvYCuIIb/pcOYVjHs93acDUkjQvpsjX7ITLcNyk/dEm9TVI5GtM7FKpWrbPlr0IUxUtWBDqiNXhj8xJ84mc6Aq+lYJ4gFRs4Zq2eLql+zBMlg7PT9jHifxyA==

Authentication-results: dkim=none (message not signed) header.d=none;dmarc=none action=none header.from=citrix.com;

Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxxx

Delivery-date: Tue, 04 Apr 2023 09:49:46 +0000

Ironport-data: A9a23:abX7gqgd5v+uiYSuq6vRI2rvX161RxEKZh0ujC45NGQN5FlHY01je htvXDiFPP6ON2TyftAnaN7l8xtVvcPUzdFkHFc9qiwxRCIb9cadCdqndUqhZCn6wu8v7q5Ex 55HNoSfdpBcolv0/ErF3m3J9CEkvU2wbuOgTrWCYmYpHlUMpB4J0XpLg/Q+jpNjne+3CgaMv cKai8DEMRqu1iUc3lg8sspvkzsy+qWi0N8klgZmP6sT4AeCzyB94K83fsldEVOpGuG4IcbiL wrz5OnR1n/U+R4rFuSknt7TGqHdauePVeQmoiM+t5mK2nCulARrukoIHKN0hXNsoyeIh7hMJ OBl7vRcf+uL0prkw4zxWzEAe8130DYvFLXveRBTuuTLp6HKnueFL1yDwyjaMKVBktubD12i+ tQzLTURNT3anNiIg6ORT+lgguVgAsn0adZ3VnFIlVk1DN4AaLWaGeDv2oUd2z09wMdTAfzZe swVLyJ1awjNaAFOPVFRD48imOCvhT/0dDgwRFC9/PJrpTSMilIvluSyWDbWUoXiqcF9hEGXq 3iA523kKhobKMae2XyO9XfEaurnxHulBt1KSOXhnhJsqFShlnc9WDY6bmqcuMPnsFObGM1cA lNBr0LCqoB3riRHVOLVXRe1vXqFtR40QMdLHqsx7wTl4rXQyxaUAC4DVDEpQMwrsoo6SCIn0 neNnsj1Hnp/vbuNU3Wf+7yI6zSoNkAowXQqYCYFSU4A/IPlqYRq1BbXFI4/T+iyk8H/Hiz2z 3aSti8iir4PjMkNkaKm4VTAhDHqrZ/MJuIo2jjqsquexlsRTOaYi0aAsDA3Md4owF6lc2S8

Ironport-hdrordr: A9a23:QfaZqqybX/sjhG69DjMEKrPw6L1zdoMgy1knxilNoHxuH/Bw9v re+cjzsCWftN9/Yh4dcLy7VpVoIkmsl6Kdg7NwAV7KZmCP1FdARLsI0WKI+UyCJ8SRzI9gPa cLSdkFNDXzZ2IK8PoTNmODYqodKNrsytHWuQ/HpU0dKT2D88tbnn9E4gDwKDwQeCB2QaAXOb C7/cR9qz+paR0sH7+G7ilsZZmkmzXT/qiWGCI7Ow==

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On Mon, Apr 03, 2023 at 06:07:26PM +0200, Jan Beulich wrote: > On 03.04.2023 17:38, Roger Pau Monné wrote: > > On Mon, Apr 03, 2023 at 05:32:39PM +0200, Jan Beulich wrote: > >> On 03.04.2023 12:14, Roger Pau Monne wrote: > >>> Global p2m type recalculations (as triggered by logdirty) can create > >>> so much contention on the p2m lock that simple guest operations like > >>> VCPUOP_set_singleshot_timer on guests with a high amount of vCPUs (32) > >>> will cease to work in a timely manner, up to the point that Linux > >>> kernel versions that sill use the VCPU_SSHOTTMR_future flag with the > >>> singleshot timer will cease to work: > >>> > >>> [ 82.779470] CE: xen increased min_delta_ns to 1000000 nsec > >>> [ 82.793075] CE: Reprogramming failure. Giving up > >>> [ 82.779470] CE: Reprogramming failure. Giving up > >>> [ 82.821864] CE: xen increased min_delta_ns to 506250 nsec > >>> [ 82.821864] CE: xen increased min_delta_ns to 759375 nsec > >>> [ 82.821864] CE: xen increased min_delta_ns to 1000000 nsec > >>> [ 82.821864] CE: Reprogramming failure. Giving up > >>> [ 82.856256] CE: Reprogramming failure. Giving up > >>> [ 84.566279] CE: Reprogramming failure. Giving up > >>> [ 84.649493] Freezing user space processes ... > >>> [ 130.604032] INFO: rcu_sched detected stalls on CPUs/tasks: { 14} > >>> (detected by 10, t=60002 jiffies, g=4006, c=4005, q=14130) > >>> [ 130.604032] Task dump for CPU 14: > >>> [ 130.604032] swapper/14 R running task 0 0 1 > >>> 0x00000000 > >>> [ 130.604032] Call Trace: > >>> [ 130.604032] [<ffffffff90160f5d>] ? > >>> rcu_eqs_enter_common.isra.30+0x3d/0xf0 > >>> [ 130.604032] [<ffffffff907b9bde>] ? default_idle+0x1e/0xd0 > >>> [ 130.604032] [<ffffffff90039570>] ? arch_cpu_idle+0x20/0xc0 > >>> [ 130.604032] [<ffffffff9010820a>] ? cpu_startup_entry+0x14a/0x1e0 > >>> [ 130.604032] [<ffffffff9005d3a7>] ? start_secondary+0x1f7/0x270 > >>> [ 130.604032] [<ffffffff900000d5>] ? start_cpu+0x5/0x14 > >>> [ 549.654536] INFO: rcu_sched detected stalls on CPUs/tasks: { 26} > >>> (detected by 24, t=60002 jiffies, g=6922, c=6921, q=7013) > >>> [ 549.655463] Task dump for CPU 26: > >>> [ 549.655463] swapper/26 R running task 0 0 1 > >>> 0x00000000 > >>> [ 549.655463] Call Trace: > >>> [ 549.655463] [<ffffffff90160f5d>] ? > >>> rcu_eqs_enter_common.isra.30+0x3d/0xf0 > >>> [ 549.655463] [<ffffffff907b9bde>] ? default_idle+0x1e/0xd0 > >>> [ 549.655463] [<ffffffff90039570>] ? arch_cpu_idle+0x20/0xc0 > >>> [ 549.655463] [<ffffffff9010820a>] ? cpu_startup_entry+0x14a/0x1e0 > >>> [ 549.655463] [<ffffffff9005d3a7>] ? start_secondary+0x1f7/0x270 > >>> [ 549.655463] [<ffffffff900000d5>] ? start_cpu+0x5/0x14 > >>> [ 821.888478] INFO: rcu_sched detected stalls on CPUs/tasks: { 26} > >>> (detected by 24, t=60002 jiffies, g=8499, c=8498, q=7664) > >>> [ 821.888596] Task dump for CPU 26: > >>> [ 821.888622] swapper/26 R running task 0 0 1 > >>> 0x00000000 > >>> [ 821.888677] Call Trace: > >>> [ 821.888712] [<ffffffff90160f5d>] ? > >>> rcu_eqs_enter_common.isra.30+0x3d/0xf0 > >>> [ 821.888771] [<ffffffff907b9bde>] ? default_idle+0x1e/0xd0 > >>> [ 821.888818] [<ffffffff90039570>] ? arch_cpu_idle+0x20/0xc0 > >>> [ 821.888865] [<ffffffff9010820a>] ? cpu_startup_entry+0x14a/0x1e0 > >>> [ 821.888917] [<ffffffff9005d3a7>] ? start_secondary+0x1f7/0x270 > >>> [ 821.888966] [<ffffffff900000d5>] ? start_cpu+0x5/0x14 > >>> > >>> This is obviously undesirable. One way to bodge the issue would be to > >>> ignore VCPU_SSHOTTMR_future, but that's a deliberate breakage of the > >>> hypercall ABI. > >>> > >>> Instead lower the contention in the lock by doing the recalculation > >>> with the lock in read mode. This is safe because only the flags/type > >>> are changed, there's no PTE mfn change in the AMD recalculation logic. > >>> The Intel (EPT) case is likely more complicated, as superpage > >>> splitting for diverging EMT values must be done with the p2m lock in > >>> taken in write mode. > >>> > >>> Signed-off-by: Roger Pau Monné <roger.pau@xxxxxxxxxx> > >>> --- > >>> I'm unsure whether such modification is fully safe: I think changing > >>> the flags/type should be fine: the PTE write is performed using > >>> safwrite_p2m_entry() which must be atomic (as the guest is still > >>> running and accessing the page tables). I'm slightly worried about > >>> all PTE readers not using atomic accesses to do so (ie: pointer > >>> returned by p2m_find_entry() should be read atomicallly), and code > >>> assuming that a gfn type cannot change while holding the p2m lock in > >>> read mode. > >> > >> Coming back to this: Yes, I think reads (at least the ones in do_recalc() > >> which can now be done in parallel) will need to be tightened if this is a > >> road we want to follow. > > > > There are likely a lot of reads under the p2m read lock outside of > > do_recalc() that will ideally need to be switched to use atomic > > accesses also? > > Possibly, perhaps even likely. I specifically said "at least". But ones > clearly on write-locked paths could probably be left alone. > > > I'm open to suggestions to other ways to get this sorted. And that's > > a guest with 'just' 32 vCPUs, as we go up the contention on the p2m > > lock during recalcs/misconfigs is going to increase massively. > > I'm afraid I don't have any really good idea, but I'm wondering whether > trylock with (almost?) immediate exit back to guest might make this any > better. At least the guest could then take interrupts before the insn > is retried. Another thought in this direction would be to have a variant > of trylock which "senses" how contended the lock is, to spin if it's the > first one to wait, but exit (fail) otherwise. Using trylock in the recalc path could starve those quite badly, as readers can acquire the lock concurrently. Also we would loose the fairness. Using trylock on VCPUOP_set_singleshot_timer (in order to fetch the data from the guest provided pointer) would lead to the same situation AFAICT, as guests using VCPU_SSHOTTMR_future will likely see the time expired by the point the hypercall checks it. One thing that I've noticed is that copy_from_guest for HVM (__hvm_copy()) takes and releases the p2m lock in read mode at several points. It would likely be better if the whole GVA -> MFN translation was resolved inside of a single read-locked region, as that would avoid stalls in the middle of the operation, but I don't think this will solve the issue at hand. Thanks, Roger.

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.