Xen project Mailing List

Re: [PATCH 2/2] xen/mm: limit non-scrubbed allocations to a specific order

From: Roger Pau Monné <roger.pau@xxxxxxxxxx>

Date: Tue, 13 Jan 2026 15:01:16 +0100

Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=citrix.com; dmarc=pass action=none header.from=citrix.com; dkim=pass header.d=citrix.com; arc=none

Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector10001; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=2DuMkhpSsc0/+mMpGd2m3Wpu6gUezUsD1TdTB/KCR+k=; b=hdrWRH7Z9cpPhHIOyOYNCH7z9KkobCMiKWxWNmzBOqc+LxFojtq7W9Nz6ijGo9xeoPBPS+wUefYWY39ywkCon5LDzyUW84OxqPHG9NZxiqRVqWtnavpl8Ve9E/+LnnGfBKgKV2xmOChOzO1qXFFH9Dyo9qOc1WlWloNLA9he6sG0oLTdkZjfGTbYM3bTAskz1bo2TOt2kfhKNTBoQkSFRT0uUag1IJwhxpaLL4siP4RU3WyQCFjsE6r6K2sBMHO8IC0FSusec6rmzpx3gEKsM5sWj9tjNfKubOdLGBJGPRjdlVV/0gM40eZt0GE9yJOsEktXVK6nnGoNKJJCh7mFHw==

Arc-seal: i=1; a=rsa-sha256; s=arcselector10001; d=microsoft.com; cv=none; b=mQ1M/waEsnI8GeQTLxFT3j1ZtpkNd2V9bNPGe1BuPoGI0I/yBFH0QunLAKvSpAeB94ahkYyiO2hlz5dWQfsuaCUXEfh9Sk2/q6skpWVoOIeWscDh2siClyXyE98WKWZF44eL83OW14NFQYi/FftUtXFab1dLEmHehOAZ0qTDH+6FPjCCB9x45fzrwurXJHU53OQ9K+eQbzCzWC/KO0LDThU+xtyE5dABcrGf7LryVnjiOov9rn941uNR2CTEYmO73Eyz4H3Fn1kob3aEmNmQlrzaiOwjJbYljoAzsqOFiuXa2oJmA2g/l9hYX8ZO5aW8eeUZfU9GUpliOdRZO0h3oA==

Authentication-results: dkim=none (message not signed) header.d=none;dmarc=none action=none header.from=citrix.com;

Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Anthony PERARD <anthony.perard@xxxxxxxxxx>, Michal Orzel <michal.orzel@xxxxxxx>, Julien Grall <julien@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxxx

Delivery-date: Tue, 13 Jan 2026 14:01:40 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On Fri, Jan 09, 2026 at 12:19:26PM +0100, Jan Beulich wrote: > On 08.01.2026 18:55, Roger Pau Monne wrote: > > The current model of falling back to allocate unscrubbed pages and scrub > > them in place at allocation time risks triggering the watchdog: > > > > Watchdog timer detects that CPU55 is stuck! > > ----[ Xen-4.17.5-21 x86_64 debug=n Not tainted ]---- > > CPU: 55 > > RIP: e008:[<ffff82d040204c4a>] clear_page_sse2+0x1a/0x30 > > RFLAGS: 0000000000000202 CONTEXT: hypervisor (d0v12) > > [...] > > Xen call trace: > > [<ffff82d040204c4a>] R clear_page_sse2+0x1a/0x30 > > [<ffff82d04022a121>] S clear_domain_page+0x11/0x20 > > [<ffff82d04022c170>] S common/page_alloc.c#alloc_heap_pages+0x400/0x5a0 > > [<ffff82d04022d4a7>] S alloc_domheap_pages+0x67/0x180 > > [<ffff82d040226f9f>] S common/memory.c#populate_physmap+0x22f/0x3b0 > > [<ffff82d040228ec8>] S do_memory_op+0x728/0x1970 > > > > The maximum allocation order on x86 is limited to 18, that means allocating > > and scrubbing possibly 1G worth of memory in 4K chunks. > > > > Start by limiting dirty allocations to CONFIG_DOMU_MAX_ORDER, which is > > currently set to 2M chunks. However such limitation might cause > > fragmentation in HVM p2m population during domain creation. To prevent > > that introduce some extra logic in populate_physmap() that fallback to > > preemptive page-scrubbing if the requested allocation cannot be fulfilled > > and there's scrubbing work to do. This approach is less fair than the > > current one, but allows preemptive page scrubbing in the context of > > populate_physmap() to attempt to ensure unnecessary page-shattering. > > > > Fixes: 74d2e11ccfd2 ("mm: Scrub pages in alloc_heap_pages() if needed") > > Signed-off-by: Roger Pau Monné <roger.pau@xxxxxxxxxx> > > --- > > I'm not particularly happy with this approach, as it doesn't guarantee > > progress for the callers. IOW: a caller might do a lot of scrubbing, just > > to get it's pages stolen by a different concurrent thread doing > > allocations. However I'm not sure there's a better solution than resorting > > to 2M allocations if there's not enough free memory that is scrubbed. > > > > I'm having trouble seeing where we could temporary store page(s) allocated > > that need to be scrubbed before being assigned to the domain, in a way that > > can be used by continuations, and that would allow Xen to keep track of > > them in case the operation is never finished. IOW: we would need to > > account for cleanup of such temporary stash of pages in case the domain > > never completes the hypercall, or is destroyed midway. > > How about stealing a bit from the range above MEMOP_EXTENT_SHIFT to > indicate that state, with the actual page (and order plus scrub progress) > recorded in the target struct domain? Actually, maybe such an indicator > isn't needed at all: If the next invocation (continuation or not) finds > an in-progress allocation, it could simply use that rather than doing a > real allocation. (What to do if this isn't a continuation is less clear: > We could fail such requests [likely not an option unless we can reliably > tell original requests from continuations], or split the allocation if > the request is smaller, or free the allocation to then take the normal > path.) All of which of course only for "foreign" requests. > > If the hypercall is never continued, we could refuse to unpause the > domain (with the allocation then freed normally when the domain gets > destroyed). I have done something along this lines, introduced a couple of stashing variables in the domain struct and stored the progress of scrubbing in there. > As another alternative, how about returning unscrubbed pages altogether > when it's during domain creation, requiring the tool stack to do the > scrubbing (potentially allowing it to skip some of it when pages are > fully initialized anyway, much like we do for Dom0 iirc)? It's going to be difficult for the toolstack to figure out which pages need to be scrubbed, we would need a way to tell it the unscrubbed regions in a domain physmap? > > --- a/xen/common/memory.c > > +++ b/xen/common/memory.c > > @@ -279,6 +279,18 @@ static void populate_physmap(struct memop_args *a) > > > > if ( unlikely(!page) ) > > { > > + nodeid_t node = MEMF_get_node(a->memflags); > > + > > + if ( memory_scrub_pending(node) || > > + (node != NUMA_NO_NODE && > > + !(a->memflags & MEMF_exact_node) && > > + memory_scrub_pending(node = NUMA_NO_NODE)) ) > > + { > > + scrub_free_pages(node); > > + a->preempted = 1; > > + goto out; > > + } > > At least for order 0 requests there's no point in trying this. With the > current logic, actually for orders up to MAX_DIRTY_ORDER. Yes, otherwise we might force the CPU to do some scrubbing work when it won't satisfy it's allocation request anyway. > Further, from a general interface perspective, wouldn't we need to do the > same for at least XENMEM_increase_reservation? Possibly yes. TBH I would also be fine with strictly limiting XENMEM_increase_reservation to 2M order extents, even for the control domain. The physmap population is the only that actually requires bigger extents. > > @@ -1115,7 +1139,16 @@ static struct page_info *alloc_heap_pages( > > if ( test_and_clear_bit(_PGC_need_scrub, &pg[i].count_info) ) > > { > > if ( !(memflags & MEMF_no_scrub) ) > > + { > > scrub_one_page(&pg[i], cold); > > + /* > > + * Use SYS_STATE_smp_boot explicitly; ahead of that > > state > > + * interrupts are disabled. > > + */ > > + if ( system_state == SYS_STATE_smp_boot && > > + !(dirty_cnt & 0xff) ) > > + process_pending_softirqs(); > > + } > > > > dirty_cnt++; > > } > > Yet an alternative consideration: When "cold" is true, couldn't we call > process_pending_softirqs() like you do here ( >= SYS_STATE_smp_boot then > of course), without any of the other changes? Of course that's worse > than a proper continuation, especially from the calling domain's pov. Overall I think it would be best to solve this with hypercall continuations, in case we even want to support pages bigger than 1G. I know this has a lot of other implications, but would be nice to not add more baggage here. The "cold" case is the typical scenario for domain building, and we would block a control domain CPU for more than 5s which seems undesirable. > > @@ -223,6 +224,14 @@ struct npfec { > > #else > > #define MAX_ORDER 20 /* 2^20 contiguous pages */ > > #endif > > + > > +/* Max order when scrubbing pages at allocation time. */ > > +#ifdef CONFIG_DOMU_MAX_ORDER > > +# define MAX_DIRTY_ORDER CONFIG_DOMU_MAX_ORDER > > +#else > > +# define MAX_DIRTY_ORDER 9 > > +#endif > > Using CONFIG_DOMU_MAX_ORDER rather than the command line overridable > domu_max_order means people couldn't even restore original behavior. We likely want a separate command line option for this one, but given your comments above we might want to explore other options. Thanks, Roger.

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.