Re: [PATCH 2/2] xen/mm: limit non-scrubbed allocations to a specific order
On Fri, Jan 09, 2026 at 12:19:26PM +0100, Jan Beulich wrote:
> On 08.01.2026 18:55, Roger Pau Monne wrote:
> > The current model of falling back to allocate unscrubbed pages and scrub
> > them in place at allocation time risks triggering the watchdog:
> >
> > Watchdog timer detects that CPU55 is stuck!
> > ----[ Xen-4.17.5-21 x86_64 debug=n Not tainted ]----
> > CPU: 55
> > RIP: e008:[<ffff82d040204c4a>] clear_page_sse2+0x1a/0x30
> > RFLAGS: 0000000000000202 CONTEXT: hypervisor (d0v12)
> > [...]
> > Xen call trace:
> > [<ffff82d040204c4a>] R clear_page_sse2+0x1a/0x30
> > [<ffff82d04022a121>] S clear_domain_page+0x11/0x20
> > [<ffff82d04022c170>] S common/page_alloc.c#alloc_heap_pages+0x400/0x5a0
> > [<ffff82d04022d4a7>] S alloc_domheap_pages+0x67/0x180
> > [<ffff82d040226f9f>] S common/memory.c#populate_physmap+0x22f/0x3b0
> > [<ffff82d040228ec8>] S do_memory_op+0x728/0x1970
> >
> > The maximum allocation order on x86 is limited to 18, which means possibly
> > allocating and scrubbing 1G worth of memory (2^18 4K pages) in one go.
> >
> > Start by limiting dirty allocations to CONFIG_DOMU_MAX_ORDER, which
> > currently corresponds to 2M chunks. However such a limitation might cause
> > fragmentation in HVM p2m population during domain creation. To prevent
> > that, introduce some extra logic in populate_physmap() that falls back to
> > preemptive page-scrubbing if the requested allocation cannot be fulfilled
> > and there's scrubbing work to do. This approach is less fair than the
> > current one, but allows preemptive page scrubbing in the context of
> > populate_physmap() in an attempt to avoid unnecessary page-shattering.
> >
> > Fixes: 74d2e11ccfd2 ("mm: Scrub pages in alloc_heap_pages() if needed")
> > Signed-off-by: Roger Pau Monné <roger.pau@xxxxxxxxxx>
> > ---
> > I'm not particularly happy with this approach, as it doesn't guarantee
> > progress for the callers. IOW: a caller might do a lot of scrubbing, just
> > to get its pages stolen by a different concurrent thread doing
> > allocations. However I'm not sure there's a better solution than resorting
> > to 2M allocations if there's not enough free memory that is scrubbed.
> >
> > I'm having trouble seeing where we could temporarily store page(s) allocated
> > that need to be scrubbed before being assigned to the domain, in a way that
> > can be used by continuations, and that would allow Xen to keep track of
> > them in case the operation is never finished. IOW: we would need to
> > account for cleanup of such temporary stash of pages in case the domain
> > never completes the hypercall, or is destroyed midway.
>
> How about stealing a bit from the range above MEMOP_EXTENT_SHIFT to
> indicate that state, with the actual page (and order plus scrub progress)
> recorded in the target struct domain? Actually, maybe such an indicator
> isn't needed at all: If the next invocation (continuation or not) finds
> an in-progress allocation, it could simply use that rather than doing a
> real allocation. (What to do if this isn't a continuation is less clear:
> We could fail such requests [likely not an option unless we can reliably
> tell original requests from continuations], or split the allocation if
> the request is smaller, or free the allocation to then take the normal
> path.) All of which of course only for "foreign" requests.
>
> If the hypercall is never continued, we could refuse to unpause the
> domain (with the allocation then freed normally when the domain gets
> destroyed).
I have done something along these lines: introduced a couple of
stashing variables in the domain struct and stored the scrubbing
progress there.
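Roughly like the below; the field names here are just placeholders for
illustration, not necessarily what the next version of the patch will
use:

    /*
     * Hypothetical new fields in struct domain: stash an allocation that
     * still needs scrubbing, so a continuation can resume it instead of
     * doing a fresh allocation.
     */
    struct page_info *scrub_stash;       /* in-progress allocation, or NULL */
    unsigned int      scrub_stash_order; /* order of the stashed allocation */
    unsigned long     scrub_stash_done;  /* number of 4K pages scrubbed so far */

And then in populate_physmap(), ahead of doing a real allocation:

    /* Prefer a stashed allocation (continuation or not) if the order matches. */
    if ( d->scrub_stash && d->scrub_stash_order == a->extent_order )
    {
        page = d->scrub_stash;
        /* ... resume scrubbing from d->scrub_stash_done ... */
    }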
> As another alternative, how about returning unscrubbed pages altogether
> when it's during domain creation, requiring the tool stack to do the
> scrubbing (potentially allowing it to skip some of it when pages are
> fully initialized anyway, much like we do for Dom0 iirc)?
It's going to be difficult for the toolstack to figure out which pages
need to be scrubbed; we would need a way to tell it which regions of a
domain's physmap are unscrubbed.
> > --- a/xen/common/memory.c
> > +++ b/xen/common/memory.c
> > @@ -279,6 +279,18 @@ static void populate_physmap(struct memop_args *a)
> >
> > if ( unlikely(!page) )
> > {
> > + nodeid_t node = MEMF_get_node(a->memflags);
> > +
> > + if ( memory_scrub_pending(node) ||
> > + (node != NUMA_NO_NODE &&
> > + !(a->memflags & MEMF_exact_node) &&
> > + memory_scrub_pending(node = NUMA_NO_NODE)) )
> > + {
> > + scrub_free_pages(node);
> > + a->preempted = 1;
> > + goto out;
> > + }
>
> At least for order 0 requests there's no point in trying this. With the
> current logic, actually for orders up to MAX_DIRTY_ORDER.
Yes, otherwise we might force the CPU to do some scrubbing work when
it won't satisfy its allocation request anyway.
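IOW gating the new fallback on the extent order, something along these
lines (untested sketch):

    /*
     * Only scrub ahead of the retry for allocations above MAX_DIRTY_ORDER,
     * i.e. those that alloc_heap_pages() won't scrub in place itself.
     */
    if ( a->extent_order > MAX_DIRTY_ORDER &&
         (memory_scrub_pending(node) ||
          (node != NUMA_NO_NODE &&
           !(a->memflags & MEMF_exact_node) &&
           memory_scrub_pending(node = NUMA_NO_NODE))) )
    {
        scrub_free_pages(node);
        a->preempted = 1;
        goto out;
    }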
> Further, from a general interface perspective, wouldn't we need to do the
> same for at least XENMEM_increase_reservation?
Possibly yes. TBH I would also be fine with strictly limiting
XENMEM_increase_reservation to 2M order extents, even for the control
domain. The physmap population is the only one that actually requires
bigger extents.
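I.e. something as simple as an early exit in increase_reservation()
(just a sketch of what I mean, the exact placement is TBD):

    /* Refuse extents larger than what gets scrubbed in place (2M by default). */
    if ( a->extent_order > MAX_DIRTY_ORDER )
        return;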
> > @@ -1115,7 +1139,16 @@ static struct page_info *alloc_heap_pages(
> > if ( test_and_clear_bit(_PGC_need_scrub, &pg[i].count_info) )
> > {
> > if ( !(memflags & MEMF_no_scrub) )
> > + {
> > scrub_one_page(&pg[i], cold);
> > + /*
> > + * Use SYS_STATE_smp_boot explicitly; ahead of that state
> > + * interrupts are disabled.
> > + */
> > + if ( system_state == SYS_STATE_smp_boot &&
> > + !(dirty_cnt & 0xff) )
> > + process_pending_softirqs();
> > + }
> >
> > dirty_cnt++;
> > }
>
> Yet an alternative consideration: When "cold" is true, couldn't we call
> process_pending_softirqs() like you do here ( >= SYS_STATE_smp_boot then
> of course), without any of the other changes? Of course that's worse
> than a proper continuation, especially from the calling domain's pov.
Overall I think it would be best to solve this with hypercall
continuations, in case we ever want to support pages bigger than 1G.
I know this has a lot of other implications, but it would be nice not
to add more baggage here.
The "cold" case is the typical scenario for domain building, and we
would block a control domain CPU for more than 5s, which seems
undesirable.
> > @@ -223,6 +224,14 @@ struct npfec {
> > #else
> > #define MAX_ORDER 20 /* 2^20 contiguous pages */
> > #endif
> > +
> > +/* Max order when scrubbing pages at allocation time. */
> > +#ifdef CONFIG_DOMU_MAX_ORDER
> > +# define MAX_DIRTY_ORDER CONFIG_DOMU_MAX_ORDER
> > +#else
> > +# define MAX_DIRTY_ORDER 9
> > +#endif
>
> Using CONFIG_DOMU_MAX_ORDER rather than the command line overridable
> domu_max_order means people couldn't even restore original behavior.
We likely want a separate command line option for this one, but given
your comments above we might want to explore other options.
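If we do end up wanting a knob, something like the below would do, with
the option name made up on the spot:

    /*
     * Hypothetical command line override for the maximum order that's
     * scrubbed in place at allocation time.
     */
    static unsigned int __read_mostly dirty_max_order = MAX_DIRTY_ORDER;
    integer_param("dirty-max-order", dirty_max_order);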
Thanks, Roger.