Re: stable-4.18: reliably crash network driver domain by squeezing free_memory



On Thu, Nov 28, 2024 at 03:39:07PM +0000, Andrew Cooper wrote:
> On 28/11/2024 3:31 pm, James Dingwall wrote:
> > Hi,
> >
> > We have reproducible issue with the current HEAD of the stable-4.18 branch
> > which crashes a network driver domain and on some hardware subsequently
> > results in a dom0 crash.
> >
> > `xl info` reports: free_memory : 39961, configuring a guest with
> > memory = 39800 and starting it gives the log as below.  This is intel
> > hardware so if I've followed the code correctly I think this leads through
> > to intel_iommu_map_page() from drivers/passthrough/vtd/iommu.c.
> >
> > The expectation is that we can safely allocate up to free_memory for a
> > guest without any issue.  Is there any extra logging we could enable to
> > gain more information?
> 
> For this, you really should CC the x86 maintainers, or it stands a
> chance of getting missed.
> 
> Do you have the complete serial log including boot and eventual crash ?
> 
> -12 is -ENOMEM so something is wonky, and while dom2 is definitely dead
> at this point, Xen ought to be able to unwind cleanly and not take down
> dom0 too.
> 
> ~Andrew

<snipped the original crash report since it is also in the attached logs>

I've attached complete serial console logs from an Intel and an AMD dom0
which show similar behaviour.  The dom0 crash originally mentioned was
resolved by updating a patch for OpenZFS issue #15140 and no longer
occurs.

During the capture of the serial console logs I noted that:

1. If the domains start in a different order then there is no crash.
   Restarting the domain later will lead to the driver domain crash, even
   without a configuration change.
2. If the domU memory is closer to free_memory, but still less than it, the
   domain fails to start with libxl reporting not enough memory.

So there is some undefined range from (free_memory - m) to (free_memory - n)
in which it is possible to crash the driver domain, depending on the guest
startup ordering.  My (perhaps naive) reasoning would be that
free_memory is the amount that can safely be assigned without having to
allow for some unknown overhead, and that if I do ask for too much then I
get a 'safe' failure.
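
To make the margin concrete, here is a minimal sketch of the headroom
calculation for the figures quoted earlier in the thread.  It is only an
illustration: the helper names (parse_free_memory, headroom) are
hypothetical, the sample text contains just the one `xl info` line reported
above, and both values are assumed to be in MiB.  It says nothing about
where the m and n thresholds actually lie.

```python
import re


def parse_free_memory(xl_info_text):
    """Return the free_memory value from `xl info` style text, or None.

    Hypothetical helper: it only looks for a 'free_memory : <number>' line.
    """
    match = re.search(r"^free_memory\s*:\s*(\d+)\s*$", xl_info_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def headroom(xl_info_text, guest_memory):
    """Return free_memory minus the guest's configured memory (assumed MiB)."""
    free = parse_free_memory(xl_info_text)
    if free is None:
        raise ValueError("free_memory not found in xl info output")
    return free - guest_memory


# Only the free_memory line from the original report is reproduced here.
sample = "free_memory            : 39961\n"

# Guest configured with memory = 39800, as in the original report.
margin = headroom(sample, 39800)
print(f"headroom: {margin}")
```

With the reported numbers the guest is asking for all but 161 of what
free_memory reports, which falls inside the range where the driver domain
crash was seen.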

Thanks,
James

Attachment: amd-enomem.txt
Description: Text document

Attachment: intel-enomem.txt
Description: Text document


 

