Re: stable-4.18: reliably crash network driver domain by squeezing free_memory
On Thu, Nov 28, 2024 at 03:39:07PM +0000, Andrew Cooper wrote:
> On 28/11/2024 3:31 pm, James Dingwall wrote:
> > Hi,
> >
> > We have a reproducible issue with the current HEAD of the stable-4.18
> > branch which crashes a network driver domain and on some hardware
> > subsequently results in a dom0 crash.
> >
> > `xl info` reports: free_memory : 39961; configuring a guest with
> > memory = 39800 and starting it gives the log as below. This is Intel
> > hardware, so if I've followed the code correctly I think this leads
> > through to intel_iommu_map_page() from drivers/passthrough/vtd/iommu.c.
> >
> > The expectation is that we can safely allocate up to free_memory for a
> > guest without any issue. Is there any extra logging we could enable to
> > gain more information?
>
> For this, you really should CC the x86 maintainers, or it stands a
> chance of getting missed.
>
> Do you have the complete serial log including boot and eventual crash?
>
> -12 is -ENOMEM so something is wonky, and while dom2 is definitely dead
> at this point, Xen ought to be able to unwind cleanly and not take down
> dom0 too.
>
> ~Andrew

<snipped the original crash report since it is also in the attached logs>

I've attached complete serial console logs from an Intel and an AMD dom0
which show similar behaviour. The dom0 crash originally mentioned was
resolved by updating a patch for OpenZFS issue #15140 and no longer occurs.

During the capture of the serial console logs I noted that:

1. If the order in which the domains start is different then there is no
   crash. Restarting the domain later will lead to the driver domain crash
   even without a configuration change.

2. If the domU memory is closer to free_memory, but still less than it, the
   domain fails to start with libxl reporting not enough memory.

So there is some undefined range from (free_memory - m) to (free_memory - n)
where it is possible to crash the driver domain depending on the guest
startup ordering.
My (perhaps naive) reasoning would be that free_memory is the resource
available to safely assign without having to allow for some unknown
overhead, and if I do ask for too much then I get a 'safe' failure.

Thanks,
James

Attachment: amd-enomem.txt
Attachment: intel-enomem.txt
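For what it's worth, until the root cause is understood, the workaround implied by the observations above can be sketched as a small shell helper that sizes a guest some margin below the reported free_memory. The free_memory value (39961) is taken from the report above; the 256 MiB margin is purely an assumption for illustration, not a figure from this thread:

```shell
# Sketch only: size a guest config below `xl info` free_memory by an
# assumed safety margin, since the thread shows that a request just
# under free_memory can still fail with -ENOMEM (-12) in the IOMMU path.
free_mb=39961   # free_memory in MiB, as reported by `xl info` above
margin_mb=256   # assumed headroom (e.g. for IOMMU page tables); a guess
guest_mb=$(( free_mb - margin_mb ))
echo "memory = ${guest_mb}"
```

In real use `free_mb` would be read from the live system, e.g. `xl info | awk '/^free_memory/ {print $3}'`, rather than hard-coded.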