Re: stable-4.18: reliably crash network driver domain by squeezing free_memory
On 02.12.2024 16:54, James Dingwall wrote:
> On Thu, Nov 28, 2024 at 03:39:07PM +0000, Andrew Cooper wrote:
>> On 28/11/2024 3:31 pm, James Dingwall wrote:
>>> Hi,
>>>
>>> We have reproducible issue with the current HEAD of the stable-4.18 branch
>>> which crashes a network driver domain and on some hardware subsequently
>>> results in a dom0 crash.
>>>
>>> `xl info` reports: free_memory : 39961, configuring a guest with
>>> memory = 39800 and starting it gives the log as below. This is intel
>>> hardware so if I've followed the code correctly I think this leads through
>>> to intel_iommu_map_page() from drivers/passthrough/vtd/iommu.c.
>>>
>>> The expectation is that we can safely allocate up to free_memory for a
>>> guest without any issue. Is there any extra logging we could enable to
>>> gain more information?
>>
>> For this, you really should CC the x86 maintainers, or it stands a
>> chance of getting missed.
>>
>> Do you have the complete serial log including boot and eventual crash ?
>>
>> -12 is -ENOMEM so something is wonky, and while dom2 is definitely dead
>> at this point, Xen ought to be able to unwind cleanly and not take down
>> dom0 too.
>>
>> ~Andrew
>
> <snipped the original crash report since it is also in the attached logs>
>
> I've attached complete serial console logs from an Intel and an AMD dom0
> which show similar behaviour. The dom0 crash originally mentioned was
> resolved by updating a patch for OpenZFS issue #15140 and no longer
> occurs.
>
> During the capture of the serial console logs I noted that:
>
> 1. If the order that the domains start is different then there is no crash.
>    Restarting the domain later will lead to the driver domain crash even
>    without a configuration change.
> 2. If the domU memory is closer to free_memory but still less than the
>    domain fails to start with libxl reporting not enough memory.
>
> So there is some undefined range for (free_memory - m) to (free_memory - n)
> where it is possible to crash the driver domain depending on the guest
> startup ordering. My (perhaps naive) reasoning would be that
> free_memory is the resource available to safely assign without having to
> allow for some unknown overhead and if I do ask for too much then I
> get a 'safe' failure.

As per the earlier reply I sent, this isn't the case. "free_memory" isn't
adjusted to account for extra overhead. As per Marek's reply, you need to
leave some spare, at least as of how things are right now.

Jan

PS: I've dropped security@. We're now firmly into discussing this in public,
irrespective of possible security angles. I don't think that's what should
have happened, but it also makes no sense to further pretend to attempt to
first assess the full scope in private.
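The advice to "leave some spare" can be sketched as a small calculation: read free_memory from `xl info` output and subtract a margin before choosing the guest's `memory =` value. This is only an illustrative sketch, not part of Xen or its toolstack. The helper names and the 1024 MiB headroom are hypothetical; the thread does not say how much spare is actually enough, and both figures are assumed here to be in MiB.

```python
import re


def parse_free_memory(xl_info_text: str) -> int:
    """Return the free_memory value from the text output of `xl info`."""
    match = re.search(r"^free_memory\s*:\s*(\d+)\s*$", xl_info_text, re.MULTILINE)
    if match is None:
        raise ValueError("free_memory not found in xl info output")
    return int(match.group(1))


def max_guest_memory(free_mib: int, headroom_mib: int) -> int:
    """Largest guest size that still leaves headroom_mib unallocated."""
    return max(free_mib - headroom_mib, 0)


# Sample line using the value quoted in the original report.
sample = "free_memory            : 39961\n"

# Hypothetical margin: the thread gives no figure for a safe amount of spare.
HEADROOM_MIB = 1024

free_mib = parse_free_memory(sample)
print(max_guest_memory(free_mib, HEADROOM_MIB))
```

The margin is kept as an explicit parameter because, per the reply above, free_memory does not account for the extra overhead, so any fixed number is a guess to be tuned per host.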