[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH for-4.20 1/2] x86/shutdown: quiesce devices ahead of AP shutdown



On Wed, Jan 29, 2025 at 11:13:09AM +0100, Jan Beulich wrote:
> On 28.01.2025 17:27, Roger Pau Monne wrote:
> > The current shutdown logic in smp_send_stop() will first disable the APs,
> > and then attempt to disable (some) of the interrupt sources.
> > 
> > There are two issues with this approach; the first one being that MSI
> > interrupt sources are not disabled, the second one is the APs are stopped
> > before interrupts are disabled.  On AMD systems this can lead to the
> > triggering of local APIC errors:
> > 
> > APIC error on CPU0: 00(08), Receive accept error
> > 
> > Such error message can be printed in a loop, thus blocking the system from
> > rebooting.  I assume this loop is created by the error being triggered by
> > the console interrupt, which is further triggered by the ESR reporting
> > write to the console.
> > 
> > Intel SDM states:
> > 
> > "Receive Accept Error.
> > 
> > Set when the local APIC detects that the message it received was not
> > accepted by any APIC on the APIC bus, including itself. Used only on P6
> > family and Pentium processors."
> > 
> > So the error shouldn't trigger on any Intel CPU supported by Xen.
> > 
> > However AMD doesn't make such claims, and indeed the error is broadcasted
> > to all local APIC when for example an interrupt targets a CPU that's
> > offline.
> > 
> > To prevent the error from triggering, move the masking of IO-APIC pins
> > ahead of stopping the APs.  Also introduce a new function that disables
> > MSI and MSI-X on all PCI devices.  Remove the call to fixup_irqs() since
> > there's no point in attempting to move interrupts: all sources will be
> > either masked or disabled.
> > 
> > For the NMI crash path also call the newly introduced function, with the
> > hope that disabling MSI and MSI-X will make it easier for the (possible)
> > crash kernel to boot, as it could otherwise receive the same "Receive
> > accept error" upon re-enabling interrupts.
> > 
> > Note that this will have the side-effect of preventing further IOMMU
> > interrupts from being delivered, that's expected and at that point in the
> > shutdown process no further interaction with the IOMMU should be relevant.
> 
> This is at most for AMD only. Shouldn't we similarly disable VT-d's
> interrupt(s)? (It's only one right now, as we still don't use the QI
> completion one.) Even for AMD I'm uncertain: It has separate
> hw_irq_controller instances, and its set_iommu_interrupt_handler() is
> custom as well. Will pci_disable_msi_all() really affect it? (Hmm,
> yes, from amd_iommu_msi_enable() it looks like it will.)

I was only partly right, the XT interrupt type will still need to be
disabled in a custom way, as there's no associated MSI(-X) capability
in that case.

> 
> > --- a/xen/arch/x86/msi.c
> > +++ b/xen/arch/x86/msi.c
> > @@ -1248,6 +1248,20 @@ void pci_cleanup_msi(struct pci_dev *pdev)
> >      msi_free_irqs(pdev);
> >  }
> >  
> > +static int cf_check disable_msi(struct pci_dev *pdev, void *arg)
> > +{
> > +    msi_set_enable(pdev, 0);
> > +    msix_set_enable(pdev, 0);
> > +
> > +    return 0;
> > +}
> > +
> > +void pci_disable_msi_all(void)
> > +{
> > +    /* Disable MSI and/or MSI-X on all devices. */
> > +    pci_iterate_devices(disable_msi, NULL);
> > +}
> 
> That's going to be all devices we know of. I.e. best effort only. Maybe
> the comment should be adjusted to this effect.

Sure.

> > --- a/xen/arch/x86/smp.c
> > +++ b/xen/arch/x86/smp.c
> > @@ -358,14 +358,15 @@ void smp_send_stop(void)
> >  {
> >      unsigned int cpu = smp_processor_id();
> >  
> > +    local_irq_disable();
> > +    disable_IO_APIC();
> > +    pci_disable_msi_all();
> > +    local_irq_enable();
> > +
> >      if ( num_online_cpus() > 1 )
> >      {
> >          int timeout = 10;
> >  
> > -        local_irq_disable();
> > -        fixup_irqs(cpumask_of(cpu), 0);
> > -        local_irq_enable();
> > -
> >          smp_call_function(stop_this_cpu, NULL, 0);
> >  
> >          /* Wait 10ms for all other CPUs to go offline. */
> > @@ -376,7 +377,6 @@ void smp_send_stop(void)
> >      if ( cpu_online(cpu) )
> >      {
> >          local_irq_disable();
> > -        disable_IO_APIC();
> >          hpet_disable();
> 
> Like IOMMUs, HPET also has custom interrupt management. I think this
> call needs pulling up, too (much like it is also there in
> nmi_shootdown_cpus()).

Indeed, I wasn't taking into account the FSB capability.

> 
> > --- a/xen/drivers/passthrough/pci.c
> > +++ b/xen/drivers/passthrough/pci.c
> > @@ -1803,6 +1803,38 @@ int iommu_do_pci_domctl(
> >      return ret;
> >  }
> >  
> > +struct segment_iter {
> > +    int (*handler)(struct pci_dev *pdev, void *arg);
> > +    void *arg;
> > +};
> > +
> > +static int cf_check iterate_all(struct pci_seg *pseg, void *arg)
> > +{
> > +    const struct segment_iter *iter = arg;
> > +    struct pci_dev *pdev;
> > +
> > +    list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
> > +    {
> > +        int rc = iter->handler(pdev, iter->arg);
> > +
> > +        if ( rc )
> > +            return rc;
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +int pci_iterate_devices(int (*handler)(struct pci_dev *pdev, void *arg),
> > +                        void *arg)
> > +{
> > +    struct segment_iter iter = {
> > +        .handler = handler,
> > +        .arg = arg,
> > +    };
> > +
> > +    return pci_segments_iterate(iterate_all, &iter);
> > +}
> 
> For the specific purpose during shutdown it may be okay to do all of this
> without locking (but see below) and without preemption checks. Yet then a
> warning will want putting here to indicate that from other environments
> this isn't okay to use as-is.
> 
> This use then also requires that msi{,x}_set_enable() paths never gain
> lock-related assertions.

Good point.  It might be better to just wrap the code in
pci_iterate_devices() with pcidevs_{,un}lock().

> Talking of the lack of locking: Since you invoke the disabling before
> bringing down APs, we're ending up in kind of a chicken and egg problem
> here: Without APs quiesced, there may be operations in progress there
> which conflict with the disabling done here. Hence why so far we brought
> down APs first.

I could implement a synchronized approach with the BSP only doing the
interrupt disabling after the APs are in stop_this_cpu() with
interrupts disabled, but before the calling __stop_this_cpu().

My thinking for doing it the proposed way is that MSI(-X) capability
enable is not toggled by Xen as part of interrupt handling, and hence
there should be no intent to enable the capabilities back after being
disabled by the BSP.

> With this special-purpose use I further wonder whether iterate_all()
> wouldn't better continue despite an error coming back from a callback
> (and also arrange for pci_segments_iterate() to continue, by merely
> recording any possible error in struct segment_iter), and only accumulate
> the error code to eventually return. The more devices we manage to
> quiesce, the better our chances of rebooting cleanly.

Fair enough, will do.

Thanks, Roger.



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.