|
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [PATCH 2/2] x86/boot: attempt to print trace and panic on AP bring up stall
On Wed, May 21, 2025 at 07:16:18PM +0100, Andrew Cooper wrote:
> On 21/05/2025 5:55 pm, Roger Pau Monne wrote:
> > With the current AP bring up code Xen can get stuck indefinitely if an AP
>
> You want a comma between "code, Xen" to make the sentence easier to parse.
>
> > freezes during boot after the 'callin' step. Introduce a 10s timeout while
> > waiting for APs to finish startup.
>
> 5s is the timeout used in other parts of AP bringup. I'd suggest being
> consistent here.
>
>
> > On failure of an AP to complete startup send an NMI to trigger the printing
>
> Again, a comma between "startup, send" would go a long way.
>
> > of a stack backtrace on the stuck AP and panic on the BSP.
> >
> > The sending of the NMI re-uses the code already present in fatal_trap(), by
> > moving it to a separate function.
>
> I'd be tempted to split the patch in two.
>
> >
> > Signed-off-by: Roger Pau Monné <roger.pau@xxxxxxxxxx>
> > ---
>
> It may be worth nothing that this came from the ICX143 investigation,
> even if it wasn't relevant in the end?
>
> > diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
> > index 48ce996ba414..77dce3e3e22b 100644
> > --- a/xen/arch/x86/smpboot.c
> > +++ b/xen/arch/x86/smpboot.c
> > @@ -1388,10 +1389,17 @@ int __cpu_up(unsigned int cpu)
> > time_latch_stamps();
> >
> > set_cpu_state(CPU_STATE_ONLINE);
> > + start = NOW();
> > while ( !cpu_online(cpu) )
> > {
> > cpu_relax();
> > process_pending_softirqs();
> > + if ( NOW() > start + SECONDS(10) )
>
> (NOW() - start) > SECONDS(10)
>
> It has one fewer boundary conditions, even if it is rather unlikely that
> start + 10s will overflow.
>
> > diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
> > index c94779b4ad4f..9b9e3726e2fb 100644
> > --- a/xen/arch/x86/traps.c
> > +++ b/xen/arch/x86/traps.c
> > @@ -734,6 +736,40 @@ static int cf_check nmi_show_execution_state(
> > return 1;
> > }
> >
> > +void show_execution_state_nmi(const cpumask_t *mask, bool show_all)
> > +{
> > + unsigned int msecs, pending;
> > +
> > + force_show_all = show_all;
> > +
> > + watchdog_disable();
> > + console_start_sync();
> > +
> > + cpumask_copy(&show_state_mask, mask);
> > + set_nmi_callback(nmi_show_execution_state);
> > + /* Ensure new callback is set before sending out the NMI. */
> > + smp_wmb();
>
> I know this is only moving code, but this is wrong. So is the smp_mb()
> in the x2apic drivers.
>
> It would only be correct in principle for xAPIC (which is an MMIO
> store), except it's UC and is strongly ordered anyway. Furthermore, the
> sequence point for the send_IPI_mask() prevents the compiler from
> reordering this unsafely.
>
> The x2APIC drivers need LFENCE;MFENCE on Intel, and just MFENCE on AMD,
> and this (critically) is not smp_mb(), which is now just a locked operation.
>
> I bet these aren't the only examples of incorrect barriers WRT IPIs. I
> guess we should fix those separately.
Thanks, I will remove the smp_wmb() ahead of moving the code, but
other instances in the APIC drivers I will leave for a different
series, I don't want to delay the work here on those fixes.
Regards, Roger.
|
![]() |
Lists.xenproject.org is hosted with RackSpace, monitoring our |