
Re: [PATCH v8 13/13] xen/arm: Add support for system suspend triggered by hardware domain


  • To: Volodymyr Babchuk <Volodymyr_Babchuk@xxxxxxxx>
  • From: Mykola Kvach <xakep.amatop@xxxxxxxxx>
  • Date: Fri, 8 May 2026 11:37:02 +0300
  • Cc: Luca Fancellu <Luca.Fancellu@xxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Mykola Kvach <Mykola_Kvach@xxxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Julien Grall <julien@xxxxxxx>, Bertrand Marquis <Bertrand.Marquis@xxxxxxx>, Michal Orzel <michal.orzel@xxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Anthony PERARD <anthony.perard@xxxxxxxxxx>, Jan Beulich <jbeulich@xxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, Rahul Singh <Rahul.Singh@xxxxxxx>
  • Delivery-date: Fri, 08 May 2026 08:37:19 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

Hi Volodymyr,

Thank you for the feedback.

On Fri, May 8, 2026 at 1:25 AM Volodymyr Babchuk
<Volodymyr_Babchuk@xxxxxxxx> wrote:
>
> Hi Mykola,
>
> Mykola Kvach <xakep.amatop@xxxxxxxxx> writes:
>
> [...]
>
> >> > +    status = can_system_suspend();
> >> > +    if ( status )
> >> > +    {
> >> > +        system_state = SYS_STATE_resume;
> >> > +        goto resume_scheduler;
> >>
> >> When we have an error and we get the resume_scheduler path, we apply back the
> >> context of the guest saved previously in do_psci_1_0_system_suspend(), so am I
> >> correct saying the guest won't get any PSCI error back and we resume the guest
> >> from the guest resume entrypoint?
> >>
> >> In that case, should we have a different path that returns a PSCI error
> >> (PSCI_*) into the guest x0, and skips the context restore?
> >
> > You are right about the current control flow: once the virtual
> > SYSTEM_SUSPEND request has been accepted and the domain has been parked, a
> > later failure in the Xen-wide suspend path resumes the domain through the
> > normal domain resume path, rather than returning a PSCI error from the
> > original call.
> >
> > This is intentional in the current design. The virtual PSCI SYSTEM_SUSPEND
> > path parks the domain and saves its resume context. The actual Xen-wide host
> > suspend is a separate step that is attempted only after all domains are
> > suspended.
> >
> > So a failure in the later Xen-wide suspend step is treated as an abort of the
> > host suspend attempt after the domain suspend was already accepted. The domain
> > is then resumed through the existing domain resume path, similarly to the
> > toolstack/xl suspend-resume flow, rather than by re-entering the guest PSCI
> > call path and modifying the saved vCPU context again.
> >
> > I agree this design is not obvious from the patch. I will clarify the commit
> > message and comments. If you or the maintainers think that failures before the
> > physical SYSTEM_SUSPEND call succeeds should be reported back through the
> > original virtual PSCI call, then this would require a different flow. I was
> > trying to avoid that extra complexity in this series.
>
> I think there is no sense in reporting an error back to the guest. PSCI
> allows resume at any stage, so it is acceptable to have such a brief "suspend".
>
> >
> >>
> >> > +    }
> >> > +
> >> > +    /*
> >> > +     * Non-boot CPUs have to be disabled on suspend and enabled on resume
> >> > +     * (hotplug-based mechanism). Disabling non-boot CPUs will lead to PSCI
> >> > +     * CPU_OFF to be called by each non-boot CPU. Depending on the underlying
> >> > +     * platform capabilities, this may lead to the physical powering down of
> >> > +     * CPUs.
> >> > +     */
> >> > +    status = disable_nonboot_cpus();
> >> > +    if ( status )
> >> > +    {
> >> > +        system_state = SYS_STATE_resume;
> >> > +        goto resume_nonboot_cpus;
> >> > +    }
> >> > +
> >> > +    time_suspend();
> >> > +
> >> > +    status = iommu_suspend();
> >> > +    if ( status )
> >> > +    {
> >> > +        system_state = SYS_STATE_resume;
> >> > +        goto resume_time;
> >> > +    }
> >> > +
> >> > +    console_start_sync();
> >> > +    status = console_suspend();
> >> > +    if ( status )
> >> > +    {
> >> > +        dprintk(XENLOG_ERR, "Failed to suspend the console, err=%d\n", status);
> >> > +        system_state = SYS_STATE_resume;
> >> > +        goto resume_end_sync;
> >> > +    }
> >> > +
> >> > +    local_irq_save(flags);
> >> > +    status = gic_suspend();
> >> > +    if ( status )
> >> > +    {
> >> > +        system_state = SYS_STATE_resume;
> >> > +        goto resume_irqs;
> >> > +    }
> >> > +
> >> > +    set_init_ttbr(xen_pgtable);
> >> > +
> >> > +    /*
> >> > +     * Enable identity mapping before entering suspend to simplify
> >> > +     * the resume path
> >> > +     */
> >> > +    update_boot_mapping(true);
> >> > +
> >> > +    if ( prepare_resume_ctx(&cpu_context) )
> >> > +    {
> >> > +        status = call_psci_system_suspend();
> >> > +        /*
> >> > +         * If suspend is finalized properly by above system suspend PSCI call,
> >> > +         * the code below in this 'if' branch will never execute. Execution
> >> > +         * will continue from hyp_resume which is the hypervisor's resume point.
> >> > +         * In hyp_resume CPU context will be restored and since link-register is
> >> > +         * restored as well, it will appear to return from prepare_resume_ctx.
> >> > +         * The difference in returning from prepare_resume_ctx on system suspend
> >> > +         * versus resume is in function's return value: on suspend, the return
> >> > +         * value is a non-zero value, on resume it is zero. That is why the
> >> > +         * control flow will not re-enter this 'if' branch on resume.
> >> > +         */
> >> > +        if ( status )
> >> > +            dprintk(XENLOG_WARNING, "PSCI system suspend failed, err=%d\n",
> >> > +                    status);
> >> > +    }
> >> > +
> >> > +    system_state = SYS_STATE_resume;
> >> > +    update_boot_mapping(false);
> >> > +
> >> > +    gic_resume();
> >> > +
> >> > + resume_irqs:
> >> > +    local_irq_restore(flags);
> >> > +
> >> > +    console_resume();
> >> > + resume_end_sync:
> >> > +    console_end_sync();
> >> > +
> >> > +    iommu_resume();
> >> > +
> >> > + resume_time:
> >> > +    time_resume();
> >> > +
> >> > + resume_nonboot_cpus:
> >> > +    /*
> >> > +     * The rcu_barrier() has to be added to ensure that the per cpu area is
> >> > +     * freed before a non-boot CPU tries to initialize it (_free_percpu_area()
> >> > +     * has to be called before the init_percpu_area()). This scenario occurs
> >> > +     * when non-boot CPUs are hot-unplugged on suspend and hotplugged on resume.
> >> > +     */
> >> > +    rcu_barrier();
> >> > +    enable_nonboot_cpus();
> >> > +
> >> > + resume_scheduler:
> >> > +    scheduler_enable();
> >> > +    thaw_domains();
> >> > +
> >> > +    system_state = SYS_STATE_active;
> >> > +
> >> > +    printk("Resume (status %d)\n", status);
> >> > +
> >> > +    domain_resume(d);
> >> > +}
> >> > +
> >> > +static DECLARE_TASKLET(system_suspend_tasklet, system_suspend, NULL);
> >> > +
> >> > +void host_system_suspend(struct domain *d)
> >> > +{
> >> > +    system_suspend_tasklet.data = (void *)d;
> >> > +    /*
> >> > +     * The suspend procedure has to be finalized by the pCPU#0 (non-boot pCPUs
> >> > +     * will be disabled during the suspend).
> >> > +     */
> >> > +    tasklet_schedule_on_cpu(&system_suspend_tasklet, 0);
> >> > +}
> >> > +
> >> > /*
> >> >  * Local variables:
> >> >  * mode: C
> >> > diff --git a/xen/arch/arm/vpsci.c b/xen/arch/arm/vpsci.c
> >> > index bd87ec430d..8fb9172186 100644
> >> > --- a/xen/arch/arm/vpsci.c
> >> > +++ b/xen/arch/arm/vpsci.c
> >> > @@ -5,6 +5,7 @@
> >> >
> >> > #include <asm/current.h>
> >> > #include <asm/domain.h>
> >> > +#include <asm/suspend.h>
> >> > #include <asm/vgic.h>
> >> > #include <asm/vpsci.h>
> >> > #include <asm/event.h>
> >> > @@ -232,8 +233,7 @@ static int32_t do_psci_1_0_system_suspend(register_t epoint, register_t cid)
> >> >     if ( is_64bit_domain(d) && is_thumb )
> >> >         return PSCI_INVALID_ADDRESS;
> >> >
> >> > -    /* SYSTEM_SUSPEND is not supported for the hardware domain yet */
> >> > -    if ( is_hardware_domain(d) )
> >> > +    if ( !IS_ENABLED(CONFIG_SYSTEM_SUSPEND) && is_hardware_domain(d) )
> >> >         return PSCI_NOT_SUPPORTED;
> >> >
> >> >     /* Ensure that all CPUs other than the calling one are offline */
> >> > @@ -266,6 +266,9 @@ static int32_t do_psci_1_0_system_suspend(register_t epoint, register_t cid)
> >> >             "SYSTEM_SUSPEND requested, epoint=%#"PRIregister", 
> >> > cid=%#"PRIregister"\n",
> >> >             epoint, cid);
> >> >
> >> > +    if ( is_control_domain(d) )
> >>
> >> Why is_control_domain() here and not is_hardware_domain() ?
> >
> > The use of is_control_domain() is intentional.
> >
> > The intended model is that Xen-wide host suspend is orchestrated by the
> > privileged management/control domain. The control domain coordinates the
> > toolstack side, asks other domains to enter suspend, and then issues the final
> > SYSTEM_SUSPEND request to Xen.
> >
> > This does not have to be the same entity as the hardware domain. If the
> > hardware domain is separate, it is one of the domains that the control domain
> > parks before the final host suspend step.
> >
> > The hwdom-specific checks in this patch have a different purpose: they avoid
> > the old hwdom_shutdown() path for SHUTDOWN_suspend and allow the hardware
> > domain to be parked as part of the suspend sequence. They do not define the
> > policy for who is allowed to trigger Xen-wide host suspend.
> >
> > That said, this policy may not be optimal for all configurations, especially
> > when the control and hardware domain roles are split. I would appreciate your
> > view, as well as the maintainers' views, on whether the trigger should remain
> > control-domain based, be tied to the hardware domain instead, or be expressed
> > through a separate host-suspend capability/helper.
>
>
> The hardware domain owns all the hardware. Hardware shall be put into a
> powered-down/suspended state before suspending the SoC, so it can be
> resumed afterwards. You can't just pause the hardware domain in the same way
> as pausing all other domains.
>
> (Of course, we'll have the same issues with domains that have
> passed-through hardware, but in this case Dom0 shall orchestrate a proper
> suspend sequence for these)

Yes, I agree that the hardware domain must not be externally
paused as a replacement for its own suspend path.

What I meant to describe is a guest-driven suspend sequence.
The control domain/toolstack may orchestrate the sequence,
but each domain that needs to quiesce hardware, including
the hardware domain and any domain with passed-through
devices, is expected to enter its own suspend path first and
quiesce its devices before issuing the virtual PSCI
SYSTEM_SUSPEND call.

Xen only treats other domains as ready for host suspend after
they have voluntarily reached SHUTDOWN_suspend. In the split
control/hardware-domain case, the final host-wide suspend
request from the control domain is accepted only after the
other domains, including the hardware domain, are already in
SHUTDOWN_suspend.

So my wording saying that the control domain "parks" the
hardware domain was imprecise. The control domain orchestrates
the sequence; it does not externally pause the hardware domain
as a substitute for its own suspend path.

Best regards,
Mykola

>
> [...]
>
> --
> WBR, Volodymyr



 

