
Re: [PATCH v8 13/13] xen/arm: Add support for system suspend triggered by hardware domain


  • To: Luca Fancellu <Luca.Fancellu@xxxxxxx>
  • From: Mykola Kvach <xakep.amatop@xxxxxxxxx>
  • Date: Tue, 5 May 2026 23:34:18 +0300
  • Cc: "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Mykola Kvach <mykola_kvach@xxxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Julien Grall <julien@xxxxxxx>, Bertrand Marquis <Bertrand.Marquis@xxxxxxx>, Michal Orzel <michal.orzel@xxxxxxx>, Volodymyr Babchuk <Volodymyr_Babchuk@xxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Anthony PERARD <anthony.perard@xxxxxxxxxx>, Jan Beulich <jbeulich@xxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, Rahul Singh <Rahul.Singh@xxxxxxx>
  • Delivery-date: Tue, 05 May 2026 20:34:34 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

Hi Luca,

Thanks for the feedback.

On Wed, Apr 29, 2026 at 11:06 AM Luca Fancellu <Luca.Fancellu@xxxxxxx> wrote:
>
> Hi Mykola,
>
> > diff --git a/xen/arch/arm/suspend.c b/xen/arch/arm/suspend.c
> > index e38566b0b7..4d1289776b 100644
> > --- a/xen/arch/arm/suspend.c
> > +++ b/xen/arch/arm/suspend.c
> > @@ -1,9 +1,190 @@
> > /* SPDX-License-Identifier: GPL-2.0-only */
> >
> > +#include <asm/psci.h>
> > #include <asm/suspend.h>
> >
> > +#include <public/sched.h>
> > +#include <xen/console.h>
> > +#include <xen/cpu.h>
> > +#include <xen/errno.h>
> > +#include <xen/iommu.h>
> > +#include <xen/sched.h>
> > +#include <xen/tasklet.h>
> > +
> > struct cpu_context cpu_context = {};
> >
> > +static int can_system_suspend(void)
> > +{
> > +    int ret = 0;
> > +    struct domain *d;
> > +
> > +    rcu_read_lock(&domlist_read_lock);
> > +
> > +    for_each_domain ( d )
> > +    {
> > +        bool domain_suspended;
> > +
> > +        spin_lock(&d->shutdown_lock);
> > +        domain_suspended = d->is_shut_down &&
> > +                           d->shutdown_code == SHUTDOWN_suspend;
> > +        spin_unlock(&d->shutdown_lock);
> > +
> > +        if ( domain_suspended )
> > +            continue;
> > +
> > +        printk(XENLOG_ERR
> > +               "System suspend requires all domains to be shut down for suspend (dom%d: isn't in suspend state)\n",
>
> d->domain_id is unsigned if I’m not mistaken, it wants %u (typedef uint16_t 
> domid_t;)

Ack, I will fix it in v9.

>
> > +               d->domain_id);
> > +
> > +        ret = -EBUSY;
> > +        break;
> > +    }
> > +
> > +    rcu_read_unlock(&domlist_read_lock);
> > +
> > +    return ret;
> > +}
> > +
> > +/* Xen suspend. data identifies the domain that initiated suspend. */
> > +static void system_suspend(void *data)
> > +{
> > +    int status;
> > +    unsigned long flags;
> > +    struct domain *d = (struct domain *)data;
> > +
> > +    BUG_ON(system_state != SYS_STATE_active);
> > +
> > +    system_state = SYS_STATE_suspend;
> > +
> > +    printk("Xen suspending...\n");
> > +
> > +    freeze_domains();
> > +    scheduler_disable();
> > +
> > +    status = can_system_suspend();
> > +    if ( status )
> > +    {
> > +        system_state = SYS_STATE_resume;
> > +        goto resume_scheduler;
>
> When we have an error and we get the resume_scheduler path, we apply back the
> context of the guest saved previously in do_psci_1_0_system_suspend(), so am I
> correct saying the guest won’t get any PSCI error back and we resume the guest
> from the guest resume entrypoint?
>
> In case, should we have a different path that returns a PSCI error (PSCI_*)
> into the guest x0, and skips the context restore?

You are right about the current control flow: once the virtual
SYSTEM_SUSPEND request has been accepted and the domain has been parked, a
later failure in the Xen-wide suspend path resumes the domain through the normal
domain resume path, rather than returning a PSCI error from the original call.

This is intentional in the current design. The virtual PSCI SYSTEM_SUSPEND
path parks the domain and saves its resume context. The actual Xen-wide host
suspend is a separate step that is attempted only after all domains are
suspended.

So a failure in the later Xen-wide suspend step is treated as an abort of the
host suspend attempt after the domain suspend was already accepted. The domain
is then resumed through the existing domain resume path, similarly to the
toolstack/xl suspend-resume flow, rather than by re-entering the guest PSCI
call path and modifying the saved vCPU context again.

I agree this design is not obvious from the patch. I will clarify the commit
message and comments. If you or the maintainers think that failures before the
physical SYSTEM_SUSPEND call succeeds should be reported back through the
original virtual PSCI call, then this would require a different flow. I was
trying to avoid that extra complexity in this series.
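
To make the distinction concrete, here is a tiny sketch in plain C (the names
are invented for this email, not actual Xen code) of the two failure semantics
we are discussing:

```c
#include <stdbool.h>

enum outcome {
    RESUME_VIA_DOMAIN_RESUME,   /* guest resumes from its resume entrypoint */
    PSCI_ERROR_IN_X0            /* original call fails, context not restored */
};

/* Current series: a late host-suspend failure always aborts the host
 * suspend attempt and resumes the parked domain through the normal
 * domain resume path. */
static enum outcome current_flow(bool host_suspend_failed)
{
    (void)host_suspend_failed;  /* same outcome either way */
    return RESUME_VIA_DOMAIN_RESUME;
}

/* Alternative you raise: propagate a PSCI error back into the guest's
 * x0 from the original virtual call and skip the context restore. */
static enum outcome alternative_flow(bool host_suspend_failed)
{
    return host_suspend_failed ? PSCI_ERROR_IN_X0
                               : RESUME_VIA_DOMAIN_RESUME;
}
```

The second variant would need the vPSCI path to keep the saved vCPU context
untouched on failure, which is the extra complexity I mentioned.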

>
> > +    }
> > +
> > +    /*
> > +     * Non-boot CPUs have to be disabled on suspend and enabled on resume
> > +     * (hotplug-based mechanism). Disabling non-boot CPUs will lead to PSCI
> > +     * CPU_OFF to be called by each non-boot CPU. Depending on the underlying
> > +     * platform capabilities, this may lead to the physical powering down of
> > +     * CPUs.
> > +     */
> > +    status = disable_nonboot_cpus();
> > +    if ( status )
> > +    {
> > +        system_state = SYS_STATE_resume;
> > +        goto resume_nonboot_cpus;
> > +    }
> > +
> > +    time_suspend();
> > +
> > +    status = iommu_suspend();
> > +    if ( status )
> > +    {
> > +        system_state = SYS_STATE_resume;
> > +        goto resume_time;
> > +    }
> > +
> > +    console_start_sync();
> > +    status = console_suspend();
> > +    if ( status )
> > +    {
> > +        dprintk(XENLOG_ERR, "Failed to suspend the console, err=%d\n",
> > +                status);
> > +        system_state = SYS_STATE_resume;
> > +        goto resume_end_sync;
> > +    }
> > +
> > +    local_irq_save(flags);
> > +    status = gic_suspend();
> > +    if ( status )
> > +    {
> > +        system_state = SYS_STATE_resume;
> > +        goto resume_irqs;
> > +    }
> > +
> > +    set_init_ttbr(xen_pgtable);
> > +
> > +    /*
> > +     * Enable identity mapping before entering suspend to simplify
> > +     * the resume path
> > +     */
> > +    update_boot_mapping(true);
> > +
> > +    if ( prepare_resume_ctx(&cpu_context) )
> > +    {
> > +        status = call_psci_system_suspend();
> > +        /*
> > +         * If suspend is finalized properly by above system suspend PSCI call,
> > +         * the code below in this 'if' branch will never execute. Execution
> > +         * will continue from hyp_resume which is the hypervisor's resume point.
> > +         * In hyp_resume CPU context will be restored and since link-register is
> > +         * restored as well, it will appear to return from prepare_resume_ctx.
> > +         * The difference in returning from prepare_resume_ctx on system suspend
> > +         * versus resume is in function's return value: on suspend, the return
> > +         * value is a non-zero value, on resume it is zero. That is why the
> > +         * control flow will not re-enter this 'if' branch on resume.
> > +         */
> > +        if ( status )
> > +            dprintk(XENLOG_WARNING, "PSCI system suspend failed, err=%d\n",
> > +                    status);
> > +    }
> > +
> > +    system_state = SYS_STATE_resume;
> > +    update_boot_mapping(false);
> > +
> > +    gic_resume();
> > +
> > + resume_irqs:
> > +    local_irq_restore(flags);
> > +
> > +    console_resume();
> > + resume_end_sync:
> > +    console_end_sync();
> > +
> > +    iommu_resume();
> > +
> > + resume_time:
> > +    time_resume();
> > +
> > + resume_nonboot_cpus:
> > +    /*
> > +     * The rcu_barrier() has to be added to ensure that the per cpu area is
> > +     * freed before a non-boot CPU tries to initialize it (_free_percpu_area()
> > +     * has to be called before the init_percpu_area()). This scenario occurs
> > +     * when non-boot CPUs are hot-unplugged on suspend and hotplugged on resume.
> > +     */
> > +    rcu_barrier();
> > +    enable_nonboot_cpus();
> > +
> > + resume_scheduler:
> > +    scheduler_enable();
> > +    thaw_domains();
> > +
> > +    system_state = SYS_STATE_active;
> > +
> > +    printk("Resume (status %d)\n", status);
> > +
> > +    domain_resume(d);
> > +}
> > +
> > +static DECLARE_TASKLET(system_suspend_tasklet, system_suspend, NULL);
> > +
> > +void host_system_suspend(struct domain *d)
> > +{
> > +    system_suspend_tasklet.data = (void *)d;
> > +    /*
> > +     * The suspend procedure has to be finalized by the pCPU#0 (non-boot pCPUs
> > +     * will be disabled during the suspend).
> > +     */
> > +    tasklet_schedule_on_cpu(&system_suspend_tasklet, 0);
> > +}
> > +
> > /*
> >  * Local variables:
> >  * mode: C
> > diff --git a/xen/arch/arm/vpsci.c b/xen/arch/arm/vpsci.c
> > index bd87ec430d..8fb9172186 100644
> > --- a/xen/arch/arm/vpsci.c
> > +++ b/xen/arch/arm/vpsci.c
> > @@ -5,6 +5,7 @@
> >
> > #include <asm/current.h>
> > #include <asm/domain.h>
> > +#include <asm/suspend.h>
> > #include <asm/vgic.h>
> > #include <asm/vpsci.h>
> > #include <asm/event.h>
> > @@ -232,8 +233,7 @@ static int32_t do_psci_1_0_system_suspend(register_t epoint, register_t cid)
> >     if ( is_64bit_domain(d) && is_thumb )
> >         return PSCI_INVALID_ADDRESS;
> >
> > -    /* SYSTEM_SUSPEND is not supported for the hardware domain yet */
> > -    if ( is_hardware_domain(d) )
> > +    if ( !IS_ENABLED(CONFIG_SYSTEM_SUSPEND) && is_hardware_domain(d) )
> >         return PSCI_NOT_SUPPORTED;
> >
> >     /* Ensure that all CPUs other than the calling one are offline */
> > @@ -266,6 +266,9 @@ static int32_t do_psci_1_0_system_suspend(register_t epoint, register_t cid)
> >             "SYSTEM_SUSPEND requested, epoint=%#"PRIregister", cid=%#"PRIregister"\n",
> >             epoint, cid);
> >
> > +    if ( is_control_domain(d) )
>
> Why is_control_domain() here and not is_hardware_domain() ?

The use of is_control_domain() is intentional.

The intended model is that Xen-wide host suspend is orchestrated by the
privileged management/control domain. The control domain coordinates the
toolstack side, asks other domains to enter suspend, and then issues the final
SYSTEM_SUSPEND request to Xen.

This does not have to be the same entity as the hardware domain. If the
hardware domain is separate, it is one of the domains that the control domain
parks before the final host suspend step.

The hwdom-specific checks in this patch have a different purpose: they avoid
the old hwdom_shutdown() path for SHUTDOWN_suspend and allow the hardware
domain to be parked as part of the suspend sequence. They do not define the
policy for who is allowed to trigger Xen-wide host suspend.

That said, this policy may not be optimal for all configurations, especially
when the control and hardware domain roles are split. I would appreciate your
view, as well as the maintainers' views, on whether the trigger should remain
control-domain based, be tied to the hardware domain instead, or be expressed
through a separate host-suspend capability/helper.

>
> > +        host_system_suspend(d);
> > +
> >     return rc;
> > }
> >
> > @@ -290,7 +293,10 @@ static int32_t do_psci_1_0_features(uint32_t psci_func_id)
> >         return 0;
> >     case PSCI_1_0_FN32_SYSTEM_SUSPEND:
> >     case PSCI_1_0_FN64_SYSTEM_SUSPEND:
> > -        return is_hardware_domain(current->domain) ? PSCI_NOT_SUPPORTED : 0;
> > +        if ( IS_ENABLED(CONFIG_SYSTEM_SUSPEND) ||
> > +             !is_hardware_domain(current->domain) )
>
> Should this have also the condition that “is hardware domain and psci_ver >=
> PSCI_VERSION(1, 0)”?
> Otherwise if the host machine doesn’t support PSCI 1.0 we would return OK
> here but the call would fail later in call_psci_system_suspend()?

Good point.

I agree that, for the domain allowed to trigger Xen-wide suspend, Xen should
not advertise SYSTEM_SUSPEND if the host suspend path cannot be used.

I think this should be checked as an explicit host SYSTEM_SUSPEND capability,
rather than only as psci_ver >= PSCI_VERSION(1, 0). The same capability check
also needs to be enforced in the actual SYSTEM_SUSPEND handler before parking
the domain, because a caller may invoke SYSTEM_SUSPEND directly without first
querying PSCI_FEATURES.

For ordinary guests, the physical PSCI version is not relevant because they
cannot trigger host suspend; their SYSTEM_SUSPEND path is virtual.

I will make this consistent in v9: PSCI_FEATURES will advertise SYSTEM_SUSPEND
for the host-suspend-triggering domain only when the host SYSTEM_SUSPEND backend
is available, and the actual SYSTEM_SUSPEND path will enforce the same check.
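
Roughly, the consistency I have in mind looks like this (plain C sketch with
invented names, not the actual Xen API; PSCI_VERSION/host state are stubbed
for illustration):

```c
#include <stdbool.h>
#include <stdint.h>

#define PSCI_VERSION(maj, min)  (((uint32_t)(maj) << 16) | (min))
#define PSCI_SUCCESS            0
#define PSCI_NOT_SUPPORTED      (-1)

/* Stand-ins for host state probed at boot. */
static uint32_t host_psci_ver = PSCI_VERSION(1, 0);
static bool system_suspend_enabled = true;   /* CONFIG_SYSTEM_SUSPEND */

/* Single source of truth: can the host actually do SYSTEM_SUSPEND? */
static bool host_system_suspend_available(void)
{
    return system_suspend_enabled &&
           host_psci_ver >= PSCI_VERSION(1, 0);
}

/* PSCI_FEATURES decision for SYSTEM_SUSPEND, per calling domain. */
static int32_t features_system_suspend(bool can_trigger_host_suspend)
{
    if ( !can_trigger_host_suspend )
        return PSCI_SUCCESS;     /* purely virtual suspend path */
    return host_system_suspend_available() ? PSCI_SUCCESS
                                           : PSCI_NOT_SUPPORTED;
}

/* The actual SYSTEM_SUSPEND handler enforces the same check before
 * parking the domain, since a caller may skip PSCI_FEATURES. */
static int32_t handle_system_suspend(bool can_trigger_host_suspend)
{
    return features_system_suspend(can_trigger_host_suspend);
}
```

The key point is that both entry points share one capability predicate, so
advertisement and enforcement cannot drift apart.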

>
> > +            return 0;
> > +        fallthrough;
> >     default:
> >         return PSCI_NOT_SUPPORTED;
> >     }
> > diff --git a/xen/common/Kconfig b/xen/common/Kconfig
> > index 0a20aa0a12..feb1336f46 100644
> > --- a/xen/common/Kconfig
> > +++ b/xen/common/Kconfig
> > @@ -137,6 +137,9 @@ config HAS_EX_TABLE
> > config HAS_FAST_MULTIPLY
> > bool
> >
> > +config HAS_HWDOM_SYSTEM_SUSPEND
> > + bool
> > +
> > config HAS_IOPORTS
> > bool
> >
> > diff --git a/xen/common/domain.c b/xen/common/domain.c
> > index bb9e210c28..d3edfb2a13 100644
> > --- a/xen/common/domain.c
> > +++ b/xen/common/domain.c
> > @@ -1375,6 +1375,11 @@ void __domain_crash(struct domain *d)
> >     domain_shutdown(d, SHUTDOWN_crash);
> > }
> >
> > +static inline bool want_hwdom_shutdown(uint8_t reason)
> > +{
> > +    return !IS_ENABLED(CONFIG_HAS_HWDOM_SYSTEM_SUSPEND) ||
> > +           reason != SHUTDOWN_suspend;
> > +}
> >
> > int domain_shutdown(struct domain *d, u8 reason)
> > {
> > @@ -1391,7 +1396,7 @@ int domain_shutdown(struct domain *d, u8 reason)
> >         d->shutdown_code = reason;
> >     reason = d->shutdown_code;
> >
> > -    if ( is_hardware_domain(d) )
> > +    if ( is_hardware_domain(d) && want_hwdom_shutdown(reason) )
> >         hwdom_shutdown(reason);
> >
> >     if ( d->is_shutting_down )
> > diff --git a/xen/drivers/passthrough/arm/smmu.c b/xen/drivers/passthrough/arm/smmu.c
> > index 22d306d0cb..45f29ef8ec 100644
> > --- a/xen/drivers/passthrough/arm/smmu.c
> > +++ b/xen/drivers/passthrough/arm/smmu.c
> > @@ -2947,6 +2947,13 @@ static void arm_smmu_iommu_domain_teardown(struct domain *d)
> > xfree(xen_domain);
> > }
> >
> > +#ifdef CONFIG_SYSTEM_SUSPEND
> > +static int arm_smmu_suspend(void)
> > +{
> > + return -ENOSYS;
> > +}
> > +#endif
>
> Maybe we want to gate the feature also to !CONFIG_ARM_SMMU? I would wait for
> the maintainers’ view on this.

I feel that gating this strictly on !CONFIG_ARM_SMMU might not be the best
approach here.

CONFIG_ARM_SMMU is a build-time option and does not mean that an old SMMUv1/v2
device is actually present. Using it would disable system suspend even on
platforms where only SMMUv3 is used, because CONFIG_ARM_SMMU is enabled by
default for Arm.

The condition should be runtime-based: whether the active/probed IOMMU devices
have system suspend/resume support. For the old ARM SMMU driver this is not
implemented today, so a platform with an SMMUv1/v2 instance should not expose
or attempt host suspend.

I think we should handle this by tracking whether any old ARM SMMUv1/v2 device
was actually probed, or by adding a generic IOMMU suspend capability check. Then
the host suspend availability check can reject system suspend only when such an
unsupported IOMMU is present, instead of disabling the feature for all Arm
builds with CONFIG_ARM_SMMU enabled.

I would be interested to hear if you or the maintainers see a better way to
express this capability.

Best regards,
Mykola

>
> Cheers,
> Luca
>
>



 

