
Re: [PATCH] xen/arm64: flushtlb: Optimize ARM64_WORKAROUND_REPEAT_TLBI


  • To: Michal Orzel <michal.orzel@xxxxxxx>
  • From: Luca Fancellu <Luca.Fancellu@xxxxxxx>
  • Date: Tue, 14 Apr 2026 14:00:32 +0000
  • Cc: "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Julien Grall <julien@xxxxxxx>, Bertrand Marquis <Bertrand.Marquis@xxxxxxx>, Volodymyr Babchuk <Volodymyr_Babchuk@xxxxxxxx>, Mark Rutland <Mark.Rutland@xxxxxxx>
  • Delivery-date: Tue, 14 Apr 2026 14:02:02 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
  • Thread-topic: [PATCH] xen/arm64: flushtlb: Optimize ARM64_WORKAROUND_REPEAT_TLBI

Hi Michal,

> On 14 Apr 2026, at 09:11, Michal Orzel <michal.orzel@xxxxxxx> wrote:
> 
> The ARM64_WORKAROUND_REPEAT_TLBI workaround is used to mitigate several
> errata where broadcast TLBI;DSB sequences don't provide all the
> architecturally required synchronization. The workaround performs more
> work than necessary, and can have significant overhead. This patch
> optimizes the workaround, as explained below.
> 
> 1. All relevant errata only affect the ordering and/or completion of
>   memory accesses which have been translated by an invalidated TLB
>   entry. The actual invalidation of TLB entries is unaffected.
> 
> 2. The existing workaround is applied to both broadcast and local TLB
>   invalidation, whereas for all relevant errata it is only necessary to
>   apply a workaround for broadcast invalidation.
> 
> 3. The existing workaround replaces every TLBI with a TLBI;DSB;TLBI
>   sequence, whereas for all relevant errata it is only necessary to
>   execute a single additional TLBI;DSB sequence after any number of
>   TLBIs are completed by a DSB.
> 
>   For example, for a sequence of batched TLBIs:
> 
>       TLBI <op1>[, <arg1>]
>       TLBI <op2>[, <arg2>]
>       TLBI <op3>[, <arg3>]
>       DSB ISH
> 
>   ... the existing workaround will expand this to:
> 
>       TLBI <op1>[, <arg1>]
>       DSB ISH                  // additional
>       TLBI <op1>[, <arg1>]     // additional
>       TLBI <op2>[, <arg2>]
>       DSB ISH                  // additional
>       TLBI <op2>[, <arg2>]     // additional
>       TLBI <op3>[, <arg3>]
>       DSB ISH                  // additional
>       TLBI <op3>[, <arg3>]     // additional
>       DSB ISH
> 
>   ... whereas it is sufficient to have:
> 
>       TLBI <op1>[, <arg1>]
>       TLBI <op2>[, <arg2>]
>       TLBI <op3>[, <arg3>]
>       DSB ISH
>       TLBI <opX>[, <argX>]     // additional
>       DSB ISH                  // additional
> 
>   Using a single additional TBLI and DSB at the end of the sequence can

NIT: Typo s/TBLI/TLBI

>   have significantly lower overhead as each DSB which completes a TLBI
>   must synchronize with other PEs in the system, with potential
>   performance effects both locally and system-wide.
> 
> 4. The existing workaround repeats each specific TLBI operation, whereas
>   for all relevant errata it is sufficient for the additional TLBI to
>   use *any* operation which will be broadcast, regardless of which
>   translation regime or stage of translation the operation applies to.
> 
>   For example, for a single TLBI:
> 
>       TLBI ALLE2IS
>       DSB ISH
> 
>   ... the existing workaround will expand this to:
> 
>       TLBI ALLE2IS
>       DSB ISH
>       TLBI ALLE2IS             // additional
>       DSB ISH                  // additional
> 
>   ... whereas it is sufficient to have:
> 
>       TLBI ALLE2IS
>       DSB ISH
>       TLBI VALE1IS, XZR        // additional
>       DSB ISH                  // additional
> 
>   As the additional TLBI doesn't have to match a specific earlier TLBI,
>   the additional TLBI can be implemented in separate code, with no
>   memory of the earlier TLBIs. The additional TLBI can also use a
>   cheaper TLBI operation.
> 
> 5. The existing workaround is applied to both Stage-1 and Stage-2 TLB
>   invalidation, whereas for all relevant errata it is only necessary to
>   apply a workaround for Stage-1 invalidation.
> 
>   Architecturally, TLBI operations which invalidate only Stage-2
>   information (e.g. IPAS2E1IS) are not required to invalidate TLB
>   entries which combine information from Stage-1 and Stage-2
>   translation table entries, and consequently may not complete memory
>   accesses translated by those combined entries. In these cases,
>   completion of memory accesses is only guaranteed after subsequent
>   invalidation of Stage-1 information (e.g. VMALLE1IS).
> 
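To make the Stage-2 point above concrete for readers (my own illustrative
sequence, not part of the patch), the pairing described would look roughly
like:

    TLBI IPAS2E1IS, <arg>    // invalidates Stage-2 information only
    DSB ISH                  // may not complete accesses via combined S1+S2 entries
    TLBI VMALLE1IS           // subsequent Stage-1 invalidation
    DSB ISH                  // completion of those accesses is now guaranteed
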
> Rework the workaround logic as follows:
> - add TLB_HELPER_LOCAL() to be used for local TLB ops without a
>   workaround,
> - modify TLB_HELPER() workaround to use tlbi vale2is, xzr as a second
>   TLB,

NIT: s/TLB/TLBI/ ?

> - drop TLB_HELPER_VA(). It's used only by __flush_xen_tlb_one_local
>   which is local and does not need workaround and by
>   __flush_xen_tlb_one. In the latter case, since it's used in a loop,
>   we don't need a workaround in the middle. Add __tlb_repeat_sync with
>   a workaround to be used at the end after DSB and before final ISB,
> - TLBI VALE2IS passing XZR is used as an additional TLBI. While there is
>   an identity mapping there, it's used very rarely. The performance
>   impact is therefore negligible. If things change in the future, we
>   can revisit the decision.
> 
> Signed-off-by: Michal Orzel <michal.orzel@xxxxxxx>
> ---
> Linux counterpart (already merged):
> https://lore.kernel.org/linux-arm-kernel/20260218164348.2022831-1-mark.rutland@xxxxxxx/
> ---
> xen/arch/arm/include/asm/arm32/flushtlb.h |   3 +
> xen/arch/arm/include/asm/arm64/flushtlb.h | 108 ++++++++++++++--------
> xen/arch/arm/include/asm/flushtlb.h       |   1 +
> 3 files changed, 71 insertions(+), 41 deletions(-)
> 
> diff --git a/xen/arch/arm/include/asm/arm32/flushtlb.h b/xen/arch/arm/include/asm/arm32/flushtlb.h
> index 61c25a318998..5483be08fbbe 100644
> --- a/xen/arch/arm/include/asm/arm32/flushtlb.h
> +++ b/xen/arch/arm/include/asm/arm32/flushtlb.h
> @@ -57,6 +57,9 @@ static inline void __flush_xen_tlb_one(vaddr_t va)
>     asm volatile(STORE_CP32(0, TLBIMVAHIS) : : "r" (va) : "memory");
> }
> 
> +/* Only for ARM64_WORKAROUND_REPEAT_TLBI */
> +static inline void __tlb_repeat_sync(void) {}
> +
> #endif /* __ASM_ARM_ARM32_FLUSHTLB_H__ */
> /*
>  * Local variables:
> diff --git a/xen/arch/arm/include/asm/arm64/flushtlb.h b/xen/arch/arm/include/asm/arm64/flushtlb.h
> index 3b99c11b50d1..1606b26bf28a 100644
> --- a/xen/arch/arm/include/asm/arm64/flushtlb.h
> +++ b/xen/arch/arm/include/asm/arm64/flushtlb.h
> @@ -12,9 +12,14 @@
>  * ARM64_WORKAROUND_REPEAT_TLBI:
>  * Modification of the translation table for a virtual address might lead to
>  * read-after-read ordering violation.
> - * The workaround repeats TLBI+DSB ISH operation for all the TLB flush
> - * operations. While this is strictly not necessary, we don't want to
> - * take any risk.
> + * The workaround repeats TLBI+DSB ISH operation for broadcast TLB flush
> + * operations. The workaround is not needed for local operations.
> + *
> + * It is sufficient for the additional TLBI to use *any* operation which will
> + * be broadcast, regardless of which translation regime or stage of translation
> + * the operation applies to. TLBI VALE2IS is used passing XZR. While there is
> + * an identity mapping there, it's only used during suspend/resume, CPU on/off,
> + * so the impact (performance if any) is negligible.
>  *
>  * For Xen page-tables the ISB will discard any instructions fetched
>  * from the old mappings.
> @@ -26,69 +31,90 @@
>  * Note that for local TLB flush, using non-shareable (nsh) is sufficient
>  * (see D5-4929 in ARM DDI 0487H.a). Although, the memory barrier in
>  * for the workaround is left as inner-shareable to match with Linux
> - * v6.1-rc8.
> + * v6.19.
>  */
> -#define TLB_HELPER(name, tlbop, sh)              \
> +#define TLB_HELPER_LOCAL(name, tlbop)            \
> static inline void name(void)                    \
> {                                                \
>     asm_inline volatile (                        \
> -        "dsb  "  # sh  "st;"                     \
> +        "dsb  nshst;"                            \
>         "tlbi "  # tlbop  ";"                    \
> -        ALTERNATIVE(                             \
> -            "nop; nop;",                         \
> -            "dsb  ish;"                          \
> -            "tlbi "  # tlbop  ";",               \
> -            ARM64_WORKAROUND_REPEAT_TLBI,        \
> -            CONFIG_ARM64_WORKAROUND_REPEAT_TLBI) \
> -        "dsb  "  # sh  ";"                       \
> +        "dsb  nsh;"                              \
>         "isb;"                                   \
>         : : : "memory");                         \
> }
> 
> -/*
> - * FLush TLB by VA. This will likely be used in a loop, so the caller
> - * is responsible to use the appropriate memory barriers before/after
> - * the sequence.
> - *
> - * See above about the ARM64_WORKAROUND_REPEAT_TLBI sequence.
> - */
> -#define TLB_HELPER_VA(name, tlbop)               \
> -static inline void name(vaddr_t va)              \
> -{                                                \
> -    asm_inline volatile (                        \
> -        "tlbi "  # tlbop  ", %0;"                \
> -        ALTERNATIVE(                             \
> -            "nop; nop;",                         \
> -            "dsb  ish;"                          \
> -            "tlbi "  # tlbop  ", %0;",           \
> -            ARM64_WORKAROUND_REPEAT_TLBI,        \
> -            CONFIG_ARM64_WORKAROUND_REPEAT_TLBI) \
> -        : : "r" (va >> PAGE_SHIFT) : "memory");  \
> +#define TLB_HELPER(name, tlbop)                       \
> +static inline void name(void)                         \
> +{                                                     \
> +    asm_inline volatile (                             \
> +        "dsb  ishst;"                                 \
> +        "tlbi "  # tlbop  ";"                         \
> +        ALTERNATIVE(                                  \
> +            "nop; nop;",                              \
> +            "dsb  ish;"                               \
> +            "tlbi vale2is, xzr;",                     \
> +            ARM64_WORKAROUND_REPEAT_TLBI,             \
> +            CONFIG_ARM64_WORKAROUND_REPEAT_TLBI)      \
> +        "dsb  ish;"                                   \
> +        "isb;"                                        \
> +        : : : "memory"); \
> }
> 
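Not a request, just to confirm my reading of the new TLB_HELPER() above:
with the workaround enabled, flush_guest_tlb() below would expand to
roughly (illustrative, not from the patch):

    dsb  ishst
    tlbi vmalls12e1is
    dsb  ish                 // completes the TLBI above
    tlbi vale2is, xzr        // single additional broadcast TLBI
    dsb  ish
    isb

... matching the single trailing TLBI;DSB described in point 3 of the
commit message.
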
> /* Flush local TLBs, current VMID only. */
> -TLB_HELPER(flush_guest_tlb_local, vmalls12e1, nsh)
> +TLB_HELPER_LOCAL(flush_guest_tlb_local, vmalls12e1)
> 
> /* Flush innershareable TLBs, current VMID only */
> -TLB_HELPER(flush_guest_tlb, vmalls12e1is, ish)
> +TLB_HELPER(flush_guest_tlb, vmalls12e1is)
> 
> /* Flush local TLBs, all VMIDs, non-hypervisor mode */
> -TLB_HELPER(flush_all_guests_tlb_local, alle1, nsh)
> +TLB_HELPER_LOCAL(flush_all_guests_tlb_local, alle1)
> 
> /* Flush innershareable TLBs, all VMIDs, non-hypervisor mode */
> -TLB_HELPER(flush_all_guests_tlb, alle1is, ish)
> +TLB_HELPER(flush_all_guests_tlb, alle1is)
> 
> /* Flush all hypervisor mappings from the TLB of the local processor. */
> -TLB_HELPER(flush_xen_tlb_local, alle2, nsh)
> +TLB_HELPER_LOCAL(flush_xen_tlb_local, alle2)
> +
> +#undef TLB_HELPER_LOCAL
> +#undef TLB_HELPER
> +
> +/*
> + * FLush TLB by VA. This will likely be used in a loop, so the caller
> + * is responsible to use the appropriate memory barriers before/after
> + * the sequence.
> + */
> 
> /* Flush TLB of local processor for address va. */
> -TLB_HELPER_VA(__flush_xen_tlb_one_local, vae2)
> +static inline void __flush_xen_tlb_one_local(vaddr_t va)
> +{
> +    asm_inline volatile (
> +        "tlbi vae2, %0" : : "r" (va >> PAGE_SHIFT) : "memory");
> +}
> 
> /* Flush TLB of all processors in the inner-shareable domain for address va. */
> -TLB_HELPER_VA(__flush_xen_tlb_one, vae2is)
> +static inline void __flush_xen_tlb_one(vaddr_t va)
> +{
> +    asm_inline volatile (
> +        "tlbi vae2is, %0" : : "r" (va >> PAGE_SHIFT) : "memory");
> +}
> 
> -#undef TLB_HELPER
> -#undef TLB_HELPER_VA
> +/*
> + * ARM64_WORKAROUND_REPEAT_TLBI:
> + * For all relevant erratas it is only necessary to execute a single
> + * additional TLBI;DSB sequence after any number of TLBIs are completed by DSB.
> + */
> +static inline void __tlb_repeat_sync(void)
> +{
> +    asm_inline volatile (
> +        ALTERNATIVE(
> +            "nop; nop;",
> +            "tlbi vale2is, xzr;"
> +            "dsb  ish;",
> +            ARM64_WORKAROUND_REPEAT_TLBI,
> +            CONFIG_ARM64_WORKAROUND_REPEAT_TLBI)
> +        : : : "memory");
> +}
> 
> #endif /* __ASM_ARM_ARM64_FLUSHTLB_H__ */
> /*
> diff --git a/xen/arch/arm/include/asm/flushtlb.h b/xen/arch/arm/include/asm/flushtlb.h
> index e45fb6d97b02..c292c3c00d29 100644
> --- a/xen/arch/arm/include/asm/flushtlb.h
> +++ b/xen/arch/arm/include/asm/flushtlb.h
> @@ -65,6 +65,7 @@ static inline void flush_xen_tlb_range_va(vaddr_t va,
>         va += PAGE_SIZE;
>     }
>     dsb(ish); /* Ensure the TLB invalidation has completed */
> +    __tlb_repeat_sync();

More a question than a comment: shall we have a comment on top of this call
stating that it is deliberate to place it before the isb? Or should a
developer infer that from the code and from git blame?
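Something along these lines is what I had in mind (just a sketch):

    dsb(ish); /* Ensure the TLB invalidation has completed */
    /*
     * Deliberately called after the DSB completing the TLBIs and before
     * the final ISB: the workaround only needs a single additional
     * TLBI;DSB once all batched TLBIs have been completed by a DSB.
     */
    __tlb_repeat_sync();
    isb();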

>     isb();
> }
> 
> -- 
> 2.43.0
> 
> 

With these fixed:

Reviewed-by: Luca Fancellu <luca.fancellu@xxxxxxx>

Cheers,
Luca


 

