Re: [PATCH] xen/arm64: flushtlb: Optimize ARM64_WORKAROUND_REPEAT_TLBI
Hi Michal,

> On 14 Apr 2026, at 09:11, Michal Orzel <michal.orzel@xxxxxxx> wrote:
>
> The ARM64_WORKAROUND_REPEAT_TLBI workaround is used to mitigate several
> errata where broadcast TLBI;DSB sequences don't provide all the
> architecturally required synchronization. The workaround performs more
> work than necessary, and can have significant overhead. This patch
> optimizes the workaround, as explained below.
>
> 1. All relevant errata only affect the ordering and/or completion of
>    memory accesses which have been translated by an invalidated TLB
>    entry. The actual invalidation of TLB entries is unaffected.
>
> 2. The existing workaround is applied to both broadcast and local TLB
>    invalidation, whereas for all relevant errata it is only necessary to
>    apply a workaround for broadcast invalidation.
>
> 3. The existing workaround replaces every TLBI with a TLBI;DSB;TLBI
>    sequence, whereas for all relevant errata it is only necessary to
>    execute a single additional TLBI;DSB sequence after any number of
>    TLBIs are completed by a DSB.
>
>    For example, for a sequence of batched TLBIs:
>
>        TLBI <op1>[, <arg1>]
>        TLBI <op2>[, <arg2>]
>        TLBI <op3>[, <arg3>]
>        DSB ISH
>
>    ... the existing workaround will expand this to:
>
>        TLBI <op1>[, <arg1>]
>        DSB ISH              // additional
>        TLBI <op1>[, <arg1>] // additional
>        TLBI <op2>[, <arg2>]
>        DSB ISH              // additional
>        TLBI <op2>[, <arg2>] // additional
>        TLBI <op3>[, <arg3>]
>        DSB ISH              // additional
>        TLBI <op3>[, <arg3>] // additional
>        DSB ISH
>
>    ... whereas it is sufficient to have:
>
>        TLBI <op1>[, <arg1>]
>        TLBI <op2>[, <arg2>]
>        TLBI <op3>[, <arg3>]
>        DSB ISH
>        TLBI <opX>[, <argX>] // additional
>        DSB ISH              // additional
>
>    Using a single additional TBLI and DSB at the end of the sequence can

NIT: Typo s/TBLI/TLBI

>    have significantly lower overhead as each DSB which completes a TLBI
>    must synchronize with other PEs in the system, with potential
>    performance effects both locally and system-wide.
>
> 4. The existing workaround repeats each specific TLBI operation, whereas
>    for all relevant errata it is sufficient for the additional TLBI to
>    use *any* operation which will be broadcast, regardless of which
>    translation regime or stage of translation the operation applies to.
>
>    For example, for a single TLBI:
>
>        TLBI ALLE2IS
>        DSB ISH
>
>    ... the existing workaround will expand this to:
>
>        TLBI ALLE2IS
>        DSB ISH
>        TLBI ALLE2IS // additional
>        DSB ISH      // additional
>
>    ... whereas it is sufficient to have:
>
>        TLBI ALLE2IS
>        DSB ISH
>        TLBI VALE1IS, XZR // additional
>        DSB ISH           // additional
>
>    As the additional TLBI doesn't have to match a specific earlier TLBI,
>    the additional TLBI can be implemented in separate code, with no
>    memory of the earlier TLBIs. The additional TLBI can also use a
>    cheaper TLBI operation.
>
> 5. The existing workaround is applied to both Stage-1 and Stage-2 TLB
>    invalidation, whereas for all relevant errata it is only necessary to
>    apply a workaround for Stage-1 invalidation.
>
>    Architecturally, TLBI operations which invalidate only Stage-2
>    information (e.g. IPAS2E1IS) are not required to invalidate TLB
>    entries which combine information from Stage-1 and Stage-2
>    translation table entries, and consequently may not complete memory
>    accesses translated by those combined entries. In these cases,
>    completion of memory accesses is only guaranteed after subsequent
>    invalidation of Stage-1 information (e.g. VMALLE1IS).
>
> Rework the workaround logic as follows:
> - add TLB_HELPER_LOCAL() to be used for local TLB ops without a
>   workaround,
> - modify TLB_HELPER() workaround to use tlbi vale2is, xzr as a second
>   TLB,

TLBI ?

> - drop TLB_HELPER_VA(). It's used only by __flush_xen_tlb_one_local
>   which is local and does not need workaround and by
>   __flush_xen_tlb_one. In the latter case, since it's used in a loop,
>   we don't need a workaround in the middle. Add __tlb_repeat_sync with
>   a workaround to be used at the end after DSB and before final ISB,
> - TLBI VALE2IS passing XZR is used as an additional TLBI. While there is
>   an identity mapping there, it's used very rarely. The performance
>   impact is therefore negligible. If things change in the future, we
>   can revisit the decision.
>
> Signed-off-by: Michal Orzel <michal.orzel@xxxxxxx>
> ---
> Linux counterpart (already merged):
> https://lore.kernel.org/linux-arm-kernel/20260218164348.2022831-1-mark.rutland@xxxxxxx/
> ---
>  xen/arch/arm/include/asm/arm32/flushtlb.h |   3 +
>  xen/arch/arm/include/asm/arm64/flushtlb.h | 108 ++++++++++++++--------
>  xen/arch/arm/include/asm/flushtlb.h       |   1 +
>  3 files changed, 71 insertions(+), 41 deletions(-)
>
> diff --git a/xen/arch/arm/include/asm/arm32/flushtlb.h b/xen/arch/arm/include/asm/arm32/flushtlb.h
> index 61c25a318998..5483be08fbbe 100644
> --- a/xen/arch/arm/include/asm/arm32/flushtlb.h
> +++ b/xen/arch/arm/include/asm/arm32/flushtlb.h
> @@ -57,6 +57,9 @@ static inline void __flush_xen_tlb_one(vaddr_t va)
>      asm volatile(STORE_CP32(0, TLBIMVAHIS) : : "r" (va) : "memory");
>  }
>
> +/* Only for ARM64_WORKAROUND_REPEAT_TLBI */
> +static inline void __tlb_repeat_sync(void) {}
> +
>  #endif /* __ASM_ARM_ARM32_FLUSHTLB_H__ */
>  /*
>   * Local variables:
> diff --git a/xen/arch/arm/include/asm/arm64/flushtlb.h b/xen/arch/arm/include/asm/arm64/flushtlb.h
> index 3b99c11b50d1..1606b26bf28a 100644
> --- a/xen/arch/arm/include/asm/arm64/flushtlb.h
> +++ b/xen/arch/arm/include/asm/arm64/flushtlb.h
> @@ -12,9 +12,14 @@
>   * ARM64_WORKAROUND_REPEAT_TLBI:
>   * Modification of the translation table for a virtual address might lead to
>   * read-after-read ordering violation.
> - * The workaround repeats TLBI+DSB ISH operation for all the TLB flush
> - * operations. While this is strictly not necessary, we don't want to
> - * take any risk.
> + * The workaround repeats TLBI+DSB ISH operation for broadcast TLB flush
> + * operations. The workaround is not needed for local operations.
> + *
> + * It is sufficient for the additional TLBI to use *any* operation which will
> + * be broadcast, regardless of which translation regime or stage of translation
> + * the operation applies to. TLBI VALE2IS is used passing XZR. While there is
> + * an identity mapping there, it's only used during suspend/resume, CPU on/off,
> + * so the impact (performance if any) is negligible.
>   *
>   * For Xen page-tables the ISB will discard any instructions fetched
>   * from the old mappings.
> @@ -26,69 +31,90 @@
>   * Note that for local TLB flush, using non-shareable (nsh) is sufficient
>   * (see D5-4929 in ARM DDI 0487H.a). Although, the memory barrier in
>   * for the workaround is left as inner-shareable to match with Linux
> - * v6.1-rc8.
> + * v6.19.
>   */
> -#define TLB_HELPER(name, tlbop, sh)                          \
> +#define TLB_HELPER_LOCAL(name, tlbop)                        \
>  static inline void name(void)                                \
>  {                                                            \
>      asm_inline volatile (                                    \
> -        "dsb " # sh "st;"                                    \
> +        "dsb nshst;"                                         \
>          "tlbi " # tlbop ";"                                  \
> -        ALTERNATIVE(                                         \
> -            "nop; nop;",                                     \
> -            "dsb ish;"                                       \
> -            "tlbi " # tlbop ";",                             \
> -            ARM64_WORKAROUND_REPEAT_TLBI,                    \
> -            CONFIG_ARM64_WORKAROUND_REPEAT_TLBI)             \
> -        "dsb " # sh ";"                                      \
> +        "dsb nsh;"                                           \
>          "isb;"                                               \
>          : : : "memory");                                     \
>  }
>
> -/*
> - * FLush TLB by VA. This will likely be used in a loop, so the caller
> - * is responsible to use the appropriate memory barriers before/after
> - * the sequence.
> - *
> - * See above about the ARM64_WORKAROUND_REPEAT_TLBI sequence.
> - */
> -#define TLB_HELPER_VA(name, tlbop)                           \
> -static inline void name(vaddr_t va)                          \
> -{                                                            \
> -    asm_inline volatile (                                    \
> -        "tlbi " # tlbop ", %0;"                              \
> -        ALTERNATIVE(                                         \
> -            "nop; nop;",                                     \
> -            "dsb ish;"                                       \
> -            "tlbi " # tlbop ", %0;",                         \
> -            ARM64_WORKAROUND_REPEAT_TLBI,                    \
> -            CONFIG_ARM64_WORKAROUND_REPEAT_TLBI)             \
> -        : : "r" (va >> PAGE_SHIFT) : "memory");              \
> +#define TLB_HELPER(name, tlbop)                              \
> +static inline void name(void)                                \
> +{                                                            \
> +    asm_inline volatile (                                    \
> +        "dsb ishst;"                                         \
> +        "tlbi " # tlbop ";"                                  \
> +        ALTERNATIVE(                                         \
> +            "nop; nop;",                                     \
> +            "dsb ish;"                                       \
> +            "tlbi vale2is, xzr;",                            \
> +            ARM64_WORKAROUND_REPEAT_TLBI,                    \
> +            CONFIG_ARM64_WORKAROUND_REPEAT_TLBI)             \
> +        "dsb ish;"                                           \
> +        "isb;"                                               \
> +        : : : "memory");                                     \
>  }
>
>  /* Flush local TLBs, current VMID only. */
> -TLB_HELPER(flush_guest_tlb_local, vmalls12e1, nsh)
> +TLB_HELPER_LOCAL(flush_guest_tlb_local, vmalls12e1)
>
>  /* Flush innershareable TLBs, current VMID only */
> -TLB_HELPER(flush_guest_tlb, vmalls12e1is, ish)
> +TLB_HELPER(flush_guest_tlb, vmalls12e1is)
>
>  /* Flush local TLBs, all VMIDs, non-hypervisor mode */
> -TLB_HELPER(flush_all_guests_tlb_local, alle1, nsh)
> +TLB_HELPER_LOCAL(flush_all_guests_tlb_local, alle1)
>
>  /* Flush innershareable TLBs, all VMIDs, non-hypervisor mode */
> -TLB_HELPER(flush_all_guests_tlb, alle1is, ish)
> +TLB_HELPER(flush_all_guests_tlb, alle1is)
>
>  /* Flush all hypervisor mappings from the TLB of the local processor. */
> -TLB_HELPER(flush_xen_tlb_local, alle2, nsh)
> +TLB_HELPER_LOCAL(flush_xen_tlb_local, alle2)
> +
> +#undef TLB_HELPER_LOCAL
> +#undef TLB_HELPER
> +
> +/*
> + * FLush TLB by VA. This will likely be used in a loop, so the caller
> + * is responsible to use the appropriate memory barriers before/after
> + * the sequence.
> + */
>
>  /* Flush TLB of local processor for address va. */
> -TLB_HELPER_VA(__flush_xen_tlb_one_local, vae2)
> +static inline void __flush_xen_tlb_one_local(vaddr_t va)
> +{
> +    asm_inline volatile (
> +        "tlbi vae2, %0" : : "r" (va >> PAGE_SHIFT) : "memory");
> +}
>
>  /* Flush TLB of all processors in the inner-shareable domain for address va. */
> -TLB_HELPER_VA(__flush_xen_tlb_one, vae2is)
> +static inline void __flush_xen_tlb_one(vaddr_t va)
> +{
> +    asm_inline volatile (
> +        "tlbi vae2is, %0" : : "r" (va >> PAGE_SHIFT) : "memory");
> +}
>
> -#undef TLB_HELPER
> -#undef TLB_HELPER_VA
> +/*
> + * ARM64_WORKAROUND_REPEAT_TLBI:
> + * For all relevant erratas it is only necessary to execute a single
> + * additional TLBI;DSB sequence after any number of TLBIs are completed by DSB.
> + */
> +static inline void __tlb_repeat_sync(void)
> +{
> +    asm_inline volatile (
> +        ALTERNATIVE(
> +            "nop; nop;",
> +            "tlbi vale2is, xzr;"
> +            "dsb ish;",
> +            ARM64_WORKAROUND_REPEAT_TLBI,
> +            CONFIG_ARM64_WORKAROUND_REPEAT_TLBI)
> +        : : : "memory");
> +}
>
>  #endif /* __ASM_ARM_ARM64_FLUSHTLB_H__ */
>  /*
> diff --git a/xen/arch/arm/include/asm/flushtlb.h b/xen/arch/arm/include/asm/flushtlb.h
> index e45fb6d97b02..c292c3c00d29 100644
> --- a/xen/arch/arm/include/asm/flushtlb.h
> +++ b/xen/arch/arm/include/asm/flushtlb.h
> @@ -65,6 +65,7 @@ static inline void flush_xen_tlb_range_va(vaddr_t va,
>          va += PAGE_SIZE;
>      }
>      dsb(ish); /* Ensure the TLB invalidation has completed */
> +    __tlb_repeat_sync();

More a question than a comment: should we have a comment on top of this
stating that it's deliberate to place it before the isb? Or should a
developer infer that from the code and from git blame?

>      isb();
>  }
>
> --
> 2.43.0
>

With these fixed:

Reviewed-by: Luca Fancellu <luca.fancellu@xxxxxxx>

Cheers,
Luca