Re: [PATCH v4 16/21] VT-d: free all-empty page tables
On 10.05.2022 16:30, Roger Pau Monné wrote:
> On Mon, Apr 25, 2022 at 10:42:50AM +0200, Jan Beulich wrote:
>> When a page table ends up with no present entries left, it can be
>> replaced by a non-present entry at the next higher level. The page table
>> itself can then be scheduled for freeing.
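(To illustrate the idea in isolation: a minimal sketch of the emptiness
check; PTE_PRESENT and the table geometry are assumed placeholders here,
not the tree's actual definitions.)

#include <stdbool.h>
#include <stdint.h>

#define PTE_PRESENT    (UINT64_C(1) << 0) /* assumed present/read bit */
#define PTES_PER_TABLE 512                /* 512 8-byte entries per 4K table */

/* True if no entry is present, i.e. the table can be replaced by a
 * non-present entry at the next higher level and then be freed. */
static bool table_is_all_empty(const uint64_t *table)
{
    for ( unsigned int i = 0; i < PTES_PER_TABLE; ++i )
        if ( table[i] & PTE_PRESENT )
            return false;

    return true;
}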
>>
>> Note that while its output isn't used there yet,
>> pt_update_contig_markers() right away needs to be called in all places
>> where entries get updated, not just the one where entries get cleared.
>>
>> Note further that while pt_update_contig_markers() updates perhaps
>> several PTEs within the table, since these are changes to "avail" bits
>> only I do not think that cache flushing would be needed afterwards. Such
>> cache flushing (of entire pages, unless adding yet more logic to be more
>> selective) would be quite noticeable performance-wise (very prominent
>> during Dom0 boot).
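(For context, on IOMMUs without coherent page-walks such syncing amounts
to a clflush loop over the touched cache lines; a simplified sketch, not
the tree's actual iommu_sync_cache().)

#define CACHE_LINE_SIZE 64 /* assumed; the real size comes from CPUID */

/* Write the cache line(s) covering [addr, addr + size) back to memory
 * so a non-coherent IOMMU observes the update. */
static void sync_cache(const void *addr, unsigned long size)
{
    unsigned long start = (unsigned long)addr & ~(CACHE_LINE_SIZE - 1UL);
    unsigned long end   = (unsigned long)addr + size;

    for ( ; start < end; start += CACHE_LINE_SIZE )
        asm volatile ( "clflush %0" :: "m" (*(const char *)start) );
}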
>>
>> Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx>
>> ---
>> v4: Re-base over changes earlier in the series.
>> v3: Properly bound loop. Re-base over changes earlier in the series.
>> v2: New.
>> ---
>> The hang during boot on my Latitude E6410 (see the respective code
>> comment) was pretty close after iommu_enable_translation(). No errors,
>> no watchdog would kick in, just sometimes the first few pixel lines of
>> the next log message's (XEN) prefix would have made it out to the screen
>> (and there's no serial there). It took a lot of experimenting until I
>> figured out the workaround (which I consider ugly, but halfway acceptable).
>> I've been trying hard to make sure the workaround wouldn't be masking a
>> real issue, yet I'm still wary of it possibly doing so ... My best guess
>> at this point is that on these old IOMMUs the ignored bits 52...61
>> aren't really ignored for present entries, but also aren't "reserved"
>> enough to trigger faults. This guess is from having tried to set other
>> bits in this range (unconditionally, and with the workaround here in
>> place), which yielded the same behavior.
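(For reference, the bit range being talked about, as a sketch; the names
are of my choosing, not the tree's.)

#include <stdint.h>

/* VT-d paging-structure entries document bits 52..61 as ignored by
 * hardware, so software may keep metadata there (10 bits). */
#define PTE_IGN_SHIFT 52
#define PTE_IGN_MASK  (((UINT64_C(1) << 10) - 1) << PTE_IGN_SHIFT)

static uint64_t pte_set_avail(uint64_t pte, uint64_t val)
{
    return (pte & ~PTE_IGN_MASK) |
           ((val << PTE_IGN_SHIFT) & PTE_IGN_MASK);
}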
>
> Should we take Kevin's Reviewed-by as a heads-up that bits 52..61 on
> some(?) IOMMUs are not usable?
>
> Would be good if we could get a more formal response I think.
A more formal response would be nice, but given the age of the affected
hardware I don't expect anything more will be done there by Intel.
>> @@ -405,6 +408,9 @@ static uint64_t addr_to_dma_page_maddr(s
>>
>> write_atomic(&pte->val, new_pte.val);
>> iommu_sync_cache(pte, sizeof(struct dma_pte));
>> + pt_update_contig_markers(&parent->val,
>> + address_level_offset(addr, level),
>
> I think (unless previous patches in the series have changed this)
> there already is an 'offset' local variable that you could use.
The variable is clobbered by "IOMMU/x86: prefill newly allocate page
tables".
>> @@ -837,9 +843,31 @@ static int dma_pte_clear_one(struct doma
>>
>> old = *pte;
>> dma_clear_pte(*pte);
>> + iommu_sync_cache(pte, sizeof(*pte));
>> +
>> + while ( pt_update_contig_markers(&page->val,
>> + address_level_offset(addr, level),
>> + level, PTE_kind_null) &&
>> + ++level < min_pt_levels )
>> + {
>> + struct page_info *pg = maddr_to_page(pg_maddr);
>> +
>> + unmap_vtd_domain_page(page);
>> +
>> + pg_maddr = addr_to_dma_page_maddr(domain, addr, level, flush_flags,
>> + false);
>> + BUG_ON(pg_maddr < PAGE_SIZE);
>> +
>> + page = map_vtd_domain_page(pg_maddr);
>> + pte = &page[address_level_offset(addr, level)];
>> + dma_clear_pte(*pte);
>> + iommu_sync_cache(pte, sizeof(*pte));
>> +
>> + *flush_flags |= IOMMU_FLUSHF_all;
>> + iommu_queue_free_pgtable(hd, pg);
>> + }
>
> I think I'm setting myself up for trouble, but do we need to sync the
> cache for the lower level entries if higher level ones are to be
> changed?
>
> IOW, would it be fine to just flush the highest level modified PTE,
> as the lower level ones won't be reachable anyway?
I definitely want to err on the safe side here. If later we can
prove that some cache flush is unneeded, I'd be happy to see it
go away.
>> @@ -2182,8 +2210,21 @@ static int __must_check cf_check intel_i
>> }
>>
>> *pte = new;
>> -
>> iommu_sync_cache(pte, sizeof(struct dma_pte));
>> +
>> + /*
>> + * While the (ab)use of PTE_kind_table here allows saving some work in
>> + * the function, the main motivation for it is that it avoids a so far
>> + * unexplained hang during boot (while preparing Dom0) on a Westmere
>> + * based laptop.
>> + */
>> + pt_update_contig_markers(&page->val,
>> + address_level_offset(dfn_to_daddr(dfn), level),
>> + level,
>> + (hd->platform_ops->page_sizes &
>> + (1UL << level_to_offset_bits(level + 1))
>> + ? PTE_kind_leaf : PTE_kind_table));
>
> So this works because, on what we believe to be the affected models,
> the only supported page size is 4K?
Yes.
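(To make the quoted condition concrete: a sketch with
level_to_offset_bits() re-derived locally, assuming the usual geometry
of 12 offset bits plus 9 bits per level; on hardware whose page_sizes
mask only advertises 4K mappings this always yields PTE_kind_table.)

#include <stdint.h>

enum PTE_kind { PTE_kind_null, PTE_kind_leaf, PTE_kind_table };

/* Assumed geometry: 12 bits of page offset plus 9 bits per level. */
#define level_to_offset_bits(l) (12 + 9 * ((l) - 1))

static enum PTE_kind marker_kind(uint64_t page_sizes, unsigned int level)
{
    /* A superpage at the next level would cover this whole table, so
     * an all-contiguous table is then a coalescing candidate. */
    return (page_sizes & (UINT64_C(1) << level_to_offset_bits(level + 1)))
           ? PTE_kind_leaf : PTE_kind_table;
}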
> Do we want to do the same with AMD if we don't allow 512G super pages?
Why? They don't have a similar flaw.
Jan