
Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.



On 01/12/15 15:24, Konrad Rzeszutek Wilk wrote:
> On Tue, Dec 01, 2015 at 10:34:17AM +0000, Andrew Cooper wrote:
>> On 30/11/15 21:22, Konrad Rzeszutek Wilk wrote:
>>> On Thu, Nov 26, 2015 at 01:55:57PM +0000, Andrew Cooper wrote:
>>>> On 26/11/15 13:48, Malcolm Crossley wrote:
>>>>> On 26/11/15 13:46, Jan Beulich wrote:
>>>>>>>>> On 25.11.15 at 11:28, <andrew.cooper3@xxxxxxxxxx> wrote:
>>>>>>> The problem is that SandyBridge IOMMUs advertise 2M support and do
>>>>>>> function with it, but cannot cache 2MB translations in the IOTLBs.
>>>>>>>
>>>>>>> As a result, attempting to use 2M translations causes substantially
>>>>>>> worse performance than 4K translations.
>>>>>> Btw - how does this get explained? At first glance, even if 2MB
>>>>>> translations don't get entered into the TLB, it should still be one
>>>>>> less page table level to walk for the IOMMU, and should hence
>>>>>> nevertheless be a benefit. Yet you even report _substantially_
>>>>>> worse performance.
>>>>> There is an IOTLB for 4K translations, so if you only use 4K
>>>>> translations then you get to take advantage of the IOTLB.
>>>>>
>>>>> If you use the 2MB translation then a page table walk has to be
>>>>> performed every time there's a DMA access to that region of the BFN
>>>>> address space.
>>>> Also remember that a high-level DMA access (from the point of view of
>>>> a driver) will be fragmented at the PCIe max payload size, which is
>>>> typically 256 bytes.
>>>>
>>>> So by not caching the 2MB translation, a DMA access of 4K may undergo
>>>> 16 pagetable walks, one for each PCIe packet.
>>>>
>>>> We observed that using 2MB mappings results in a 40% overhead, compared
>>>> to using 4K mappings, from the point of view of a sample network
>>>> workload.
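
To put rough numbers on the fragmentation point quoted above: if an
uncached 2MB translation really does cost one page-table walk per PCIe
packet, then the walks per transfer are simply the transfer size divided
by the max payload size. A minimal sketch of that arithmetic (the
256-byte payload and the walk-per-packet assumption come from the
discussion above; nothing here is measured):

#include <stdio.h>

/*
 * Back-of-the-envelope estimate of IOMMU page-table walks per DMA
 * transfer when the translation is not held in the IOTLB, assuming
 * every PCIe packet needs a fresh walk (as argued above for 2MB
 * mappings on SandyBridge).  Purely illustrative.
 */
static unsigned int walks_per_dma(unsigned int dma_bytes,
                                  unsigned int pcie_max_payload)
{
    /* One walk per PCIe packet; round up for a trailing partial packet. */
    return (dma_bytes + pcie_max_payload - 1) / pcie_max_payload;
}

int main(void)
{
    /* 4KB DMA with a 256-byte max payload => 16 walks, as above. */
    printf("4KB DMA, 256B payload: %u walks\n", walks_per_dma(4096, 256));
    return 0;
}

With a cached 4KB translation the same transfer would mostly hit the
IOTLB instead, which is presumably where the 40% gap comes from.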
>>> How did you observe this? I am mighty curious what kind of performance
>>> tools you used to find this, as I would love to figure out whether some
>>> of the issues we have seen are related to this.
>> The 40% difference is just in terms of network throughput of a VF, given
>> a workload which can normally saturate line rate on the card.
> I understand that.
>
> But I am curious how you found out that the page walks by the IOMMU were
> so excessive.

I didn't.  It is all speculation drawn from other information.

The manual states that there is no superpage IOTLB.

This leaves two options:
1) 2M mappings are entirely uncached
2) 2M mappings are shattered into 4K mappings and cached

The fact that there is a 40% performance reduction suggests 1) rather than 2).
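
In quirk form, that conclusion is essentially: even where the capability
register advertises 2MB superpage support, SandyBridge and earlier parts
should not use it for IOMMU mappings and should fall back to 4K. A
minimal standalone sketch of that shape (the helper names, the cpu_id
struct and the model check are illustrative only, not the actual Xen
patch):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical CPU identification; real code would read CPUID. */
struct cpu_id {
    unsigned int vendor;   /* 0 == Intel, for this sketch */
    unsigned int family;
    unsigned int model;
};

static bool is_sandybridge_or_earlier(const struct cpu_id *c)
{
    /*
     * SandyBridge is family 6, models 0x2a/0x2d; treat lower model
     * numbers within family 6 as "earlier" for the purposes of this
     * sketch (a real check would be more careful).
     */
    return c->vendor == 0 && c->family == 6 && c->model <= 0x2d;
}

/* cap_bits: VT-d capability register contents (simplified). */
static bool iommu_superpages_usable(const struct cpu_id *c, uint64_t cap_bits)
{
    const uint64_t CAP_SL_2MB = 1ULL << 34;  /* 2MB bit of the SLLPS field */

    if ( !(cap_bits & CAP_SL_2MB) )
        return false;    /* hardware does not even advertise 2MB */

    if ( is_sandybridge_or_earlier(c) )
        return false;    /* advertised, but uncached in the IOTLB: avoid */

    return true;
}

int main(void)
{
    /* Example: SandyBridge-EP (family 6, model 0x2d) advertising 2MB. */
    struct cpu_id snb = { .vendor = 0, .family = 6, .model = 0x2d };

    printf("2MB superpages usable: %s\n",
           iommu_superpages_usable(&snb, 1ULL << 34) ? "yes" : "no");
    return 0;
}

The point of the sketch is only the decision itself: an advertised 2MB
capability is necessary but not sufficient, and the model check is what
the quirk adds.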

~Andrew
