
Re: [Xen-devel] [PATCH] iommu/quirk: disable shared EPT for Sandybridge and earlier processors.



On 01/12/15 15:24, Konrad Rzeszutek Wilk wrote:
> On Tue, Dec 01, 2015 at 10:34:17AM +0000, Andrew Cooper wrote:
>> On 30/11/15 21:22, Konrad Rzeszutek Wilk wrote:
>>> On Thu, Nov 26, 2015 at 01:55:57PM +0000, Andrew Cooper wrote:
>>>> On 26/11/15 13:48, Malcolm Crossley wrote:
>>>>> On 26/11/15 13:46, Jan Beulich wrote:
>>>>>>>>> On 25.11.15 at 11:28, <andrew.cooper3@xxxxxxxxxx> wrote:
>>>>>>> The problem is that SandyBridge IOMMUs advertise 2M support and do
>>>>>>> function with it, but cannot cache 2MB translations in the IOTLBs.
>>>>>>>
>>>>>>> As a result, attempting to use 2M translations causes substantially
>>>>>>> worse performance than 4K translations.
>>>>>> Btw - how does this get explained? At first glance, even if 2MB
>>>>>> translations don't get entered into the TLB, it should still be one
>>>>>> less page table level to walk for the IOMMU, and should hence
>>>>>> nevertheless be a benefit. Yet you even report _substantially_
>>>>>> worse performance.
>>>>> There is an IOTLB for 4K translations, so if you only use 4K
>>>>> translations then you get to take advantage of the IOTLB.
>>>>>
>>>>> If you use the 2MB translation then a page table walk has to be
>>>>> performed every time there's a DMA access to that region of the BFN
>>>>> address space.
>>>> Also remember that a high-level DMA access (from the point of view of
>>>> a driver) will be fragmented at the PCIe max payload size, which is
>>>> typically 256 bytes.
>>>>
>>>> So by not caching the 2MB translation, a DMA access of 4K may undergo
>>>> 16 pagetable walks, one for each PCIe packet.
>>>>
>>>> We observed that using 2MB mappings results in a 40% overhead, compared
>>>> to using 4K mappings, from the point of view of a sample network
>>>> workload.
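
To put rough numbers on the fragmentation point quoted above: if an
uncached 2MB translation really does cost one page-table walk per PCIe
packet, then the walks per transfer are simply the transfer size divided
by the max payload size. A minimal sketch of that arithmetic (the
256-byte payload and the walk-per-packet assumption come from the
discussion above; nothing here is measured):

#include <stdio.h>

/*
 * Back-of-the-envelope estimate of IOMMU page-table walks per DMA
 * transfer when the translation is not held in the IOTLB, assuming
 * every PCIe packet needs a fresh walk (as argued above for 2MB
 * mappings on SandyBridge).  Purely illustrative.
 */
static unsigned int walks_per_dma(unsigned int dma_bytes,
                                  unsigned int pcie_max_payload)
{
    /* One walk per PCIe packet; round up for a trailing partial packet. */
    return (dma_bytes + pcie_max_payload - 1) / pcie_max_payload;
}

int main(void)
{
    /* 4KB DMA with a 256-byte max payload => 16 walks, as above. */
    printf("4KB DMA, 256B payload: %u walks\n", walks_per_dma(4096, 256));
    return 0;
}

With a cached 4KB translation the same transfer would mostly hit the
IOTLB instead, which is presumably where the 40% gap comes from.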
>>> How did you observe this? I am mighty curious what kind of performance
>>> tools you used to find this, as I would love to figure out whether some
>>> of the issues we have seen are related to this.
>> The 40% difference is just in terms of network throughput of a VF, given
>> a workload which can normally saturate line rate on the card.
> I understand that.
>
> But I am curious how you found out that the page walks by the IOMMU were
> so excessive.

I didn't.  It is all speculation drawn from other information.

The manual states that there is no superpage IOTLB.

This leaves two options:
1) 2M mappings are entirely uncached
2) 2M mappings are shattered into 4K mappings and cached

The fact that there is a 40% performance reduction suggests 1) rather than 2).
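
In quirk form, that conclusion is essentially: even where the capability
register advertises 2MB superpage support, SandyBridge and earlier parts
should not use it for IOMMU mappings and should fall back to 4K. A
minimal standalone sketch of that shape (the helper names, the cpu_id
struct and the model check are illustrative only, not the actual Xen
patch):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical CPU identification; real code would read CPUID. */
struct cpu_id {
    unsigned int vendor;   /* 0 == Intel, for this sketch */
    unsigned int family;
    unsigned int model;
};

static bool is_sandybridge_or_earlier(const struct cpu_id *c)
{
    /*
     * SandyBridge is family 6, models 0x2a/0x2d; treat lower model
     * numbers within family 6 as "earlier" for the purposes of this
     * sketch (a real check would be more careful).
     */
    return c->vendor == 0 && c->family == 6 && c->model <= 0x2d;
}

/* cap_bits: VT-d capability register contents (simplified). */
static bool iommu_superpages_usable(const struct cpu_id *c, uint64_t cap_bits)
{
    const uint64_t CAP_SL_2MB = 1ULL << 34;  /* 2MB bit of the SLLPS field */

    if ( !(cap_bits & CAP_SL_2MB) )
        return false;    /* hardware does not even advertise 2MB */

    if ( is_sandybridge_or_earlier(c) )
        return false;    /* advertised, but uncached in the IOTLB: avoid */

    return true;
}

int main(void)
{
    /* Example: SandyBridge-EP (family 6, model 0x2d) advertising 2MB. */
    struct cpu_id snb = { .vendor = 0, .family = 6, .model = 0x2d };

    printf("2MB superpages usable: %s\n",
           iommu_superpages_usable(&snb, 1ULL << 34) ? "yes" : "no");
    return 0;
}

The point of the sketch is only the decision itself: an advertised 2MB
capability is necessary but not sufficient, and the model check is what
the quirk adds.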

~Andrew
