
Re: [Xen-devel] [RFC PATCH] page_alloc: use first half of higher order chunks when halving



On Mon, Mar 31, 2014 at 08:25:43PM -0700, Matthew Rushton wrote:
> On 03/31/14 07:15, Konrad Rzeszutek Wilk wrote:
> >On Fri, Mar 28, 2014 at 03:06:23PM -0700, Matthew Rushton wrote:
> >>On 03/28/14 10:02, Konrad Rzeszutek Wilk wrote:
> >>>On Wed, Mar 26, 2014 at 03:15:42PM -0700, Matthew Rushton wrote:
> >>>>On 03/26/14 10:56, Konrad Rzeszutek Wilk wrote:
> >>>>>On Wed, Mar 26, 2014 at 10:47:44AM -0700, Matthew Rushton wrote:
> >>>>>>On 03/26/14 09:36, Konrad Rzeszutek Wilk wrote:
> >>>>>>>On Wed, Mar 26, 2014 at 08:59:04AM -0700, Matthew Rushton wrote:
> >>>>>>>>On 03/26/14 08:15, Matt Wilson wrote:
> >>>>>>>>>On Wed, Mar 26, 2014 at 11:08:01AM -0400, Konrad Rzeszutek Wilk 
> >>>>>>>>>wrote:
> >>>>>>>>>>Could you elaborate a bit more on the use-case please?
> >>>>>>>>>>My understanding is that most drivers use a scatter-gather
> >>>>>>>>>>list - in which case it does not matter if the underlying
> >>>>>>>>>>MFNs in the PFN space are not contiguous.
> >>>>>>>>>>
> >>>>>>>>>>But I presume the issue you are hitting is with drivers doing
> >>>>>>>>>>dma_map_page where the page is not 4KB but rather large (a
> >>>>>>>>>>compound page). Is that the problem you have observed?
> >>>>>>>>>Drivers are using very large size arguments to dma_alloc_coherent()
> >>>>>>>>>for things like RX and TX descriptor rings.
> >>>>>>>Large as in larger than 512kB? I believe that would also cause
> >>>>>>>problems on bare metal when the swiotlb is activated.
> >>>>>>I was looking at network IO performance so the buffers would not
> >>>>>>have been that large. I think large in this context is relative to
> >>>>>>the 4k page size and the odds of the buffer spanning a page
> >>>>>>boundary. For context I saw ~5-10% performance increase with guest
> >>>>>>network throughput by avoiding bounce buffers and also saw dom0 tcp
> >>>>>>streaming performance go from ~6Gb/s to over 9Gb/s on my test setup
> >>>>>>with a 10Gb NIC.
> >>>>>OK, but those would not be the dma_alloc_coherent() ones then? That
> >>>>>sounds more like the generic TCP mechanism allocating 64KB pages
> >>>>>instead of 4KB ones and using those.
> >>>>>
> >>>>>Did you try looking at this hack that Ian proposed a long time ago
> >>>>>to verify that it is said problem?
> >>>>>
> >>>>>https://lkml.org/lkml/2013/9/4/540
> >>>>>
> >>>>Yes, I had seen that and initially had the same reaction, but the
> >>>>change was relatively recent and not relevant here. I *think* all the
> >>>>coherent allocations are ok since the swiotlb makes them contiguous.
> >>>>The problem comes with the use of the streaming API. As one example,
> >>>>with jumbo frames enabled a driver might use larger rx buffers, which
> >>>>triggers the problem.
> >>>>
> >>>>I think the right thing to do is to make the dma streaming api work
> >>>>better with larger buffers on dom0. That way it works across all
> >>>OK.
> >>>>drivers and device types regardless of how they were designed.
> >>>Can you point me to an example of the DMA streaming API?
> >>>
> >>>I am not sure if you mean 'streaming API' as scatter gather operations
> >>>using DMA API?
> >>>
> >>>Is there a particularly easy way for me to reproduce this? I have
> >>>to say I hadn't enabled jumbo frames on my box since I am not even
> >>>sure the switch I have can do it. Is there an idiot-proof punch
> >>>list of how to reproduce this?
> >>>
> >>>Thanks!
> >>By streaming API I'm just referring to drivers that use
> >>dma_map_single/dma_unmap_single on every buffer instead of using
> >>coherent allocations. So not related to sg in my case. If you want
> >>an example of this you can look at the bnx2x Broadcom driver. To
> >>reproduce this at a minimum you'll need to have:
> >>
> >>1) Enough dom0 memory so it overlaps with PCI space and gets
> >>remapped by Linux at boot
> >Hm? Could you give a bit more detail? As in, is it this value:
> >
> >[    0.000000] Allocating PCI resources starting at 7f800000 (gap:
> >7f800000:7c800000)
> >
> >As in, that value should be in the PCI space, and I am not sure
> >how your dom0 memory overlaps. If you do say dom0_mem=max:3G,
> >the kernel will balloon out of the MMIO regions and the gaps (so
> >PCI space) and put that memory past the 4GB boundary. So the MMIO
> >regions remain MMIO regions.
> 
> You should see the message from xen_do_chunk() about adding pages
> back. Something along the lines of:
> 
> Populating 380000-401fb6 pfn range: 542250 pages added
> 
> These pages get added in reverse order (mfns reversed) without my
> proposed Xen change.
> 
> >>2) A driver that uses dma_map_single/dma_unmap_single
> >OK,
> >>3) Large enough buffers so that they span page boundaries
> >Um, right, so I think the get_order hack that was posted would
> >help with that, so buffers would not span page boundaries?
> 
> That patch doesn't apply in my case, but in principle you're right:
> any change that decreases buffers spanning page boundaries
> would limit bounce buffer usage.
> 
> >>Things that may help with 3 are enabling jumbos and various offload
> >>settings in either guests or dom0.
> >If you booted baremetal with 'iommu=soft swiotlb=force' the same
> >problem should show up - at least based on the 2) and 3) issue.
> >
> >Well, except that there are no guests but one should be able to trigger
> >this.
> 
> If that forces the use of bounce buffers then it would be a similar
> net result, if you wanted to see the performance overhead of doing
> the copies.
> 
> >What do you use for driving traffic? iperf with certain parameters?
> 
> I was using netperf. There weren't any magic params to trigger this;
> I believe I ran into the issue with the default TCP stream test.
> 
> 
> >
> >Thanks!
> 
> Are there any concerns about the proposed Xen change as a reasonable
> workaround for the current implementation? Thank you!

So I finally understood what the concern was about - the balloon
mechanics hand the pages back in the worst possible order. I am
wondering if there is something on the Linux side we can do to tell
Xen to give them to us in the proper order?

Could we swap the order of xen_do_chunk() so it starts from the end
and goes to the start? Would that help? Or maybe do an array of 512
chunks (I had a prototype patch like that floating around to speed
this up)?

> 
> >>>>>>>>>--msw
> >>>>>>>>It's the dma streaming api I've noticed the problem with, so
> >>>>>>>>dma_map_single(). Applicable swiotlb code would be
> >>>>>>>>xen_swiotlb_map_page() and range_straddles_page_boundary(). So yes
> >>>>>>>>for larger buffers it can cause bouncing.
> 

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

