Xen project Mailing List

Re: [Xen-devel] Xen-unstable Linux 3.14-rc3 and 3.13 Network troubles

To: Wei Liu <wei.liu2@xxxxxxxxxx>, Sander Eikelenboom <linux@xxxxxxxxxxxxxx>

From: Roger Pau Monné <roger.pau@xxxxxxxxxx>

Date: Thu, 27 Feb 2014 16:36:52 +0100

Cc: annie li <annie.li@xxxxxxxxxx>, Paul Durrant <Paul.Durrant@xxxxxxxxxx>, Zoltan Kiss <zoltan.kiss@xxxxxxxxxx>, xen-devel@xxxxxxxxxxxxx

Delivery-date: Thu, 27 Feb 2014 15:37:00 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

On 27/02/14 15:18, Wei Liu wrote:
> On Wed, Feb 26, 2014 at 04:11:23PM +0100, Sander Eikelenboom wrote:
>>
>> Wednesday, February 26, 2014, 10:14:42 AM, you wrote:
>>
>>
>>> Friday, February 21, 2014, 7:32:08 AM, you wrote:
>>
>>
>>>> On 2014/2/20 19:18, Sander Eikelenboom wrote:
>>>>> Thursday, February 20, 2014, 10:49:58 AM, you wrote:
>>>>>
>>>>>
>>>>>> On 2014/2/19 5:25, Sander Eikelenboom wrote:
>>>>>>> Hi All,
>>>>>>>
>>>>>>> I'm currently having some network troubles with Xen and recent linux
>>>>>>> kernels.
>>>>>>>
>>>>>>> - When running with a 3.14-rc3 kernel in dom0 and a 3.13 kernel in domU
>>>>>>> I get what seems to be described in this thread:
>>>>>>> http://www.spinics.net/lists/netdev/msg242953.html
>>>>>>>
>>>>>>> In the guest:
>>>>>>> [57539.859584] net eth0: rx->offset: 0, size: 4294967295
>>>>>>> [57539.859599] net eth0: rx->offset: 0, size: 4294967295
>>>>>>> [57539.859605] net eth0: rx->offset: 0, size: 4294967295
>>>>>>> [57539.859610] net eth0: Need more slots
>>>>>>> [58157.675939] net eth0: Need more slots
>>>>>>> [58725.344712] net eth0: Need more slots
>>>>>>> [61815.849180] net eth0: rx->offset: 0, size: 4294967295
>>>>>>> [61815.849205] net eth0: rx->offset: 0, size: 4294967295
>>>>>>> [61815.849216] net eth0: rx->offset: 0, size: 4294967295
>>>>>>> [61815.849225] net eth0: Need more slots
>>>>>> This issue is familiar... and I thought it get fixed.
>>>>>> From original analysis for similar issue I hit before, the root cause
>>>>>> is netback still creates response when the ring is full. I remember
>>>>>> larger MTU can trigger this issue before, what is the MTU size?
>>>>> In dom0 both for the physical nics and the guest vif's MTU=1500
>>>>> In domU the eth0 also has MTU=1500.
>>>>>
>>>>> So it's not jumbo frames .. just everywhere the same plain defaults ..
>>>>>
>>>>> With the patch from Wei that solves the other issue, i'm still seeing the
>>>>> Need more slots issue on 3.14-rc3+wei's patch now.
>>>>> I have extended the "need more slots warn" to also print the cons, slots,
>>>>> max, rx->offset, size, hope that gives some more insight.
>>>>> But it indeed is the VM were i had similar issues before, the primary
>>>>> thing this VM does is 2 simultaneous rsync's (one push one pull) with
>>>>> some gigabytes of data.
>>>>>
>>>>> This time it was also acompanied by a "grant_table.c:1857:d0 Bad grant
>>>>> reference " as seen below, don't know if it's a cause or a effect though.
>>
>>>> The log "grant_table.c:1857:d0 Bad grant reference " was also seen before.
>>>> Probably the response overlaps the request and grantcopy return error
>>>> when using wrong grant reference, Netback returns resp->status with
>>>> XEN_NETIF_RSP_ERROR(-1) which is 4294967295 printed above from frontend.
>>>> Would it be possible to print log in xenvif_rx_action of netback to see
>>>> whether something wrong with max slots and used slots?
>>
>>>> Thanks
>>>> Annie
>>
>>> Looking more closely it are perhaps 2 different issues ... the bad grant
>>> references do not happen
>>> at the same time as the netfront messages in the guest.
>>
>>> I added some debugpatches to the kernel netback, netfront and xen
>>> granttable code (see below)
>>> One of the things was to simplify the code for the debug key to print the
>>> granttables, the present code
>>> takes too long to execute and brings down the box due to stalls and NMI's.
>>> So it now only prints
>>> the nr of entries per domain.
>>
>>
>>> Issue 1: grant_table.c:1858:d0 Bad grant reference
>>
>>> After running the box for just one night (with 15 VM's) i get these
>>> mentions of "Bad grant reference".
>>> The maptrack also seems to increase quite fast and the number of entries
>>> seem to have gone up quite fast as well.
>>
>>> Most domains have just one disk(blkfront/blkback) and one nic, a few have a
>>> second disk.
>>> The blk drivers use persistent grants so i would assume it would reuse
>>> those and not increase it (by much).
>>
>
> As far as I can tell netfront has a pool of grant references and it
> will BUG_ON() if there's no grefs in the pool when you request one.
> Since your DomU didn't crash so I suspect the book-keeping is still
> intact.
>
>>> Domain 1 seems to have increased it's nr_grant_entries from 2048 to 3072
>>> somewhere this night.
>>> Domain 7 is the domain that happens to give the netfront messages.
>>
>>> I also don't get why it is reporting the "Bad grant reference" for domain
>>> 0, which seems to have 0 active entries ..
>>> Also is this amount of grant entries "normal" ? or could it be a leak
>>> somewhere ?
>>
>
> I suppose Dom0 expanding its maptrack is normal. I see as well when I
> increase the number of domains. But if it keeps increasing while the
> number of DomUs stay the same then it is not normal.

blkfront/blkback will allocate persistent grants on demand, so it's not
strange to see the number of grants increasing while the domain is
running (although it should reach a stable state at some point).

Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
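
Roger's remark that persistent grants are allocated on demand and should eventually settle can be illustrated with a toy model. This is only a sketch and not blkfront/blkback code: the class `PersistentGrantPool`, the `simulate` helper, and the request sizes are all hypothetical. It assumes a front end that reuses an already granted page whenever one is free and grants a new page only when none is.

```python
class PersistentGrantPool:
    """Toy model of on-demand persistent grants (hypothetical, not Xen code)."""

    def __init__(self):
        self.free = []    # granted pages that are currently idle
        self.total = 0    # grants allocated so far; never shrinks in this model

    def acquire(self, n):
        """Return n granted pages, reusing idle ones before allocating new ones."""
        pages = []
        for _ in range(n):
            if self.free:
                pages.append(self.free.pop())
            else:
                pages.append(self.total)  # a new grant, represented by a counter
                self.total += 1
        return pages

    def release(self, pages):
        """A request completed: keep its pages granted for later reuse."""
        self.free.extend(pages)


def simulate(request_sizes, in_flight=2):
    """Issue requests with at most `in_flight` outstanding.

    Returns the total number of grants after each request is issued.
    """
    pool = PersistentGrantPool()
    outstanding = []
    history = []
    for size in request_sizes:
        if len(outstanding) == in_flight:
            pool.release(outstanding.pop(0))  # the oldest request completes
        outstanding.append(pool.acquire(size))
        history.append(pool.total)
    return history


if __name__ == "__main__":
    print(simulate([3, 5, 4, 5, 2, 5, 5, 1, 5, 5]))
```

With these made-up inputs the count climbs during the first requests and then stops at 10, the largest number of pages that can be in use at once (2 requests of 5 pages). A count that kept growing under a steady workload would not fit this model.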

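The "size: 4294967295" lines in the guest log are consistent with Annie's explanation that a status of XEN_NETIF_RSP_ERROR (-1) is being printed as an unsigned number. The sketch below reproduces only the arithmetic, using Python's `ctypes`. It assumes the status is a signed 16-bit field that is widened to 32 bits before being printed as unsigned; the function name `as_logged` is made up for illustration.

```python
import ctypes

XEN_NETIF_RSP_ERROR = -1  # value as quoted in the thread


def as_logged(status):
    """Show a signed 16-bit status as it appears when printed as unsigned 32-bit."""
    signed = ctypes.c_int16(status).value   # assumed field width: signed 16 bits
    return ctypes.c_uint32(signed).value    # widened, then reinterpreted as unsigned


if __name__ == "__main__":
    print(as_logged(XEN_NETIF_RSP_ERROR))  # 4294967295
    print(as_logged(1500))                 # 1500
```

A non-negative status such as a real packet length passes through unchanged, so only the error value produces the implausibly large "size".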