
Re: [Xen-devel] Xen-unstable Linux 3.14-rc3 and 3.13 Network troubles



On 27/02/14 15:18, Wei Liu wrote:
> On Wed, Feb 26, 2014 at 04:11:23PM +0100, Sander Eikelenboom wrote:
>>
>> Wednesday, February 26, 2014, 10:14:42 AM, you wrote:
>>
>>
>>> Friday, February 21, 2014, 7:32:08 AM, you wrote:
>>
>>
>>>> On 2014/2/20 19:18, Sander Eikelenboom wrote:
>>>>> Thursday, February 20, 2014, 10:49:58 AM, you wrote:
>>>>>
>>>>>
>>>>>> On 2014/2/19 5:25, Sander Eikelenboom wrote:
>>>>>>> Hi All,
>>>>>>>
>>>>>>> I'm currently having some network troubles with Xen and recent linux 
>>>>>>> kernels.
>>>>>>>
>>>>>>> - When running with a 3.14-rc3 kernel in dom0 and a 3.13 kernel in domU
>>>>>>>     I get what seems to be described in this thread: 
>>>>>>> http://www.spinics.net/lists/netdev/msg242953.html
>>>>>>>
>>>>>>>     In the guest:
>>>>>>>     [57539.859584] net eth0: rx->offset: 0, size: 4294967295
>>>>>>>     [57539.859599] net eth0: rx->offset: 0, size: 4294967295
>>>>>>>     [57539.859605] net eth0: rx->offset: 0, size: 4294967295
>>>>>>>     [57539.859610] net eth0: Need more slots
>>>>>>>     [58157.675939] net eth0: Need more slots
>>>>>>>     [58725.344712] net eth0: Need more slots
>>>>>>>     [61815.849180] net eth0: rx->offset: 0, size: 4294967295
>>>>>>>     [61815.849205] net eth0: rx->offset: 0, size: 4294967295
>>>>>>>     [61815.849216] net eth0: rx->offset: 0, size: 4294967295
>>>>>>>     [61815.849225] net eth0: Need more slots
>>>>>> This issue is familiar... and I thought it got fixed.
>>>>>> From the original analysis of a similar issue I hit before, the root cause
>>>>>> is that netback still creates a response when the ring is full. I remember
>>>>>> that a larger MTU could trigger this issue before; what is the MTU size?
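
To make the slot accounting behind that question concrete, here is a small
standalone sketch (not netback code; PAGE_SIZE handling, the helper name and
the numbers are assumptions) of why a packet that spans more pages needs more
ring slots, and of the kind of guard that is apparently being missed when
responses are produced against a full ring:

    #include <stdio.h>

    #define PAGE_SIZE 4096u

    /* Rough estimate of the ring slots a linear buffer of 'len' bytes starting
     * at 'offset' within a page occupies once it is granted page by page. */
    static unsigned int slots_for_buffer(unsigned int offset, unsigned int len)
    {
            return (offset + len + PAGE_SIZE - 1) / PAGE_SIZE;
    }

    int main(void)
    {
            unsigned int free_slots = 2;                       /* pretend the shared ring is nearly full */
            unsigned int mtu1500 = slots_for_buffer(64, 1500); /* default MTU */
            unsigned int mtu9000 = slots_for_buffer(64, 9000); /* jumbo frame */

            printf("slots needed: MTU 1500 -> %u, MTU 9000 -> %u, free: %u\n",
                   mtu1500, mtu9000, free_slots);

            /* The guard under discussion: if the whole packet does not fit,
             * defer it instead of overrunning the ring with responses. */
            if (mtu9000 > free_slots)
                    printf("would defer the packet rather than overrun the ring\n");
            return 0;
    }
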
>>>>> In dom0, both the physical NICs and the guest vifs have MTU=1500.
>>>>> In domU, eth0 also has MTU=1500.
>>>>>
>>>>> So it's not jumbo frames ... just the same plain defaults everywhere.
>>>>>
>>>>> With the patch from Wei that solves the other issue, I'm still seeing the
>>>>> "Need more slots" issue on 3.14-rc3 + Wei's patch.
>>>>> I have extended the "need more slots" warning to also print cons, slots,
>>>>> max, rx->offset and size; hopefully that gives some more insight.
>>>>> It is indeed the VM where I had similar issues before; the primary
>>>>> thing this VM does is two simultaneous rsyncs (one push, one pull) with
>>>>> some gigabytes of data.
>>>>>
>>>>> This time it was also accompanied by a "grant_table.c:1857:d0 Bad grant
>>>>> reference" as seen below; I don't know whether it's a cause or an effect though.
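
The extended "need more slots" warning Sander describes above is trimmed from
this excerpt. Purely as a sketch (not the actual debug patch), the extra fields
could be printed from xen-netfront's response-processing path roughly like the
fragment below, assuming cons, slots, max and rx are the variables already in
scope there:

    if (unlikely(slots > max)) {
            if (net_ratelimit())
                    netdev_warn(dev,
                                "Need more slots: cons %u slots %u max %u rx->offset %u size %d\n",
                                cons, slots, max, rx->offset, rx->status);
            /* ... existing error handling unchanged ... */
    }
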
>>
>>>> The log "grant_table.c:1857:d0 Bad grant reference" was also seen before.
>>>> Probably the response overlaps the request and the grant copy returns an
>>>> error because of the wrong grant reference; netback then returns resp->status
>>>> set to XEN_NETIF_RSP_ERROR (-1), which is the 4294967295 printed above by the
>>>> frontend.
>>>> Would it be possible to print a log in xenvif_rx_action of netback to see
>>>> whether something is wrong with the max slots and used slots?
>>
>>>> Thanks
>>>> Annie
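
Annie's reading of the 4294967295 value can be checked in isolation: a response
status of XEN_NETIF_RSP_ERROR (-1) comes out as 4294967295 once the frontend
logs it as an unsigned 32-bit size. A minimal standalone check (the define
simply mirrors the value used by the netif interface header):

    #include <stdint.h>
    #include <stdio.h>

    #define XEN_NETIF_RSP_ERROR (-1)

    int main(void)
    {
            int16_t status = XEN_NETIF_RSP_ERROR;    /* signed status field in the rx response */
            unsigned int size = (unsigned int)status; /* logged by the frontend as an unsigned size */

            printf("size: %u\n", size);               /* prints 4294967295 */
            return 0;
    }
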
>>
>>> Looking more closely, these are perhaps two different issues ... the bad grant
>>> references do not happen at the same time as the netfront messages in the guest.
>>
>>> I added some debug patches to the kernel netback, netfront and Xen grant
>>> table code (see below).
>>> One of the things was to simplify the code behind the debug key that prints
>>> the grant tables: the present code takes too long to execute and brings down
>>> the box due to stalls and NMIs, so it now only prints
>>> the number of entries per domain.
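
The grant-table debug patch itself is not included in this excerpt. As a rough
sketch only (not the actual patch, and with the usual locking elided), the
simplification described above would boil down to printing one summary line per
domain instead of walking every entry:

    /* Sketch of a simplified grant-table debug-key handler: report only how
     * many grant entries each domain's table has grown to.  for_each_domain()
     * and nr_grant_entries() are as used in Xen's common/grant_table.c;
     * locking is intentionally omitted here. */
    static void gnttab_usage_print_all(unsigned char key)
    {
            struct domain *d;

            printk("gnttab summary [key '%c' pressed]\n", key);
            for_each_domain ( d )
                    printk("d%d: nr_grant_entries=%u\n",
                           d->domain_id, nr_grant_entries(d->grant_table));
    }
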
>>
>>
>>> Issue 1: grant_table.c:1858:d0 Bad grant reference
>>
>>> After running the box for just one night (with 15 VMs) I get these
>>> mentions of "Bad grant reference".
>>> The maptrack also seems to increase quite fast, and the number of entries
>>> seems to have gone up quite fast as well.
>>
>>> Most domains have just one disk (blkfront/blkback) and one NIC; a few have a
>>> second disk.
>>> The blk drivers use persistent grants, so I would assume they would reuse
>>> those and not increase the count (by much).
>>
> 
> As far as I can tell netfront has a pool of grant references and it
> will BUG_ON() if there are no grefs in the pool when you request one.
> Since your DomU didn't crash, I suspect the book-keeping is still
> intact.
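
The pool Wei refers to follows the grant-table pool API. As a sketch of the
pattern only (not the exact netfront source; NET_RX_RING_SIZE and the gref_head
naming are illustrative), exhaustion would indeed trip a BUG_ON() rather than
silently corrupt the book-keeping:

    /* Done once at setup: reserve a pool of grant references for the rx ring. */
    grant_ref_t gref_head;
    int ref;

    if (gnttab_alloc_grant_references(NET_RX_RING_SIZE, &gref_head) < 0)
            return -ENOMEM;

    /* Per rx buffer: claim one reference from the pool.  A failed claim
     * would crash the DomU immediately, which is Wei's point above. */
    ref = gnttab_claim_grant_reference(&gref_head);
    BUG_ON(ref < 0);
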
> 
>>> Domain 1 seems to have increased its nr_grant_entries from 2048 to 3072
>>> sometime during the night.
>>> Domain 7 is the domain that happens to produce the netfront messages.
>>
>>> I also don't get why it is reporting "Bad grant reference" for domain
>>> 0, which seems to have 0 active entries ...
>>> Also, is this number of grant entries "normal", or could it be a leak
>>> somewhere?
>>
> 
> I suppose Dom0 expanding its maptrack is normal. I see that as well when I
> increase the number of domains. But if it keeps increasing while the
> number of DomUs stays the same then it is not normal.

blkfront/blkback will allocate persistent grants on demand, so it's not
strange to see the number of grants increasing while the domain is
running (although it should reach a stable state at some point).

Roger.
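
A toy standalone illustration of Roger's point about on-demand allocation (not
the blkback code; the cap and the linear lookup here are just stand-ins for
blkback's own tracking and its configurable maximum) shows why the grant count
climbs at first and then flattens out once the working set of grant references
has been seen:

    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_PERSISTENT_GRANTS 64

    static unsigned int persistent[MAX_PERSISTENT_GRANTS];
    static unsigned int nr_persistent;

    /* Reuse an already-mapped grant if we have it, otherwise keep it persistently. */
    static bool get_persistent(unsigned int gref)
    {
            unsigned int i;

            for (i = 0; i < nr_persistent; i++)
                    if (persistent[i] == gref)
                            return true;            /* reuse, no new grant entry */

            if (nr_persistent == MAX_PERSISTENT_GRANTS)
                    return false;                   /* fall back to map/unmap per request */

            persistent[nr_persistent++] = gref;     /* first use: count goes up */
            return true;
    }

    int main(void)
    {
            unsigned int round, gref;

            /* The same working set of grant refs requested over and over:
             * the count climbs on the first pass and stays flat afterwards. */
            for (round = 0; round < 3; round++) {
                    for (gref = 100; gref < 140; gref++)
                            get_persistent(gref);
                    printf("after round %u: %u persistent grants\n", round, nr_persistent);
            }
            return 0;
    }
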
