Re: [Xen-devel] netback BUG_ON when using copy_skb=1



On 2013/10/16 19:10, Jan Beulich wrote:
>>>> On 16.10.13 at 06:13, jerry <jerry.lilijun@xxxxxxxxxx> wrote:
>> Hi Wei Liu,
>>
>> I am doing some network performance testing on Xen 4.1.2 with kernel 3.0,
>> and ran into a crash at BUG_ON(netbk->mmap_pages[idx] != page) in
>> netbk_gop_frag().
>>
>> By analyzing the module drivers/xen/netback,
> 
> You aren't looking at the upstream driver, are you? If so, Wei is
> very likely the wrong addressee.
> 
> Assuming that you instead talk of the SLE11 kernel, I can only
> point out that a problem in that code was found and fixed a
> couple of months ago (resulting in the BUG_ON() you quoted not
> being there anymore), so you're simply not looking at up-to-date
> code.
> 
> Jan
> 
>> I think the reason is as 
>> follows when sending packets from VM1 to VM2:
>> 1) The two netback threads (the first for VM1 sending, the second for VM2
>> receiving) run concurrently.
>> 2) The first netback thread does a delayed copy from a foreign granted
>> page to local memory when some outstanding packets have been pending too
>> long (more than HZ/2 jiffies).
>>    netbk->mmap_pages[idx] is then replaced with the newly allocated page.
>> 3) If the packets are forwarded to VM2 by the virtual switch,
>> netbk_gop_frag() is called in the second netback thread.
>>    That function checks whether each page in the skb's frags[] is foreign
>> in order to decide how to do the grant copy.
>> 4) If the page replacement happens after the foreign-page check in
>> netbk_gop_frag(), the BUG_ON triggers because the page from the skb's
>> frags[] differs from mmap_pages[idx].
>>
>> I tried using a spin_lock to protect the page access, but found no
>> appropriate solution.
>> How can this problem be fixed?  Could you share your opinion?
>>
>> In addition, I have tried turning off copy_skb. Then the vif netdevice may
>> not be released after shutting down the VM,
>> because outstanding packets hold a reference on the device for too long,
>> for some unknown reason.
>> One possible cause is that the NIC does not release packets after DMA.
>> Has anyone met such problems? Thanks.
>>
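
The ordering in steps 2) to 4) of the quoted report can be modeled outside
the kernel. The following is only a toy sketch of the check-then-use
sequence, not driver code: struct toy_netbk and every toy_* function are
hypothetical names, and the way the real driver decides that a page is
foreign is not reproduced here.

```c
#include <stddef.h>

/* Toy model of the ordering only; all names are hypothetical. */
#define TOY_SLOTS 4

struct toy_netbk {
    const void *mmap_pages[TOY_SLOTS];
};

/* Receive side, first step: decide whether a frag page is one of the
 * tracked foreign pages. Returns its slot index, or -1 if it is not. */
static int toy_foreign_index(const struct toy_netbk *nb, const void *page)
{
    for (int i = 0; i < TOY_SLOTS; i++)
        if (nb->mmap_pages[i] == page)
            return i;
    return -1;
}

/* Send side: the delayed copy swaps a newly allocated page into the slot. */
static void toy_delayed_copy(struct toy_netbk *nb, int idx,
                             const void *new_page)
{
    nb->mmap_pages[idx] = new_page;
}

/* Receive side, second step: the condition that the BUG_ON() tests. */
static int toy_bug_condition(const struct toy_netbk *nb, int idx,
                             const void *page)
{
    return nb->mmap_pages[idx] != page;
}
```

Calling toy_foreign_index(), then toy_delayed_copy() on the same slot, then
toy_bug_condition() with the original page gives a nonzero result, which is
the interleaving described in step 4).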

With copy_skb disabled, I found the reason why the vif net-device isn't
released after shutting down the VM.
Suppose that VM1 (vif1.0) sends packets to VM2 (vif2.0) through the virtual
switch.
1) VM2 runs Windows 2003 and had previously shut down for some unexpected
reason.
   After being created, VM2 stopped during boot at the dialog named
"Shutdown Event Tracker".
   That dialog waits for the user to explain why the computer shut down
unexpectedly.

2) VM2 already has vif2.0 created. Then I added a new vif net-device using
virsh commands.
   The new vif2.1 was not completely created (it has no interrupts), but its
state is running and its TX queue is started by default.
   The function connect() in xenbus.c hasn't been called for vif2.1. The
related information in xenstore is as follows:
linux-szRoyS:/ # xenstore-ls -f | grep 2 | grep state
/local/domain/0/device-model/2/state = "running"
/local/domain/0/backend/vbd/2/51712/state = "4"
/local/domain/0/backend/vbd/2/51760/state = "4"
/local/domain/0/backend/vif/2/0/state = "4"
/local/domain/0/backend/vif/2/1/state = "2"
/local/domain/0/backend/console/2/0/state = "1"
/local/domain/2/control/uvp/vm_state = "running"
/local/domain/2/device/vbd/51712/state = "4"
/local/domain/2/device/vbd/51760/state = "4"
/local/domain/2/device/vif/0/state = "4"
/local/domain/2/device/vif/1/state = "1"
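
For reading the numeric "state" values above: they are XenbusState values
from Xen's public io/xenbus.h header. A minimal sketch of the mapping
follows; the helper toy_xenbus_state_name() is a made-up name for
illustration and is not a function in the driver.

```c
#include <stddef.h>

/* XenbusState values as defined in Xen's public io/xenbus.h. */
enum xenbus_state {
    XenbusStateUnknown       = 0,
    XenbusStateInitialising  = 1,
    XenbusStateInitWait      = 2,
    XenbusStateInitialised   = 3,
    XenbusStateConnected     = 4,
    XenbusStateClosing       = 5,
    XenbusStateClosed        = 6,
    XenbusStateReconfiguring = 7,
    XenbusStateReconfigured  = 8
};

/* Hypothetical helper: map a numeric xenstore "state" value to its name. */
static const char *toy_xenbus_state_name(int state)
{
    static const char *const names[] = {
        "Unknown", "Initialising", "InitWait", "Initialised", "Connected",
        "Closing", "Closed", "Reconfiguring", "Reconfigured"
    };

    if (state < 0 || (size_t)state >= sizeof(names) / sizeof(names[0]))
        return "Invalid";
    return names[state];
}
```

Read this way, backend/vif/2/1 at "2" is InitWait and device/vif/1 at "1" is
Initialising, while vif2.0 is Connected ("4") on both sides.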

3) The KOBJ_ONLINE event was generated in backend_create_netif(), called
from netback_probe().
   This event causes the network script "vif-bridge" to run and add vif2.1
to the virtual switch.
   Packets from vif1.0 (VM1) are then forwarded or flooded to vif2.1 by the
virtual switch.
   vif2.1 drops these packets in netif_be_start_xmit() because it is not
netif_schedulable().

4) After setting vif2.1 down and then up again, the TX queue can't be
started in net_open() because the carrier is off.
   So its qdisc became fifo_qdic and the TX queue state became stopped.
   In this case, packets are held in the qdisc queue and can't be dequeued
in dequeue_skb() because vif2.1's TX queue is stopped.

5) If VM1 is destroyed, the packets from vif1.0 can't be released and vif1.0
can't be disconnected.
   vif1.0 remains unreleased until vif2.1 is set down.
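
The effect described in steps 4) and 5) can be pictured with a small
userspace model. This is only a sketch under simplifying assumptions, not
kernel code: struct toy_vif, struct toy_pkt, struct toy_queue and all toy_*
functions are hypothetical, and the real reference is held through the skb
rather than through a plain counter.

```c
#include <stddef.h>

/* Toy model, not kernel code: every name here is hypothetical. */
struct toy_vif { int refcnt; };            /* the sending vif (vif1.0) */
struct toy_pkt { struct toy_vif *owner; }; /* a packet pinning its sender */

#define TOY_QLEN 8
struct toy_queue {                         /* vif2.1's qdisc queue */
    struct toy_pkt *slots[TOY_QLEN];
    size_t len;
    int stopped;                           /* models a stopped TX queue */
};

/* Queue a packet; it holds a reference on its owner until released. */
static int toy_enqueue(struct toy_queue *q, struct toy_pkt *p,
                       struct toy_vif *owner)
{
    if (q->len == TOY_QLEN)
        return -1;
    p->owner = owner;
    owner->refcnt++;
    q->slots[q->len++] = p;
    return 0;
}

/* Release every queued packet and drop the references they hold. */
static size_t toy_release_all(struct toy_queue *q)
{
    size_t n = q->len;

    for (size_t i = 0; i < n; i++)
        q->slots[i]->owner->refcnt--;
    q->len = 0;
    return n;
}

/* Normal transmit path: nothing is dequeued while the queue is stopped. */
static size_t toy_run_queue(struct toy_queue *q)
{
    return q->stopped ? 0 : toy_release_all(q);
}

/* Setting the interface down: queued packets are purged regardless. */
static size_t toy_set_down(struct toy_queue *q)
{
    return toy_release_all(q);
}
```

With the queue marked stopped, toy_run_queue() releases nothing and the
owner's refcnt stays above zero; only toy_set_down() brings it back to zero,
which matches vif1.0 staying around until vif2.1 is set down.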

   This problem arises mainly because vif2.1 was not created successfully
and ended up in a strange state: running, but with its TX queue stopped.
   The function backend_create_netif() is called in two places,
netback_probe() and frontend_changed(). I think we can remove the
backend_create_netif() call in netback_probe(), so that the vif net-device
is only created once the front-end has changed to XenbusStateConnected.

   The patch is as follows:
--- drivers/xen/netback/xenbus.c.old    2013-10-26 16:23:07.000000000 +0800
+++ drivers/xen/netback/xenbus.c        2013-10-26 16:23:31.000000000 +0800
@@ -156,9 +156,6 @@
        if (err)
                goto fail;

-       /* This kicks hotplug scripts, so do it immediately. */
-       backend_create_netif(be);
-
        return 0;

 abort_transaction:

   What do you think of this change?

>> Best regards,
>> Jerry
>>
>>
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@xxxxxxxxxxxxx 
>> http://lists.xen.org/xen-devel 





 

