[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] rcu_sched self-detect stall when disable vif device



On 28/01/15 17:27, Julien Grall wrote:
> On 28/01/15 17:06, David Vrabel wrote:
>> On 28/01/15 16:45, Julien Grall wrote:
>>> On 27/01/15 16:53, Wei Liu wrote:
>>>> On Tue, Jan 27, 2015 at 04:47:45PM +0000, Julien Grall wrote:
>>>>> On 27/01/15 16:45, Wei Liu wrote:
>>>>>> On Tue, Jan 27, 2015 at 04:03:52PM +0000, Julien Grall wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> While I'm working on support for 64K page in netfront, I got
>>>>>>> an rcu_sced self-detect message. It happens when netback is
>>>>>>> disabling the vif device due to an error.
>>>>>>>
>>>>>>> I'm using Linux 3.19-rc5 on seattle (ARM64). Any idea why
>>>>>>> the processor is stucked in xenvif_rx_queue_purge?
>>>>>>>
>>>>>>
>>>>>> When you try to release a SKB, core network driver need to enter some
>>>>>> RCU cirital region to clean up. dst_release for one, calls call_rcu.
>>>>>
>>>>> But this message shouldn't happen in normal condition or because of
>>>>> netfront. Right?
>>>>>
>>>>
>>>> Never saw  report like this before, even in the case that netfront is
>>>> buggy.
>>>
>>> This is only happening when preemption is not enabled (i.e
>>> CONFIG_PREEMPT_NONE in the config file) in the backend kernel.
>>>
>>> When the vif is disabled, the loop in xenvif_kthread_guest_rx turned
>>> into an infinite loop. In my case, the code executed looks like:
>>>
>>>
>>>  1. for (;;) {
>>>  2.         xenvif_wait_for_rx_work(queue);
>>>  3.
>>>  4. if (kthread_should_stop())
>>>  5.         break;
>>>  6.
>>>  7. if (unlikely(vif->disabled && queue->id == 0) {
>>>  8.         xenvif_carrier_off(vif);
>>>  9.         xenvif_rx_queue_purge(queue);
>>> 10.         continue;
>>> 11. }
>>> 12. }
>>>
>>> The wait on line 2 will return directly because the vif is disabled
>>> (see xenvif_have_rx_work)
>>>
>>> We are on queue 0, so the condition on line 7 is true. Therefore we will
>>> loop on line 10. And so on...
>>>
>>> On platform where preemption is not enabled, this thread will never
>>> yield/give the hand to another thread (unless the domain is destroyed).
>>
>> I'm not sure why we have a continue in the vif->disabled case and not
>> just a break.  Can you try that?
> 
> So I applied this small patches:
> 
> diff --git a/drivers/net/xen-netback/netback.c 
> b/drivers/net/xen-netback/netback.c
> index 908e65e..9448c6c 100644
> --- a/drivers/net/xen-netback/netback.c
> +++ b/drivers/net/xen-netback/netback.c
> @@ -2110,7 +2110,7 @@ int xenvif_kthread_guest_rx(void *data)
>                 if (unlikely(vif->disabled && queue->id == 0)) {
>                         xenvif_carrier_off(vif);
>                         xenvif_rx_queue_purge(queue);
> -                       continue;
> +                       break;
>                 }
>  
>                 if (!skb_queue_empty(&queue->rx_queue))

How about this?

8<------------------------------------------
xen-netback: stop the guest rx thread after a fatal error

After commit e9d8b2c2968499c1f96563e6522c56958d5a1d0d (xen-netback:
disable rogue vif in kthread context), a fatal (protocol) error would
leave the guest Rx thread spinning, wasting CPU time.  Commit
ecf08d2dbb96d5a4b4bcc53a39e8d29cc8fef02e (xen-netback: reintroduce
guest Rx stall detection) made this even worse by removing a
cond_resched() from this path.

A fatal error is non-recoverable so just allow the guest Rx thread to
exit.  This requires taking additional refs to the task so the thread
exiting early is handled safely.

Signed-off-by: David Vrabel <david.vrabel@xxxxxxxxxx>

diff --git a/drivers/net/xen-netback/interface.c
b/drivers/net/xen-netback/interface.c
index 9259a73..037f74f 100644
--- a/drivers/net/xen-netback/interface.c
+++ b/drivers/net/xen-netback/interface.c
@@ -578,6 +578,7 @@ int xenvif_connect(struct xenvif_queue *queue,
unsigned long tx_ring_ref,
                goto err_rx_unbind;
        }
        queue->task = task;
+       get_task_struct(task);

        task = kthread_create(xenvif_dealloc_kthread,
                              (void *)queue, "%s-dealloc", queue->name);
@@ -634,6 +635,7 @@ void xenvif_disconnect(struct xenvif *vif)

                if (queue->task) {
                        kthread_stop(queue->task);
+                       put_task_struct(queue->task);
                        queue->task = NULL;
                }

diff --git a/drivers/net/xen-netback/netback.c
b/drivers/net/xen-netback/netback.c
index 908e65e..c8ce701 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -2109,8 +2109,7 @@ int xenvif_kthread_guest_rx(void *data)
                 */
                if (unlikely(vif->disabled && queue->id == 0)) {
                        xenvif_carrier_off(vif);
-                       xenvif_rx_queue_purge(queue);
-                       continue;
+                       break;
                }

                if (!skb_queue_empty(&queue->rx_queue))


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.