[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] Xen-unstable Linux 3.14-rc3 and 3.13 Network troubles "bisected"



Tuesday, March 25, 2014, 4:15:39 PM, you wrote:

> On Sat, Mar 22, 2014 at 07:28:34PM +0100, Sander Eikelenboom wrote:
> [...]
>> > Yes there is only one frag .. but it seems to be much larger than 
>> > PAGE_SIZE .. and xenvif_gop_frag_copy brakes that frag down into smaller 
>> > bits .. hence the calculation in xenvif_rx_action determining the slots 
>> > needed by doing:
>> 
>> >                 for (i = 0; i < nr_frags; i++) {
>> >                         unsigned int size;
>> >                         size = skb_frag_size(&skb_shinfo(skb)->frags[i]);
>> >                         max_slots_needed += DIV_ROUND_UP(size, PAGE_SIZE);
>> >                 }
>> 
>> > But the code in xenvif_gop_frag_copy .. seems to be needing one more slot 
>> > (from the emperical test) .. and calling "get_next_rx_buffer" seems 
>> > involved in that ..
>> 
>> Hmm looked again .. and it seems this is it .. when your frags are large 
>> enough you have the chance of running into this.
>> 

> get_next_rx_buffer is guarded by start_new_rx_buffer. Do you see any
> problem with that implementation?
In general no, but "get_next_rx_buffer" up's cons .. and the calculations done 
in "xenvif_rx_action" for max_slots_needed to prevent the overrun
don't count in this possibility. So it's not the guarding of 
"start_new_rx_buffer" that is at fault. It's the ones early in 
"xenvif_rx_action".
The ones that were changed by Paul's patch from MAX_SKB_FRAGS to a calculated 
value that should be a "slim fit".

The problem is in determining upfront in "xenvif_rx_action" when and how often 
the "get_next_rx_buffer" path will be taken.
Unless there are other non direct restrictions (from a size point of view) it 
can be called multiple times per frag per skb.

>> Problem is .. i don't see an easy fix, the "one more slot" of the empirical 
>> test doesn't seem to be the worst case either (i think):
>> 
>> - In my case the packets that hit this only have 1 frag, but i could have 
>> had more frags.
>>   I also think you can't rule out the possibility of doing the 
>> "get_next_rx_buffer" for multiple subsequent frags from one packet,
>>   so in the worst (and perhaps even from a single frag since it's looped 
>> over a split of it in what seems PAGE_SIZE pieces.)
>>   - So an exact calculation of how much slots we are going to need for 
>> hitting this "get_next_rx_buffer"  upfront in "xenvif_rx_action" seems 
>> unfeasible.
>>   - A worst case gamble seems impossible either .. if you take multiple 
>> frags * multiple times the "get_next_rx_buffer" ... you would probably be 
>> back at just
>>     setting the needed_slots to MAX_SKB_FRAGS.
>> 
>> - Other thing would be checking for the available slots before doing the  
>> "get_next_rx_buffer" .. how ever .. we don't account for how many slots we 
>> still need to
>>   just process the remaining frags.
>> 

> We've done a worst case estimation for whole SKB (linear area + all
> frags) already, at the very first beginning. That's what
> max_slots_needed is for.

>> - Just remove the whole "get_next_rx_buffer".. just tested it but it seems 
>> the "get_next_rx_buffer" is necessary ..  when i make start_new_rx_buffer 
>> always return false
>>   i hit the bug below :S
>> 
>> So the questions are ... is the "get_next_rx_buffer" part really necessary ?
>>    - if not, what is the benefit of the "get_next_rx_buffer" and does that 
>> outweigh the additional cost of accounting
>>      of needed_slots for the frags that have yet to be processed ?
>>    - if yes, erhmmm what would be the best acceptable solution .. returning 
>> to using MAX_SKB_FRAGS ?
>> 

> I think you need to answer several questions first:
> 1. is max_slots_needed actually large enough to cover whole SKB?
        No it's not if, you end up calling "get_next_rx_buffer" one or multiple 
times when processing the SKB
        you have the chance of overrunning (or be saved because prod gets upped 
before you overrun).

> 2. is the test of ring slot availability acurate?
        Seems to be.

> 3. is the number of ring slots consumed larger than max_slots_needed? (I
>    guess the answer is yes)
        Yes that was the whole point.

> 4. which step in the break-down procedure causes backend to overrun the
>    ring?
        The "get_next_rx_buffer" call .. that is not accounted for (which can 
be called
        multiple times per frag (unless some other none obvious size 
restriction limits this
        to one time per frag or one time per SKB).
        In my errorneous case it is called one time, but it would be nice if 
there would be some theoretical proof
        instead of just some emperical test.


> It doesn't matter how many frags your SKB has and how big a frag is. If
> every step is correct then you're fine. The code you're looking at
> (start_new_rx_buffer / get_next_rx_buffer and friend) needs to be
> carefully examined.

Well it seems it only calls "get_next_rx_buffer" in some special cases .. (and 
that also what i'm seeing because it doesn't happen
continously.

Question is shouldn't this max_needed_slots calc be reverted to what it was 
before 3.14 and take the time in 3.15 to figure out a
the theoretical max (if it can be calculated upfront) .. or another way to 
guarantee the code is able to process the whole SKB  ?

> Wei.



_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.