Re: [Xen-devel] Kernel 3.7.0-pre-rc1 kernel BUG at drivers/net/xen-netback/netback.c:405 RIP: e030:[<ffffffff814714f9>] [<ffffffff814714f9>] netbk_gop_frag_copy+0x379/0x380
On Mon, 2012-10-08 at 00:34 +0100, Konrad Rzeszutek Wilk wrote:
> On Sat, Oct 06, 2012 at 12:20:54AM +0200, Sander Eikelenboom wrote:
> >
> > Friday, October 5, 2012, 9:26:31 PM, you wrote:
> >
> > > Sorry for top posting - on mobile.
> >
> > > I saw it too yesterday but only on a specific hardware - AMD FX8. What
> > > type of CPU do you have? Does xsave=off on Xen line help?
> >
> > Nope the xsave=off doesn't help
> >
> > > Sander Eikelenboom <linux@xxxxxxxxxxxxxx> wrote:
> >
> > >>Hi Konrad,
> > >>
> > >>Just tested kernel 3.7.0-pre-rc1 but ran into an oops in netback on boot
> > >>after starting some guests:
> > >>
> > >>[ 402.723915] ------------[ cut here ]------------
> > >>[ 402.734629] kernel BUG at drivers/net/xen-netback/netback.c:405!
>
> Looking at the code, this is what we get:
>
>         /* Data must not cross a page boundary. */
>         BUG_ON(size + offset > PAGE_SIZE);
>
> Looking at the commits, the one recently added was:
> commit c571898ffc24a1768e1b2dabeac0fc7dd4c14601
> Author: Andres Lagar-Cavilla <andres@xxxxxxxxxxxxxxxx>
> Date:   Fri Sep 14 14:26:59 2012 +0000
>
>     xen/gndev: Xen backend support for paged out grant targets V4.
>
>
> But after reverting it and trying the kernel I still got the crash.
>
> So .. the weirdness is that this seems to be only happening on
> certain AMD machines - for example on my AMD A8 box I did not see this.

I took a look at this last week and can't reproduce it. The code which calls this function is supposed to ensure that the buffer doesn't cross a page boundary. There are two call sites. One loops over the skb's frags, which just can't cross page boundaries; if they did, things would be breaking left and right for everyone AFAICT (although I'm very behind on my LKML and netdev reading, so maybe they are ;-)). The other processes the skb's linear data area, which can cross a page boundary, but the code loops over it and processes it in chunks that each fit within a single page.
I was suspicious of this code, so I pulled it out into a little userspace test harness and fed it some corner cases, but it appeared to do the right thing.

My guess is that this is NIC-related rather than processor-related (perhaps there is some weak correlation between certain NICs and certain processor vendors). Konrad seems to have an r8169, but the module list wasn't in Sander's output; do you know which NIC you have?

> I fear that the next step is to do a bit of git bisection to
> get an idea of which merge it might be. I am going to be AFK
> on Monday so I won't get to this until Tuesday/Wednesday :-(
>
> .. Thought to help speed this process, this looks like a
> candidate:
>
> commit 229993001282e128a49a59ec43d255614775703a
> Merge: 7687b80 fd0f586
> Author: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
> Date:   Mon Oct 1 11:13:33 2012 -0700
>
>     Merge branch 'x86-mm-for-linus' of
>     git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
>
>     Pull x86/mm changes from Ingo Molnar:
>      "The biggest change is new TLB partial flushing code for AMD CPUs.
>       (The v3.6 kernel had the Intel CPU side code, see commits
>       e0ba94f14f74..effee4b9b3b.)

That would be interesting to try, although I don't think anything in this area actively messes with page table mappings (that happens later, and doesn't affect the non-data parts of the skb such as the sizes and offsets).

Perhaps this debug patch will shed some light? Compound pages (PG_compound) or transparent huge pages (THP) might be an interesting case.

Ian.

diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index 05593d8..ca4c47d 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -386,7 +386,7 @@ static struct netbk_rx_meta *get_next_rx_buffer(struct xenvif *vif,
  * Set up the grant operations for this fragment. If it's a flipping
  * interface, we also set up the unmap request from here.
  */
-static void netbk_gop_frag_copy(struct xenvif *vif, struct sk_buff *skb,
+static int netbk_gop_frag_copy(struct xenvif *vif, struct sk_buff *skb,
 				struct netrx_pending_operations *npo,
 				struct page *page, unsigned long size,
 				unsigned long offset, int *head)
@@ -402,7 +402,8 @@ static void netbk_gop_frag_copy(struct xenvif *vif, struct sk_buff *skb,
 	unsigned long bytes;
 
 	/* Data must not cross a page boundary. */
-	BUG_ON(size + offset > PAGE_SIZE);
+	if (size + offset > PAGE_SIZE)
+		return -1;
 
 	meta = npo->meta + npo->meta_prod - 1;
 
@@ -459,6 +460,7 @@ static void netbk_gop_frag_copy(struct xenvif *vif, struct sk_buff *skb,
 		*head = 0; /* There must be something in this buffer now. */
 
 	}
+	return 0;
 }
 
 /*
@@ -517,17 +519,31 @@ static int netbk_gop_skb(struct sk_buff *skb,
 		if (data + len > skb_tail_pointer(skb))
 			len = skb_tail_pointer(skb) - data;
 
-		netbk_gop_frag_copy(vif, skb, npo,
-				    virt_to_page(data), len, offset, &head);
+		if (netbk_gop_frag_copy(vif, skb, npo,
+				    virt_to_page(data), len, offset, &head) < 0) {
+printk(KERN_CRIT "netbk_gop_frag_copy failed: skb head %p-%p\n",
+       skb->data, skb_tail_pointer(skb));
+printk(KERN_CRIT "copying from %p-%p, offset %x, len %x\n",
+       data, data+len, offset, len);
+dump_page(virt_to_page(data));
+BUG();
+		}
 		data += len;
 	}
 
 	for (i = 0; i < nr_frags; i++) {
-		netbk_gop_frag_copy(vif, skb, npo,
+		if (netbk_gop_frag_copy(vif, skb, npo,
 				    skb_frag_page(&skb_shinfo(skb)->frags[i]),
 				    skb_frag_size(&skb_shinfo(skb)->frags[i]),
 				    skb_shinfo(skb)->frags[i].page_offset,
-				    &head);
+				    &head) < 0) {
+printk(KERN_CRIT "netbk_gop_frag_copy failed: skb frag %d page\n", i);
+printk(KERN_CRIT "copying from offset %x, len %x\n",
+       skb_shinfo(skb)->frags[i].page_offset,
+       skb_frag_size(&skb_shinfo(skb)->frags[i]));
+dump_page(skb_frag_page(&skb_shinfo(skb)->frags[i]));
+BUG();
+		}
 	}
 
 	return npo->meta_prod - old_meta_prod;

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel