[Xen-devel] ADs over dom0 iSCSI = high page_count()
I've come across a disturbing page ref count situation and need some advice.
This only happens very rarely, when writing through ADs to iSCSI storage.
(My guess is that this is probably during a fragmented tcp retransmit.)
This is a Novell SLES10sp2 kernel with Xen 3.2, but all of the blkback and
netback code is the same as unstable.

 1. blkback :: maps the foreign page :: page_count() == 1
 2. blkback :: submits a bio with this foreign page
 3. iscsi_tcp :: makes a tcp request with this foreign page
 4. tcp :: gets twice, page_count() == 3
 5. tcp :: puts once, page_count() == 2
 6. tcp :: gets twice, page_count() == 4
 7. __gnttab_dma_map_page() sets page_mapcount() == 1
 8. tcp :: puts twice, page_count() == 2
 9. tcp :: done, but page_count() == 2, not 1
10. iscsi_tcp :: done, bio completes
11. blkback :: __end_block_io_op() calls fast_flush_area()

Page state at this point: page_count() == 2, page_mapcount() == 1.
BUT: page_count() should be 1 and page_mapcount() should be 0.

Perhaps these two counts are related, but I'm wondering if these might be two
separate issues. However, in all of my reproductions of this issue, the cases
where __gnttab_dma_map_page() gets called are the cases where the page_count()
ends up high.

QUESTION 1: Is the page_count() being high after leaving the tcp layer, when
the packets are fragmented, a known unsolved problem?

Looking at netback.c I see this comment in the read path, in net_rx_action():

    /* We can't rely on skb_release_data to release the pages used by
       fragments for us, since it tries to touch the pages in the fraglist.
       If we're in flipping mode, that doesn't work.  In copying mode, we
       still have access to all of the pages, and so it's safe to let
       release_data deal with it. */
    /* (Freeing the fragments is safe since we copy non-linear skbs
       destined for flipping interfaces) */

Also in netback.c, in net_tx_action_dealloc() after make_tx_response(), I see:

    /* Ready for next use. */
    gnttab_reset_grant_page()

Sure, this resets the page_mapcount() back to 0, but it also sets the
page_count() to 1 regardless of the current value.

QUESTION 2: Why does the page_count() have to be set to 1?

QUESTION 3: If the page_count() is known to be high by only 1 after leaving
the tcp layer (i.e. page_count() == 2 instead of 1), then wouldn't an
atomic_cmpxchg() be safer, or can the count be even higher?

I can add a call to gnttab_reset_grant_page() in blkback. However, we have
found legitimate cases where the page_count() is 2, such as when dhcpd is
sniffing for a release/renew while there are IOs in progress. Thus I'd like
more understanding before setting the page_count().
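To make QUESTION 3 concrete: as I read drivers/xen/core/gnttab.c, the reset
amounts to roughly the following (a simplified sketch, not a verbatim copy):

    /* Current behaviour as I understand it: unconditionally set
     * page_count() back to 1 and page_mapcount() back to 0. */
    void gnttab_reset_grant_page(struct page *page)
    {
        init_page_count(page);        /* atomic_set(&page->_count, 1)    */
        reset_page_mapcount(page);    /* atomic_set(&page->_mapcount, -1) */
    }

A hypothetical, more defensive variant along the lines of QUESTION 3 (only an
illustration of the idea, not tested code) would refuse to silently throw away
an unexpected extra reference:

    /* Hypothetical: only drop a known stray reference (2 -> 1); warn if the
     * count is anything other than 1 or 2, instead of forcing it to 1. */
    static void gnttab_reset_grant_page_checked(struct page *page)
    {
        if (atomic_cmpxchg(&page->_count, 2, 1) != 2)
            WARN_ON(page_count(page) != 1);
        reset_page_mapcount(page);
    }

Of course, as noted above, a page_count() of 2 can also be legitimate (the
dhcpd case), so even a cmpxchg-based reset is only safe if we know where the
extra reference came from.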
Thank you,
Joshua

PS: Below is a more detailed walk-through of the get_page()/put_page() calls
that were made, resulting in the page_count() being high.

PPS: The thread that originally discussed the dhcpd SEGV, when dhcpd loses the
race with blkback unmapping the page from dom0, is:
    Problem with PV disk and iSCSI
    http://lists.xensource.com/archives/html/xen-devel/2008-02/msg00330.html

================================================================
================================================================
================================================================

blkback maps the foreign page, page_count() == 1

GetPage_Trace [ffff8800087ba6c0] (1) G 1 0
    | 562 /srcTrees/na_main/nex.bk/linux/net/ipv4/tcp.c
      do_tcp_sendpages() !can_coalesce

GetPage_Trace [ffff8800087ba6c0] (2) G 2 0
    | 1576 /srcTrees/na_main/nex.bk/linux/net/core/skbuff.c
      skb_split_no_header() pos < len
        /* Split frag.
         * We have two variants in this case:
         * 1. Move all the frag to the second
         *    part, if it is possible. F.e.
         *    this approach is mandatory for TUX,
         *    where splitting is expensive.
         * 2. Split is accurately. We make this.
         */
    | 1134 /srcTrees/na_main/nex.bk/linux/net/ipv4/tcp_output.c
      tcp_write_xmit() calls tso_fragment(), which eventually calls
      skb_split_no_header()

PutPage_Trace [ffff8800087ba6c0] (3) P 3 0
    | 281 /srcTrees/na_main/nex.bk/linux/net/core/skbuff.c
      skb_release_data() for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
    | 462 /srcTrees/na_main/nex.bk/linux/include/net/sock.h
      sk_stream_free_skb() calls __kfree_skb(), which eventually calls
      skb_release_data()

??? second put_page() seems to be missing ???

================ ??? retransmit maybe ??? ================

GetPage_Trace [ffff8800087ba6c0] (4) G 2 0
    | 562 /srcTrees/na_main/nex.bk/linux/net/ipv4/tcp.c
      do_tcp_sendpages() !can_coalesce

GetPage_Trace [ffff8800087ba6c0] (5) G 3 0
    | 1576 /srcTrees/na_main/nex.bk/linux/net/core/skbuff.c
      skb_split_no_header() pos < len
        /* Split frag.
         * We have two variants in this case:
         * 1. Move all the frag to the second
         *    part, if it is possible. F.e.
         *    this approach is mandatory for TUX,
         *    where splitting is expensive.
         * 2. Split is accurately. We make this.
         */
    | 1134 /srcTrees/na_main/nex.bk/linux/net/ipv4/tcp_output.c
      tcp_write_xmit() calls tso_fragment(), which eventually calls
      skb_split_no_header()

dma_map_single()
    swiotlb_map_single()
        gnttab_dma_map_page()
            __gnttab_dma_map_page()
    In drivers/xen/core/gnttab.c, page->_mapcount gets set (not an increment,
    more like a flag). Sometimes this gets called multiple times for the same
    page.

PutPage_Trace [ffff8800087ba6c0] (6) P 4 1
    | 281 /srcTrees/na_main/nex.bk/linux/net/core/skbuff.c
      skb_release_data() for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
    | 462 /srcTrees/na_main/nex.bk/linux/include/net/sock.h
      sk_stream_free_skb() calls __kfree_skb(), which eventually calls
      skb_release_data()

PutPage_Trace [ffff8800087ba6c0] (7) P 3 1
    | 281 /srcTrees/na_main/nex.bk/linux/net/core/skbuff.c
      skb_release_data() for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
    | 462 /srcTrees/na_main/nex.bk/linux/include/net/sock.h
      sk_stream_free_skb() calls __kfree_skb(), which eventually calls
      skb_release_data()

================================================================
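For anyone trying to read the trace format above: GetPage_Trace / PutPage_Trace
are local debug wrappers around get_page() / put_page(); they are not upstream
functions. A simplified sketch of that kind of wrapper is below (illustration
only, the real instrumentation differs in detail). The two numbers after the
G/P are page_count() and page_mapcount() at the time of the call, the "(n)" is
an event counter, and the "| line file" part is the call site.

    /* Simplified sketch of get_page()/put_page() tracing wrappers.
     * Illustrative only -- shows where the numbers in the trace come from. */
    #include <linux/kernel.h>
    #include <linux/mm.h>

    static atomic_t page_trace_seq = ATOMIC_INIT(0);

    #define GetPage_Trace(pg)                                              \
        do {                                                               \
            printk(KERN_DEBUG "GetPage_Trace [%p] (%d) G %d %d | %d %s\n", \
                   (pg), atomic_inc_return(&page_trace_seq),               \
                   page_count(pg), page_mapcount(pg),                      \
                   __LINE__, __FILE__);                                    \
            get_page(pg);                                                  \
        } while (0)

    #define PutPage_Trace(pg)                                              \
        do {                                                               \
            printk(KERN_DEBUG "PutPage_Trace [%p] (%d) P %d %d | %d %s\n", \
                   (pg), atomic_inc_return(&page_trace_seq),               \
                   page_count(pg), page_mapcount(pg),                      \
                   __LINE__, __FILE__);                                    \
            put_page(pg);                                                  \
        } while (0)

The sketch only prints the immediate call site; the trace above also records
the higher-level caller for context.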
Joshua Nicolas
Virtual Iron Software, Inc.