
Re: [Xen-devel] Enabling NLB is crashing VM's/DRBD



On Thu, Nov 29, 2012 at 01:12:50PM +1300, Greg Zapp wrote:
>    Hi All,
>

Hello,
 
>    We have a somewhat serious issue around NLB on Windows 2012 and Xen.
>    First, let me describe our environment and then I'll let you know what's
>    wrong.
> 
>    2 x Debian squeeze boxes running the latest provided AMD64 Xen kernel,
>    with about 100GB of RAM.

You haven't provided enough information yet:

- What Xen version are you running? 
- What dom0 kernel version are you running? 
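Both answers can be collected on dom0 in one go. What follows is only a minimal sketch, assuming the dom0 kernel exposes the Xen version under /sys/hypervisor/version (not every kernel build does); the function names are made up for illustration, and the output of `xm info` or `xl info` is the more authoritative source.

```python
import platform
from pathlib import Path


def xen_version(sysfs_root="/sys/hypervisor"):
    """Return the Xen version as a string, or None if sysfs does not expose it."""
    version_dir = Path(sysfs_root) / "version"
    try:
        # "extra" usually carries its own leading dot, e.g. ".1"
        major, minor, extra = (
            (version_dir / name).read_text().strip()
            for name in ("major", "minor", "extra")
        )
    except OSError:
        return None
    return f"{major}.{minor}{extra}"


def dom0_kernel_version():
    """Return the running kernel release, the same value `uname -r` prints."""
    return platform.release()


if __name__ == "__main__":
    print("Xen version:", xen_version() or "unknown (check `xm info` as root)")
    print("dom0 kernel:", dom0_kernel_version())
```

Run it on the dom0 itself, not inside a guest, and paste the two lines into your reply.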


>    These boxes are connected via InfiniBand, and DRBD runs over this link
>    (IPoIB).
>    Each VPS runs on a mirrored DRBD device.
>    Each DRBD device sits on two logical volumes: one for data and one for
>    metadata.
>    The hypervisors exclusively run Windows VMs (Server 2008 R2 and 2012).
>    The VMs use the GPLPV drivers (PCI, VBD, Net, etc.).
>    We are using network-bridge.
> 
>    So here is the trouble.  Somebody was trying to set up Windows NLB.
>    Adding a host caused the VM to freeze and also disconnected the DRBD
>    devices.  Everything recovers, but the DRBD devices resync and a bunch
>    of VMs on one side (the side with the VM that hangs) get rebooted
>    by Xen.  Here is what we are seeing in messages:
> 
>    eth0: port 3(nlb2.e0) entering disabled state
>    eth0: port 3(nlb2.e0) entering disabled state
>    frontend_changed: backend/vif/65/0: prepare for reconnect
>    device nlb.e0 entered promiscuous mode
>    block drbd29: sock was shut down by peer
>    block drbd29: peer( Secondary -> Unknown ) conn( Connected -> BrokenPipe )
>    pdsk( UpToDate -> DUnknown )
>    block drbd24: sock was shut down by peer
>    block drbd24: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe )
>    pdsk( UpToDate -> DUnknown )
>    block drbd29: Creating new current UUID
>    block drbd30: sock was shut down by peer
>    block drbd30: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe )
>    pdsk( UpToDate -> DUnknown )
>    .... and on and on and on with the DRBD disconnecting
>    block drbd29: md_sync_timer expired! Worker calls drbd_md_sync().
>    block drbd21: md_sync_timer expired! Worker calls drbd_md_sync().
>    .... lots of that
>    block drbd24: Terminating drbd24_asender
>    block drbd21: asender terminated
>    block drbd21: Terminating drbd21_asender
>    ....
>    eth0: port 3(nlb2.e0) entering forwarding state
>    ....
>    block drbd1: Handshake successful: Agreed network protocol version 91
>    block drbd1: conn( WFConnection -> WFReportParams )
>    block drbd38: Handshake successful: Agreed network protocol version 91
>    block drbd38: conn( WFConnection -> WFReportParams )
>    block drbd38: Starting asender thread (from drbd38_receiver [16250])
>    block drbd1: Starting asender thread (from drbd1_receiver [18278])
>    ... Then lots of stuff for the DRBD devices reconnecting and syncing.
> 
>    This happened three times, each time while the user was attempting to add
>    the second node to NLB.  In the lab on Server 2012 I can reproduce the
>    network adapter dying (it becomes disabled and is unusable until reboot)
>    unless I follow specific steps, but I cannot reproduce the DRBD failure.
>    I can get NLB working, but I'm mostly concerned about one person's ability
>    to effectively crash 8 other VMs.  It looks like whatever is going on is
>    somehow affecting my DRBD connection.  Has anyone seen anything like this
>    before?
> 

Does it happen without GPLPV drivers? Try using plain Intel e1000 emulated NICs 
in the Windows VMs.

Are there any errors in the dom0 kernel dmesg? How about in the Xen dmesg?
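When going through the logs, it may help to reduce the DRBD noise to a per-device summary so the timing against the bridge port events is easier to see. Below is a rough sketch, assuming each DRBD state change is on a single line in the format quoted above; the function names are hypothetical and this is not part of any DRBD or Xen tool.

```python
import re
from collections import defaultdict

# Matches the device prefix, e.g. "block drbd29:"
DEVICE = re.compile(r"block (drbd\d+):")
# Matches state changes such as "conn( Connected -> BrokenPipe )"
TRANSITION = re.compile(r"(\w+)\(\s*(\w+)\s*->\s*(\w+)\s*\)")


def drbd_transitions(lines):
    """Map each DRBD device to the (field, old, new) transitions logged for it."""
    seen = defaultdict(list)
    for line in lines:
        dev = DEVICE.search(line)
        if dev is None:
            continue  # not a DRBD line (e.g. bridge port messages)
        for field, old, new in TRANSITION.findall(line):
            seen[dev.group(1)].append((field, old, new))
    return dict(seen)


def lost_connection(transitions):
    """Return the sorted names of devices whose conn state left 'Connected'."""
    return sorted(
        dev
        for dev, changes in transitions.items()
        if any(field == "conn" and old == "Connected" for field, old, _ in changes)
    )


if __name__ == "__main__":
    sample = [
        "eth0: port 3(nlb2.e0) entering disabled state",
        "block drbd29: peer( Secondary -> Unknown ) conn( Connected -> BrokenPipe )",
        "block drbd24: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe )",
        "block drbd1: conn( WFConnection -> WFReportParams )",
    ]
    print(lost_connection(drbd_transitions(sample)))
```

Feeding it the full messages file instead of the sample lines would show whether every DRBD device dropped at once, which would point at the IPoIB link or dom0 networking as a whole and not at a single resource.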


-- Pasi


