
[Xen-devel] Enabling NLB is crashing VM's/DRBD


  • To: xen-devel@xxxxxxxxxxxxx
  • From: Greg Zapp <greg.zapp@xxxxxxxxx>
  • Date: Thu, 29 Nov 2012 13:12:50 +1300
  • Delivery-date: Thu, 29 Nov 2012 14:45:41 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xen.org>

Hi All,

We have a somewhat serious issue with NLB on Windows Server 2012 under Xen.  First, let me describe our environment, and then I'll explain what's going wrong.

Two Debian squeeze boxes running the latest provided AMD64 Xen kernel, with about 100GB of RAM.
These boxes are connected via InfiniBand, and DRBD runs over this link (IPoIB).
Each VPS runs on a mirrored DRBD device.
Each DRBD device sits on two logical volumes: one for data and one for metadata.
The hypervisors exclusively run Windows VMs (Server 2008 R2 and 2012).
The VMs use the GPLPV drivers (PCI, VBD, Net, etc.).
We are using network-bridge.

So here is the trouble.  We had somebody trying to set up Windows NLB.  Adding a host would cause the VM to freeze and also disconnect the DRBD devices.  Everything recovers, but the DRBD devices resync and a bunch of VMs on one side (the side with the VM that hangs) get rebooted by Xen.  Here is what we are seeing in messages:

eth0: port 3(nlb2.e0) entering disabled state
eth0: port 3(nlb2.e0) entering disabled state
frontend_changed: backend/vif/65/0: prepare for reconnect
device nlb.e0 entered promiscuous mode
block drbd29: sock was shut down by peer
block drbd29: peer( Secondary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
block drbd24: sock was shut down by peer
block drbd24: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
block drbd29: Creating new current UUID
block drbd30: sock was shut down by peer
block drbd30: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
.... and on and on and on with the DRBD disconnecting
block drbd29: md_sync_timer expired! Worker calls drbd_md_sync().
block drbd21: md_sync_timer expired! Worker calls drbd_md_sync().
.... lots of that
block drbd24: Terminating drbd24_asender
block drbd21: asender terminated
block drbd21: Terminating drbd21_asender
....
eth0: port 3(nlb2.e0) entering forwarding state
....
block drbd1: Handshake successful: Agreed network protocol version 91
block drbd1: conn( WFConnection -> WFReportParams )
block drbd38: Handshake successful: Agreed network protocol version 91
block drbd38: conn( WFConnection -> WFReportParams )
block drbd38: Starting asender thread (from drbd38_receiver [16250])
block drbd1: Starting asender thread (from drbd1_receiver [18278])
... Then lots of stuff for the DRBD devices reconnecting and syncing.
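
For anyone wanting to line up the bridge port events with the DRBD disconnects in a log like the one above, here is a minimal Python sketch. It assumes the kernel lines have the same shape as the excerpt (a timestamp prefix is fine); the function names `summarize` and `broken_after_port_down` are made up for illustration and are not part of any DRBD or Xen tooling.

```python
import re

# DRBD state-change lines, e.g.:
#   block drbd29: peer( Secondary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( ... )
DRBD_CONN = re.compile(r"block (drbd\d+): .*conn\( (\w+) -> (\w+) \)")

# Bridge port state lines, e.g.:
#   eth0: port 3(nlb2.e0) entering disabled state
BRIDGE_PORT = re.compile(r"(\w+): port (\d+)\(([\w.]+)\) entering (\w+) state")


def summarize(lines):
    """Return (bridge_events, drbd_transitions) in the order they appear."""
    bridge_events = []      # (port_name, new_state)
    drbd_transitions = []   # (device, old_conn_state, new_conn_state)
    for line in lines:
        m = BRIDGE_PORT.search(line)
        if m:
            bridge_events.append((m.group(3), m.group(4)))
            continue
        m = DRBD_CONN.search(line)
        if m:
            drbd_transitions.append((m.group(1), m.group(2), m.group(3)))
    return bridge_events, drbd_transitions


def broken_after_port_down(lines):
    """List DRBD devices that went to BrokenPipe after any port was disabled."""
    port_down = False
    broken = []
    for line in lines:
        m = BRIDGE_PORT.search(line)
        if m and m.group(4) == "disabled":
            port_down = True
            continue
        m = DRBD_CONN.search(line)
        if m and port_down and m.group(3) == "BrokenPipe" and m.group(1) not in broken:
            broken.append(m.group(1))
    return broken
```

Both functions take any iterable of lines, so an open log file can be passed directly (the path depends on the syslog setup and is not something the sketch assumes). This only shows ordering in the log; it does not prove that the port change caused the disconnects.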


This happened three times, each time while the user was attempting to add the second node to NLB.  In the lab on Server 2012 I can reproduce the network adapter dying (it becomes disabled and is unusable until reboot) unless I follow specific steps, but I cannot reproduce the DRBD disconnects.  I can get NLB working, but I'm mostly concerned about one person's ability to effectively crash 8 other VMs.  It looks like whatever is going on is somehow affecting my DRBD connection.  Has anyone seen anything like this before?


Thanks,
   Greg