
[Xen-devel] Enabling NLB is crashing VM's/DRBD

To: xen-devel@xxxxxxxxxxxxx
From: Greg Zapp <greg.zapp@xxxxxxxxx>
Date: Thu, 29 Nov 2012 13:12:50 +1300
Delivery-date: Thu, 29 Nov 2012 14:45:41 +0000
List-id: Xen developer discussion <xen-devel.lists.xen.org>

Hi All,

We have a somewhat serious issue with NLB on Windows Server 2012 and Xen. First, let me describe our environment, and then I'll explain what's going wrong.

2 x Debian squeeze boxes running the latest provided AMD64 Xen kernel, with about 100GB of RAM.
These boxes are connected via InfiniBand, and DRBD runs over this link (IPoIB).
Each VPS runs on a mirrored DRBD device.
Each DRBD device sits on 2 logical volumes: one for data and one for metadata.
The hypervisors exclusively run Windows VMs (Server 2008 R2 and 2012).
The VMs use the GPLPV drivers (PCI, VBD, Net, etc.).
We are using network-bridge.

So here is the trouble. Somebody was trying to set up Windows NLB. Adding a host would cause the VM to freeze and would also disconnect the DRBD devices. Everything recovers, but the DRBD devices resync and a bunch of VMs on one side (the side with the VM that hangs) get rebooted by Xen. Here is what we are seeing in messages:

eth0: port 3(nlb2.e0) entering disabled state
eth0: port 3(nlb2.e0) entering disabled state
frontend_changed: backend/vif/65/0: prepare for reconnect
device nlb.e0 entered promiscuous mode
block drbd29: sock was shut down by peer
block drbd29: peer( Secondary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
block drbd24: sock was shut down by peer
block drbd24: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
block drbd29: Creating new current UUID
block drbd30: sock was shut down by peer
block drbd30: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
.... and on and on and on with the DRBD disconnecting
block drbd29: md_sync_timer expired! Worker calls drbd_md_sync().
block drbd21: md_sync_timer expired! Worker calls drbd_md_sync().
.... lots of that
block drbd24: Terminating drbd24_asender
block drbd21: asender terminated
block drbd21: Terminating drbd21_asender
....
eth0: port 3(nlb2.e0) entering forwarding state
....
block drbd1: Handshake successful: Agreed network protocol version 91
block drbd1: conn( WFConnection -> WFReportParams )
block drbd38: Handshake successful: Agreed network protocol version 91
block drbd38: conn( WFConnection -> WFReportParams )
block drbd38: Starting asender thread (from drbd38_receiver [16250])
block drbd1: Starting asender thread (from drbd1_receiver [18278])
... Then lots of stuff for the DRBD devices reconnecting and syncing.
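
For anyone who wants to line up the DRBD drops against the bridge port events, here is a minimal Python sketch that pulls the conn( old -> new ) transitions out of log lines like the ones above and counts how many times each device went to BrokenPipe. The regular expression is an assumption based only on the line format shown in this excerpt, and the function names (conn_transitions, count_broken_pipes) are made up for illustration; other DRBD versions may log differently.

```python
import re
from collections import Counter

# Assumed line format, taken from the excerpt above, e.g.:
#   block drbd29: peer( Secondary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( ... )
# re.search is used so a syslog timestamp/host prefix does not matter.
CONN_RE = re.compile(r"block (drbd\d+): .*?conn\( (\w+) -> (\w+) \)")


def conn_transitions(lines):
    """Yield (device, old_state, new_state) for each DRBD conn() transition."""
    for line in lines:
        match = CONN_RE.search(line)
        if match:
            yield match.group(1), match.group(2), match.group(3)


def count_broken_pipes(lines):
    """Count, per device, how often the connection went to BrokenPipe."""
    return Counter(
        device
        for device, _old, new in conn_transitions(lines)
        if new == "BrokenPipe"
    )


# Sample lines copied from the log excerpt in this message.
SAMPLE = [
    "eth0: port 3(nlb2.e0) entering disabled state",
    "block drbd29: sock was shut down by peer",
    "block drbd29: peer( Secondary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )",
    "block drbd24: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )",
    "block drbd1: conn( WFConnection -> WFReportParams )",
]

if __name__ == "__main__":
    for device, count in sorted(count_broken_pipes(SAMPLE).items()):
        print(device, count)
```

Running the same functions over the full messages file (passing an open file object instead of SAMPLE) should show whether every DRBD device on the host dropped at the same moment the NLB port went disabled, or only some of them.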

This happened three times, each time when the user was attempting to add the second node to NLB. In the lab on Server 2012 I can reproduce the network adapter dying (it becomes disabled and is unusable until reboot) unless I follow specific steps, but I cannot reproduce the DRBD failure. I can get NLB working, but I'm mostly concerned about one person's ability to effectively crash 8 other VMs. It looks like whatever is going on is somehow affecting my DRBD connection. Has anyone seen anything like this before?

Thanks,
Greg

