
[Xen-users] XEN+RBD(ceph)+qemu+Centos7 HV reboots during W2k12 cloned DomU startup



Hello everyone,


We have found a situation in which we can make a Xen hypervisor reboot itself. We may be hitting a bug, or we may be doing something wrong.


Summary: a Windows 2012 R2 x64 VM can cause the hypervisor to reboot itself while the VM is starting up.

Reproducible: yes, every time, across 6 hypervisors.


Hardware:


6 servers, Supermicro motherboard X9DRi-LN4+

BIOS 3.2, also tested after downgrading to 3.0

Each with two Intel X520-DA2 cards, running firmware 4.4.9 and 4.4.13; also downgraded to 2.3.x during testing

Dual Xeon E5-2620v2, E5-2640, or E5-2697v2

256 GB of DDR3 ECC registered RAM per hypervisor, all memtested

Dual power feeds to each server


Infrastructure:


Each server has a single 10g uplink to a network dedicated to Ceph

Each server has a single 10g uplink to a network that handles public Internet access as well as private traffic (using VXLAN in a separate VLAN)

VLANs are used to separate VXLAN traffic and public traffic

Each network card has two ports, even though each has a single uplink right now

On each card (Ceph network and public/private network), the two ports are bonded with LACP; again, only one uplink per card is connected

We use openvswitch for networking of the VMs, but we have also tried diagnosing this problem without openvswitch and just using Linux bridges

We’ve also tried networking without using LACP


Software:


Linux kernel 4.14 from ELRepo, also tried centos-xen kernel 4.9.x

CentOS 7.4

Xen version 4.9, but we’ve also tried as low as 4.6

Disk storage provided by Ceph Luminous 12.2.2

Windows VMs have Xen drivers version 8.1 (the newer version 8.2 for some reason would not work properly with Windows 2016)

IOMMU enabled and also tested with it disabled






Reboot issue occurs:


- During the start of a cloned, non-sysprepped copy of Windows 2012 R2 x64


Issue does NOT occur:


- With a sysprepped Windows 2012 R2 x64

- With a sysprepped or cloned Windows 10 Pro x64 or Windows 2016 Standard x64

- With a cloned Windows 2012 R2 x64 that has its networking disabled


What do I mean by cloned? A copy of an existing running VM whose disk is a clone made by Ceph, where we change only the IP addresses of the new copy. That means it has not gone through the sysprep process.
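
For readers unfamiliar with Ceph clones, here is a minimal sketch of the usual RBD snapshot, protect, clone sequence. This is an assumption about the general workflow, not a record of the exact commands used here; the pool, image, and snapshot names are hypothetical. The helper only prints the commands so they can be reviewed before being run against a cluster.

```shell
# Print the rbd commands that would create a clone of an existing image.
# Arguments (all hypothetical names): pool, source image, new image, snapshot.
print_clone_commands() {
    pool=$1; src=$2; dst=$3; snap=$4
    # 1. Snapshot the source image (crash-consistent if the VM is running).
    echo "rbd snap create $pool/$src@$snap"
    # 2. Protect the snapshot so that it can serve as a clone parent.
    echo "rbd snap protect $pool/$src@$snap"
    # 3. Create a copy-on-write clone for the new VM.
    echo "rbd clone $pool/$src@$snap $pool/$dst"
}

print_clone_commands vms existing-vm-disk cloned-vm-disk clone-base
```

The resulting image is a block-level copy of the source disk, so the guest keeps the same Windows installation identity as the original; nothing in this sequence performs a sysprep.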


Assumption: the issue is somehow related to the VM’s networking combined with a non-sysprepped Windows 2012 R2 x64.


Note: the original template of Windows 2012 R2 x64, as well as those of Windows 10 Pro and Windows 2016 Standard, were created on a hypervisor running on a Xeon E3-1270v2 instead of a Xeon E5.


Purpose:


We've been working on creating a new hypervisor model, with a newer Xen, a newer openvswitch, CentOS 7, and disk storage based on Ceph.


Previously, we were using local disk storage, centos-xen and CentOS 6.


The entire HV reboots itself (no local log entries, and nothing is sent to the syslog server either) when we deploy a Windows 2012 VM cloned from a previously created one. Sometimes this happens when deploying just one clone, sometimes when deploying several at the same time.


If we boot a sysprepped Windows 2012 R2 x64, the hypervisors don't reboot themselves, and Windows finishes its setup process and starts up fine. Yes, we should always use sysprepped images; however, customers can clone existing VMs, so the images won't always be sysprepped. Also, it is our opinion that no matter what happens inside a virtual machine, sysprepped or not, Linux or Windows, nothing in it should cause an entire hypervisor to reboot itself. That would be a major stability and security issue.


Again, the reboots do not occur if Xen is starting a virtual server that boots a sysprepped Windows 2012 R2 x64, or if we disable the network interfaces of a VM that would otherwise cause the server to reboot itself.


We added GRUB_CMDLINE_XEN="noreboot" so that the hypervisor would not reboot itself. The next time we provoked the issue, all we saw on the screen was:


BUG: unable to handle kernel paging request at ffff88088141d000


Nothing else.
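
For reference, a minimal sketch of how that option is set, assuming the stock CentOS 7 GRUB2 layout (/etc/default/grub, with the generated configuration at /boot/grub2/grub.cfg on a BIOS-boot system). Only noreboot is what we actually used. The commented-out variant is an untested assumption about what might capture more than a single line: it turns on verbose Xen logging and sends the console to the first serial port, and its port settings would need to match the hardware.

```shell
# /etc/default/grub (fragment)

# What we set: keep Xen from rebooting after a crash so the message stays on screen.
GRUB_CMDLINE_XEN="noreboot"

# Untested assumption: verbose hypervisor and guest logging on serial port com1.
# GRUB_CMDLINE_XEN="noreboot loglvl=all guest_loglvl=all com1=115200,8n1 console=com1,vga"

# After editing, regenerate the GRUB configuration and reboot the hypervisor:
# grub2-mkconfig -o /boot/grub2/grub.cfg
```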


This is happening to 6 test servers, not just one. We can reproduce it any time.


Any theories? Ideas? Something related to networking causes the entire machine to reboot itself, and it only happens with Windows 2012, not with other Windows versions and certainly not with Linux. We will gladly provide access to a test system.



Regards

James