
Re: [Xen-users] VGA Passthrough / Xen 4.2 / Linux 3.9.2


  • To: xen-users@xxxxxxxxxxxxx
  • From: Gordan Bobic <gordan@xxxxxxxxxx>
  • Date: Sun, 12 May 2013 12:19:29 +0100
  • Delivery-date: Sun, 12 May 2013 11:22:00 +0000
  • List-id: Xen user discussion <xen-users.lists.xen.org>

An update in bullet points (sadly without a solution).

* 4.1.x vs. 4.2.x
I tried to test the theory that there was something about Xen 4.1 that made VGA passthrough work better than in 4.2. I built 4.1.4 and it made no difference. Same problems, same symptoms, same BSODs.

* IRQ balancing
This partial workaround still seems to hold true for me - without noirqbalance in the dom0 kernel boot parameters, I generally cannot get as far as the login screen of the domU (I estimate it works less than 10% of the time). With noirqbalance set and the irqbalance service disabled, I can get that far every time after a fresh host reboot.

This, to me at least, implies some kind of an IRQ routing issue. Has anybody got any suggestions on how to troubleshoot this further and capture any further debug information?
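One low-tech way to start looking at IRQ routing would be to snapshot /proc/interrupts in dom0 before and after starting the domU and see which IRQ counters move, and on which CPUs. Below is a minimal Python sketch of that idea. The function names (parse_interrupts, diff_interrupts, watch) are made up for illustration, and it assumes the usual /proc/interrupts layout (a header row of CPUn columns, then one "label: counts... description" row per interrupt). It only shows where interrupts are being counted in dom0; it does not by itself prove or disprove a routing problem.

```python
import time


def parse_interrupts(text):
    """Return {irq_label: [count_per_cpu, ...]} from /proc/interrupts-style text."""
    lines = text.strip("\n").splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        counts = []
        for field in rest.split()[:ncpus]:
            if not field.isdigit():
                break  # reached the description columns
            counts.append(int(field))
        if counts:
            table[label.strip()] = counts
    return table


def diff_interrupts(before, after):
    """Per-CPU count deltas for every IRQ whose counters changed."""
    deltas = {}
    for irq, new in after.items():
        old = before.get(irq, [0] * len(new))
        if len(old) != len(new):
            continue  # row shape changed between snapshots; skip it
        delta = [n - o for n, o in zip(new, old)]
        if any(delta):
            deltas[irq] = delta
    return deltas


def watch(interval=5.0, path="/proc/interrupts"):
    """Take two snapshots `interval` seconds apart and return the deltas."""
    with open(path) as f:
        before = parse_interrupts(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_interrupts(f.read())
    return diff_interrupts(before, after)
```

Calling watch() in dom0 while the domU boots, once with and once without noirqbalance, would give two sets of deltas to compare. An IRQ whose counts hop between CPUs only in the failing case would be a candidate for a closer look.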

* Screen corruption sometimes preceding a crash
I attached a screenshot of the desktop after this happens to a bug report here:
http://xen.crc.id.au/bugs/view.php?id=10

To me this implies either a memory stomp (aperture alignment?) or in-flight data corruption on the PCIe bus that is specific to virtualization (because bare metal works fine). The idea of in-flight corruption is further corroborated by errors like these in the dom0 syslog:

May 12 11:37:04 normandy kernel: pcieport 0000:00:07.0: AER: Uncorrected (Non-Fatal) error received: id=0000
May 12 11:51:28 normandy kernel: pcieport 0000:00:07.0: AER: Multiple Uncorrected (Non-Fatal) error received: id=0000

Device 0000:00:07.0 is actually the host IOH PCIe bridge, behind which is the NF200 PCIe router, behind which is the passed-through ATI card. See lspci output attached to the bug report here:
http://xen.crc.id.au/bugs/view.php?id=10
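To line these AER messages up against crash times, it may help to pull them out of the dom0 syslog and count them per reporting device. The following is a rough Python sketch. The regular expression is built only from the two message shapes quoted above, so other kernel versions may word these lines differently, and the name summarize_aer is purely illustrative.

```python
import re
from collections import Counter

# Pattern derived from the two syslog lines quoted above (an assumption:
# other kernels may format AER messages differently).
AER_RE = re.compile(
    r"pcieport (?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): AER: "
    r"(?:Multiple )?"
    r"(?P<kind>Corrected|Uncorrected \((?:Non-Fatal|Fatal)\)) "
    r"error received"
)


def summarize_aer(log_text):
    """Count AER reports per (reporting device, error kind) in syslog text."""
    counts = Counter()
    for match in AER_RE.finditer(log_text):
        counts[(match.group("dev"), match.group("kind"))] += 1
    return counts
```

Feeding it the contents of the dom0 syslog after each crash would show whether the errors are always reported by the same bridge and whether the count grows with each run. Note that the device reporting the error is a port on the path to the card, not necessarily the device at fault.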

* ATI driver
I upgraded from 13.3 beta3 to 13.4 - no obvious difference in reliability.

* Nvidia Quadro
I have a Quadro 2000, which people have reported success with in the past after failing to get ATI cards to work. If anything, my results with the Quadro 2000 were actually worse. I have not, however, yet tested for perfect reproducibility with the Quadro card as I have with the ATI card (see next point - I will update on this later when I have had a chance to try it, hopefully today).

* Reproducibility
I can now reliably reproduce the domU GPU crash following a clean reboot. Fire up Steam, fire up Borderlands 2. Hit Play. Wait.
2K animation plays through.
Gearbox animation plays through.
Nvidia animation plays through.
Crash: blank screen, a flicker or two as the driver seemingly tries in vain to reset the GPU, AER errors in dom0 syslog, and a BSOD, as attached to this bug report:
http://xen.crc.id.au/bugs/view.php?id=9

What I am pondering now is how to capture all the PCIe traffic from the domU and from dom0, then retry the same thing on bare metal and look for a difference (unfortunately this involves analyzing GBs of captured PCIe traffic, and right now I'm not even sure how one might go about capturing this).

Has anybody got any suggestions at this point? Should I be taking this to the xen-devel list instead of xen-users?

There has to be a reasonably explainable, logically analyzable issue here, because the behaviour seems pretty consistent - hopefully consistent enough for debugging.

Gordan
