
Re: [Xen-devel] Some questions regarding QEMU, UEFI, PCI/VGA Passthrough, and other things

On 06/12/14 00:45, Zir Blazer wrote:

Replying somewhat out of order:

> I hope someone finds my questions interesting to answer.

It is certainly an interesting read - I will answer what I can.

> While I am not a developer myself (I have always been bad at reading and
> writing code), there are several capabilities of Xen and its supporting
> Software whose progress I'm always interested in, more out of
> curiosity than anything else. However, documentation usually seems to
> lag well behind what is currently implemented in code, and sometimes you
> catch a mail here with some useful data regarding a topic but later you don't
> hear about it any more, either missing any progress or because the whole
> topic was inconclusive. So, this mail is pretty much a compilation of small
> questions about things I came across that didn't pop up again later, but
> which may prompt some brainstorming, which is why I believe it to be more
> useful for xen-devel than xen-users.

This is indeed more appropriate for xen-devel.  Documentation is
certainly something we are poor at.  The monthly documentation days are
helping to counter this, but it is slow going.

>    QEMU
> Because as a VGA Passthrough user I'm currently forced to use
> qemu-xen-traditional (though I hear that some users have had success using
> qemu-xen in Xen 4.4, I myself didn't have any luck with it), I'm stuck
> with an old QEMU version. However, looking at the changelogs of the latest
> versions I always see some interesting features which, as far as I know, Xen
> doesn't currently incorporate.
> 1a - One of the things that newer QEMU versions seem to be capable of doing
> is emulating the much newer Intel Q35 Chipset, instead of only the current
> 440FX from the P5 Pentium era. Some data on Q35 emulation here:
> www.linux-kvm.org/wiki/images/0/06/2012-forum-Q35.pdf
> wiki.qemu.org/Features/Q35
> I'm aware that newer doesn't necessarily mean better, especially because the
> practical advantages of Q35 vs 440FX aren't very clear. There are several new
> emulated features like an AHCI Controller and a PCIe Bus, which sound
> interesting on paper, but I don't know if they add any useful feature or
> increase performance/compatibility. Some comments I read about the matter
> wrongly stated that Q35 would be needed to do PCIe Passthrough, but this is
> currently possible on 440FX, though I don't know about the low level
> implementation differences. I think most of the idea behind Q35 is to make the
> VM look more like real Hardware, instead of looking like a ridiculously
> obvious emulated platform.
> In the case of the AHCI Controller, I suppose that the OS would need to
> include Drivers for the controller at installation time, which if I
> recall correctly both Windows Vista/7/8 and Linux should have, though for a
> Windows XP install the Q35 AHCI Controller Drivers would probably need to
> be slipstreamed with nLite into an install ISO for it to work.

Qemu traditional with PCI passthrough of a PCIe device makes a PCI
topology which couldn't possibly work electrically speaking.  It ends up
with a PCIe device on a PCI bus with other PCI devices.  It works well
enough because operating systems have to cope with completely bogus
firmware information.

Q35 is certainly newer, and offers a different set of devices which will
be far more commonly found in more modern systems.  Whether this
constitutes "better" is purely subjective.

> A curious thing is that if I check /sys/kernel/iommu_groups/ as stated on the
> blog I find the folder empty (This is on Dom0, with a DomU with 2
> passed-through devices). I suppose it may be VFIO exclusive or something.
> Point is, after some googling I couldn't find a way to check for IOMMU
> groups, though Xen doesn't seem to manage them anyway. I think that it may
> be useful to get a layout of IOMMU groups to at least identify if passthrough
> issues could be related to that. Can anyone imagine current scenarios where
> this may break something or limit possible passthrough, why my IOMMU
> groups listing is empty, and how to get such a list?

Xen has no concept of iommu_groups, and the dom0 kernel doesn't have
blanket permissions to poke around in the system topology, which is
probably why the dom0 kernel doesn't list any.  (In particular, dom0
can't see the ACPI DMAR table as Xen hides it from dom0.)  Try booting
your dom0 kernel as native and seeing whether the groups become populated.
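
Once a kernel does expose the groups, they can be read straight out of
sysfs.  A minimal Python sketch, assuming the usual
/sys/kernel/iommu_groups/<N>/devices/<BDF> layout; the function name
list_iommu_groups is made up for illustration and is not part of any Xen
or kernel tool:

```python
import os


def list_iommu_groups(root="/sys/kernel/iommu_groups"):
    """Return {group_name: [device addresses]} for each IOMMU group under root.

    An empty dict means the kernel exposed no groups (or root does not exist),
    which is what the Xen dom0 described above would report.
    """
    groups = {}
    if not os.path.isdir(root):
        return groups
    # Group directories are named "0", "1", ... so sort them numerically
    # by comparing length first and then the string itself.
    for group in sorted(os.listdir(root), key=lambda g: (len(g), g)):
        devices_dir = os.path.join(root, group, "devices")
        if os.path.isdir(devices_dir):
            groups[group] = sorted(os.listdir(devices_dir))
    return groups


if __name__ == "__main__":
    found = list_iommu_groups()
    if not found:
        print("no IOMMU groups exposed by this kernel")
    for group, devices in found.items():
        print("group %s: %s" % (group, " ".join(devices)))
```

Run on a native boot of the same kernel, this should show whether the
groups are populated there.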

Having read up on iommu_groups, they are a concept Xen should gain, and
"passing through a device" would turn into "passing through an
iommu_group".  Currently, the toolstack (and by implication, admin) can
set up anything it wants, even when it doesn't make sense in the
slightest, and the result is a subtly or completely broken device.
There are also several errata which would cause lots of PCIe devices to
consolidate into the same iommu_group.

IOMMU groups certainly wouldn't fix all passthrough issues, but they would
remove one avenue for setting up known-invalid configurations.
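
As an illustration of the kind of check a group-aware toolstack could
make, here is a small Python sketch.  It is hypothetical and not
existing Xen code: the function name, the "host" placeholder and the
data layout are all assumptions, and it deliberately ignores details
such as bridges that legitimately stay with the host.

```python
def check_assignment(groups, assignment):
    """Return a list of problems with a proposed passthrough assignment.

    groups:     {group_id: [device, ...]}, e.g. as read from sysfs
    assignment: {device: domain_name}; devices not listed stay with the host
    """
    problems = []
    for group_id, devices in sorted(groups.items()):
        # Every device in one group must end up with the same owner,
        # otherwise the IOMMU cannot isolate them from each other.
        owners = {assignment.get(dev, "host") for dev in devices}
        if len(owners) > 1:
            problems.append(
                "group %s is split between %s"
                % (group_id, ", ".join(sorted(owners)))
            )
    return problems


if __name__ == "__main__":
    groups = {"1": ["0000:01:00.0", "0000:01:00.1"], "2": ["0000:03:00.0"]}
    # Passing through only the GPU function and not its audio function
    # splits group 1 between the guest and the host.
    for problem in check_assignment(groups, {"0000:01:00.0": "domU-a"}):
        print(problem)
```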

> 1b - Another experimental feature that recently popped in QEMU is IOMMU 
> emulation. Info here:
> www.mulix.org/pubs/iommu/viommu.pdf
> www.linux-kvm.org/wiki/images/4/4a/2010-forum-joro-pv-iommu.pdf
> IOMMU emulation usefulness seems to be so you can do PCI Passthrough in a
> Nested Virtualization environment. At first sight this looked a bit useless,
> because using a DomU to do PCI Passthrough with an emulated IOMMU sounds like
> too much overhead if you can simply emulate that device in the nested DomU.
> However, I also read about the possibility of Xen using Hardware
> virtualization for Dom0 instead of it being Paravirtualized. In that case,
> would it be possible to provide the IOMMU emulation layer to Dom0 so you
> could do PCI Passthrough on platforms without proper support for it? It seems
> a rather interesting idea.
> I think it would also be useful as a standardized debug platform for
> IOMMU virtualization and passthrough, because some years ago missing or
> malformed ACPI DMAR/IVRS tables were all over the place and getting IOMMU
> virtualization working was pretty much random luck and at the mercy of the
> goodwill of the Motherboard maker to fix their BIOSes.

IOMMU emulation without IOMMU hardware can only possibly work in
combination with completely emulated devices.

IOMMU emulation in combination with IOMMU hardware could be made to work
if Xen changes its current model of only having a single IOMMU root per

The IOMMU architecture is basically just some sets of pagetables, and
each device gets a "cr3".  Currently, Xen has one single set of
pagetables for each domain needing the IOMMU, and every device assigned
to that domain gets the same set of tables.  It is perfectly possible to
have each device assigned to a domain using a different set of tables,
for intra-vm isolation, or nested PCI passthrough, but this would
require a change in Xen's interface (and a reasonably large quantity of

>    UEFI for DomUs
> I managed to get this one working, but it seems to need some clarifications 
> here and there.
> 2a - As far as I know, if you add --enable-ovmf to ./configure before
> building Xen, it downloads and builds some extra code from an OVMF repository
> which Xen maintains, though I don't know if it's a snapshot of whatever the
> edk2 repository had at that time, or if it includes custom patches for
> the OVMF Firmware to work in Xen. Xen also has another ./configure option,
> --with-system-ovmf, which is supposed to be used to specify a path to
> an OVMF Firmware binary. However, when I tried that option some months ago, I
> never managed to get it working, either using a package with a precompiled
> ovmf.bin from the Arch Linux User Repository, or using another package with
> the source to compile it myself. Both binaries worked with standalone QEMU,
> though.
> Besides the fact that the parameter itself was quite hidden, there is
> absolutely no info regarding whether the provided OVMF binary has to comply
> with some special requirements, be it custom patches for OVMF so it works
> with Xen, or whether it has to be a binary that only includes TianoCore, or
> the unified one that includes the NVRAM in a single file. In Arch Linux, for
> the Xen 4.4 package, the maintainer decided that the way to go for including
> OVMF support in Xen was to use --enable-ovmf, because at least it was
> possible to get it working with some available patches. However, for both
> download and build times, it would be better to simply distribute a working
> binary. Any ideas why --with-system-ovmf didn't work for us?
> 2b - On successful Xen builds with OVMF support, something which I looked for
> is the actual ovmf.bin file. So far, the only thing which I noticed is that
> the hvmloader is 2 MiB bigger than on non-OVMF builds. Is there any reason
> why OVMF is built into the hvmloader, unlike the other
> Firmware binaries, which are usually sitting in a directory as standalone
> files?

(answering 2a and 2b together)

ovmf is currently unconditionally compiled into hvmloader, which is why
it gets 2MB bigger.  I believe --with-system-ovmf= (and
--with-system-seabios= for that matter) needs the system ovmf available in
the build environment to be linked into hvmloader.

For a separate project, I have a use case for hvmloader itself being a
multiboot image.  This would allow the use of multiboot modules, which
would be far more flexible than compiling all the binaries into
hvmloader itself.  In particular, when a system qemu updates its system
seabios/ovmf, hvmloader could use the updated bioses rather than the
linked bioses.

> 2c - Something which I'm aware of is that an OVMF binary can be in two
> formats: a unified binary that has both OVMF and NVRAM, or an OVMF binary
> with a separate NVRAM (1.87 MiB + 128 KiB respectively). According to what I
> read about using OVMF with QEMU, it seems that if using a unified binary, you
> need one per VM, because the NVRAM content is different. I suppose that with
> the second option you have one OVMF Firmware binary and a 128 KiB NVRAM per
> UEFI VM. How does Xen handle this? If I recall correctly, I heard that it is
> currently volatile (NVRAM contents aren't saved on DomU shutdown).

Currently nothing is saved.  With multiboot modules, and in particular
separate multiboot modules for the main OVMF binary and a small nvram,
it would be possible to specify "nvram = /path/to/nvram.bin" in your
vm.cfg and gain proper nvram which persists across reboot.

> 2d - Is there any recorded experience or info regarding how a UEFI DomU would 
> behave with something like, say, Windows 8 with Fast Boot, or other UEFI 
> features for native systems? This is more of a "what if..." scenario than
> something that I could really use.

I believe Anthony has managed to get this working with a Xenified OVMF?

>    PCI/VGA Passthrough
> It was four years ago when I learned about IOMMU virtualization making
> gaming in a VM possible via VGA Passthrough (the first time I heard about it
> was in some of Teo En Ming's videos on Youtube), something which was quite
> experimental back at that time. Even currently, the only other Hypervisor or
> VMM that can compete with Xen in this area is QEMU with KVM VFIO, which also
> has decent VGA Passthrough capabilities. While I'm aware that Xen is pretty
> much enterprise oriented, it was also the first to allow a power user to make
> a system based on Xen as Hypervisor and everything else virtualized, getting
> nearly all the functionality of running native with the flexibility that
> virtualization offers, at the cost of some overhead, quirks and complexity
> of usage. It's a pain to configure the first time, but if you manage to get
> it working, it's wonderful. So far, this feature has created a small niche of
> power users that use either Xen or QEMU KVM VFIO for virtualized gaming, and
> I consider VGA Passthrough quite a major feature because it is what allows
> such setups in the first place.

I wouldn’t necessarily say that Xen is specifically enterprise
orientated.  However, Xen is certainly harder to set up and use than
alternatives, which does raise the bar to start using it.

> 3a - In some of the Threads of the original guides I read about how to use
> Xen to do VGA Passthrough, you usually see the author and other users saying
> that they didn't manage to get VGA Passthrough working on newer versions.
> This usually affected people who were doing the migration from the xm to xl
> toolstack, but also between some Xen versions (I reported a regression on Xen
> 4.4 vs a fully working 4.3). Passthrough compatibility previously used to be
> a Hardware-related pain because it was extremely Motherboard and BIOS
> dependent in an era where consumer Motherboard makers didn't pay attention to
> the IOMMU, but at least on the Intel Haswell platform support for IOMMU is
> starting to get more mainstream.
> Considering that PCI/VGA Passthrough compatibility with a system or
> regressions of it between Xen versions is pretty much hit-or-miss, would it
> be possible to do something to get this feature under control? It seems like
> this isn't deeply tested, or at least not with too many variables involved
> (Hard to do, because there are A LOT). I believe that it should be possible to
> have a few systems at hand which are known to work and representative of a
> Hardware platform tested against regressions with different Video Cards, but
> it sounds extremely time consuming to switch cards, reboot, test with
> different DomU OSes/Drivers, etc. At the moment, once you get a
> Computer/Distribution/Kernel/Xen/Toolstack/DomU OS/Drivers combination that
> works, you simply stick to it, so many early adopters of VGA Passthrough are
> still using extremely outdated versions. Even worse, for users of versions
> like 4.2 with xm, if they want to upgrade to 4.4 with xl and want to figure
> out why it doesn't work, it will be a royal pain in the butt to figure out
> which patch broke compatibility for them, so those early
> adopters are pretty much out of luck if they have to go through years worth
> of code and version testing.

PCI Passthrough is in an awkward position.  I am not aware of any
dedicated testing that the stable/master branches get, and it is
surprisingly difficult to automate.  It would certainly be nice for
passthrough to get some form of dedicated testing, but currently the
best we have is users like yourself complaining when it breaks.  This is
certainly a situation which needs improving.

In XenServer, we support passthrough in a very restricted set of
circumstances, because there are simply too many system quirks (that we
know about, let alone those we don't) for us to be comfortable
supporting it in general.  Furthermore, our testing only covers the
version of Xen we are using in trunk, which is generally the latest stable.

> 3b - Does someone know what the actual difference is on Intel platforms
> regarding VT-d support? As far as I know, the VT-d specification allows for
> multiple "DMA Remapping Engines", of which a Haswell Processor has two, one
> for its Integrated PCIe Controller and another for the Integrated GPU. You
> also have Chipsets, some of which according to Intel Ark support VT-d (Which
> I believe should be in the form of a third DMA Remapping Engine), like the
> Q87 and C226, and those that don't like the H87 and Z87. Based on working
> samples I have been led to believe that a Processor supporting VT-d will
> provide the IOMMU capabilities for the devices connected to its own PCIe
> Slots regardless of what Chipset you're using (That's the reason why you can
> do Passthrough with only Processor VT-d support). I would believe the same
> holds true with a VT-d Chipset with a non VT-d Processor, though I haven't
> seen any working example of this.
> When I was researching this one year ago, Supermicro support said this
> to me:
> Since Z87 chipset does not support VT-d,  onboard LAN will not support it 
> either because it is connected to PCH PCIe port.  One workaround is to use a 
> VT-d enabled PCIe device and plug it into CPU based PCIe-port on board.  
> Along with a VT-d enabled CPU the above workaround should work per Intel.
> Based on this, there should be a not-very-well-documented quirk. The most
> common configuration for VGA Passthrough users is a VT-d supporting Processor
> with a consumer Motherboard, so basically, if you have a VT-d supporting
> Processor like a Core i7 4790K, you can do Passthrough of the devices
> connected to the Processor PCIe Slots, and also of the ones connected to the
> Chipset if you apply that workaround (I don't know what "VT-d enabled
> PCIe device" means exactly). I recall seeing some people using VMWare ESXi
> commenting that they couldn't passthrough the integrated NIC even though
> a RAID Controller connected to the Processor could be in such setups. I don't
> have a link at hand about the matter, but I believe it is relevant to the
> question.
> Considering that with the workaround you would be using the Processor DMA
> Remapping Engine for Chipset devices, is there any potential bottleneck or
> performance degradation there?

The only reasonable interpretation that stands a chance of working is a
PCIe device with an IOMMU on it, but I am not aware of any such device,
or whether it would actually work.

It is certainly possible to have more than one IOMMU.  Servers typically
have one per socket and one for the chipset.  This doesn't necessarily
mean that all devices are covered by IOMMUs.

> 3c - There is a feature that enhances VT-d called ACS (Access Control
> Services), related to IOMMU groups isolation. This feature seems to be
> excluded from consumer platforms, and support for it seems to already be on
> the Xen wishlist based on comments. Info here:
> vfio.blogspot.com.ar/2014/08/iommu-groups-inside-and-out.html
> comments.gmane.org/gmane.comp.emulators.xen.devel/212561

ACS is required to fix issues caused by optimisations permitted under the
PCIe spec, which are invalid in combination with an IOMMU.  The main one is
peer-to-peer DMA which permits a switch to complete peer-to-peer traffic
without forwarding it upstream.  This is wrong between two devices with
different IOMMU mappings, and ACS provides an override to say "forward
everything upstream - the IOMMU will make it go in different directions".

Presence or lack of ACS certainly does affect whether devices behind a
PCIe switch can safely be isolated into different IOMMU contexts.
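
The difference can be pictured with a deliberately simplified toy model
in Python.  The function names and the hop labels are invented for
illustration; real switches make this decision per port and per request
type:

```python
def route_dma(src_port, dst_port, acs_enabled):
    """Return the hops a DMA takes between two ports below one PCIe switch.

    Toy model: without ACS the switch may complete peer-to-peer traffic
    itself; with ACS every request is forwarded upstream, so the IOMMU
    gets to translate (or block) it before it comes back down.
    """
    if acs_enabled:
        return [src_port, "switch", "root complex + IOMMU", "switch", dst_port]
    return [src_port, "switch", dst_port]


def iommu_sees_transfer(src_port, dst_port, acs_enabled):
    """True if the transfer passes through the IOMMU and can be policed."""
    return "root complex + IOMMU" in route_dma(src_port, dst_port, acs_enabled)


if __name__ == "__main__":
    for acs in (False, True):
        hops = route_dma("port 1", "port 2", acs)
        print("ACS %s: %s" % ("on" if acs else "off", " -> ".join(hops)))
```

In the ACS-off case the two devices can reach each other without the
IOMMU ever seeing the traffic, which is why they cannot safely be given
different mappings.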

> 3d - The new SATA Express and M.2 connectors combine SATA and some PCI
> Express lanes on the same connector. Depending on implementation, the PCI
> Express lanes could come from either the Chipset or the Processor.
> Considering that some people like to passthrough the entire SATA Controller,
> how does it interact with this frankenstein connector with the PCIe lanes
> coming from elsewhere? I'm curious.

No idea, but I suspect it would appear as a different device, separate
to the SATA controller.

>    Miscellaneous Virtualization stuff
> 4a - There are several instances where Software tries to check if it
> is running in a virtualized environment or not. Examples which I recall
> having read about are some malware, which tries to hide if it detects that it
> is running virtualized (because it means that it is not your exploitable
> Average Joe computer), or according to some comments I read, some Drivers
> like those of NVIDIA, to force you to use a Quadro for VGA Passthrough
> instead of a consumer GeForce. Is the goal of virtualization to reproduce the
> exact behavior in a VM of a system running native, or just to be functionally
> equivalent? I ask because as more Software appears that makes a distinction
> between native and a VM, it seems that in the end it will force VMs to
> look and behave like a native system to maintain compatibility. While
> currently such Software is pretty much a specific niche, the possibility
> exists that it becomes a trend with the growing popularity of the cloud.
> For example, one of the things that pretty much tells the whole story is
> the 440FX Chipset, because if you see that Chipset running anything but a P5
> Pentium, you know you're running either emulated or virtualized. Also, if I
> use an application like CPU-Z, it says that the BIOS Brand is Xen, Version
> 4.3.3, which also makes it obvious that the system is inside a VM. I
> think that based on the rare but existing pieces of Software that attempt to
> check if they are running in a VM or not to decide behavior, at some point in
> time a part of the virtualization segment will be playing a catch-up game to
> mask being a VM from these types of applications. I suppose that a possible
> endgame for this topic would be a VM that tries to represent as
> accurately as possible the PCI Layout of a commercial Chipset (Which I
> believe was one of the aims of QEMU Q35 emulation), deliberately lying about
> and/or masking the Processor CPUID data, BIOS vendor, and other recognizable
> things, to try to match what you would expect from a native system of that
> Hardware generation.
> This point could be questionable, because making a perfect VM that is
> indistinguishable from a native system could harm some vendors that may rely
> on identifying whether their software is running in a VM or not for enforcing
> licensing and the like.

I would go so far as to say that the majority of people using
virtualisation want something which works (for varying definitions of
'works'), and is as fast as possible.  Making an HVM guest
indistinguishable from a real computer is a very difficult task, and one
which I don't believe is practical to achieve.  An OS which is really
trying to identify a virtualised environment can even make a guess by
timing certain operations which would vmexit for emulation purposes.

> 4b - The only feature which I feel that Xen is missing from a home user
> perspective is sound. As far as I know you can currently tell QEMU to
> emulate a Sound Card in a DomU, but there is no way to easily get the sound
> out of a DomU like other VMMs do. Some of the solutions I saw relied on
> either multiple passed-through Sound Cards, or a PulseAudio Server adding
> massive sound latency. While Xen is enterprise oriented, where sound is
> unneeded, I recall hearing that this feature was being considered, but
> haven't seen any mention of it for months. How hard or complex would it be
> to add sound support to Xen? Has the way to do it been decided? Could it take
> the form of using Dom0 Drivers for the Sound Card to act as a mixer and some
> PV Drivers for the DomU like the ones currently available for the NIC and
> storage?

Sorry, I don't have any useful input here, other than "that would be nice".

