[Xen-merge] FW: vmware's virtual machine interface
Folks, there's been some discussion about the VMI interface proposal between myself and Linus/Andrew. I've appended my latest reply. As regards the VMI proposal itself, I don't think I can forward it, so if you don't have it you'd better ask Pratap Subrahmanyam [pratap@xxxxxxxxxx] for it directly.

Cheers,
Ian

-----Original Message-----
From: Ian Pratt [mailto:m+Ian.Pratt@xxxxxxxxxxxx]
Sent: 08 August 2005 20:59
To: Andrew Morton; torvalds@xxxxxxxx
Cc: ian.pratt@xxxxxxxxxxxx
Subject: RE: vmware's virtual machine interface

> Ian, the vmware guys were sounding a little anxious that they hadn't
> heard anything back on the VMI proposal and spec?

The first few of their patches are fine -- just cleanups to existing arch code that we have similar patches for in our own tree. However, our views on the actual VMI interface haven't changed since the discussion at OLS, and we have serious reservations about the proposal.

I believe being able to override bits of kernel code and divert execution through a "ROM" image supplied by the hypervisor is going to lead to a maintenance nightmare. People making changes to the kernel won't be able to see what the ROM code is doing, and hence won't know how their changes affect it. There'll be pressure to freeze internal APIs, otherwise it will be a struggle to keep the 'ROM' up to date. I suspect we'll also end up with a proliferation of hook points that no one knows whether they're actually used or not (there are currently 86). There'll also be pressure to allocate opaque VMI private data areas in various structures such as struct mm and struct page.

Looking at the VMI hooks themselves, I don't think they've really thought through the design, at least not for a high-performance implementation. For example, they have an API for doing batched updates to PTEs. The problem with this approach is that it's difficult to avoid read-after-write hazards on queued PTE updates -- you need to sprinkle flushes liberally throughout arch-independent code. Working out where to put the flushes is tough: Xen 1.0 used this approach and we were never quite sure we had flushes in all the necessary places in Linux 2.4 -- that's why we abandoned the approach with Xen 2.0 and provided a new interface that avoids the problem entirely (and is also required for doing fast atomic updates, which are essential for good SMP guest performance).
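To make the hazard concrete, here is a minimal stand-alone sketch of a deferred PTE-update queue. It is not the VMI or Xen interface -- the names queue_pte_update() and flush_pte_updates() are invented for the example -- but it shows why any code that reads a PTE has to know that a flush is needed first:

/* Illustrative only: a queued/batched PTE update scheme and the
 * read-after-write hazard it creates.  All names are invented; in a
 * real kernel the flush would be a single batched hypercall. */
#include <stddef.h>
#include <stdio.h>

typedef unsigned long pte_t;

struct pte_update { pte_t *ptep; pte_t val; };

#define PTE_QUEUE_MAX 64
static struct pte_update pte_queue[PTE_QUEUE_MAX];
static size_t pte_queued;

/* Apply everything queued so far (stand-in for the batched hypercall). */
static void flush_pte_updates(void)
{
    for (size_t i = 0; i < pte_queued; i++)
        *pte_queue[i].ptep = pte_queue[i].val;
    pte_queued = 0;
}

/* Defer the write instead of performing it immediately. */
static void queue_pte_update(pte_t *ptep, pte_t val)
{
    pte_queue[pte_queued].ptep = ptep;
    pte_queue[pte_queued].val  = val;
    if (++pte_queued == PTE_QUEUE_MAX)
        flush_pte_updates();
}

int main(void)
{
    pte_t pte = 0;

    queue_pte_update(&pte, 0x1027);  /* map a page: frame 1, present/rw/accessed */

    /* The hazard: arch-independent code that inspects the PTE here sees
     * stale contents unless someone remembered to flush first. */
    printf("read before flush: %#lx\n", pte);   /* 0 -- stale      */

    flush_pte_updates();
    printf("read after flush:  %#lx\n", pte);   /* 0x1027 -- fresh */
    return 0;
}

The hard part in a real kernel is that the read often happens in generic mm code a long way from where the update was queued, which is exactly where the missing flushes in the Linux 2.4 port kept turning up.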
The current VMI design mostly looks at things at an instruction level, providing hooks for all the privileged instructions plus some for PTE handling. Xen's ABI is a bit different. We discovered that it wasn't worth creating hooks for many of the privileged instructions since they're so infrequently executed that you might as well take the trap and decode and emulate the instruction. The only ones that matter are on critical paths (such as the context switch path, demand fault, IPI, interrupt, fork, exec, exit etc.), and we've concentrated our efforts on making these paths go fast, driven by performance data.

As it stands, the VMI design wouldn't support several of the optimizations that we've found to be very important for getting near-native performance. The VMI design assumes you're using shadow page tables, but a substantial part of Xen's performance comes from avoiding their use. There's also no mention of SMP. This has been one of the trickiest bits to get right on Xen -- it's essential to be able to support SMP guests with very low overhead, and this required a few small but carefully placed changes to the way IPIs and memory management are handled (some of which have benefits on native too). The API doesn't address IO virtualization at all.

We tend to think of the hypervisor API like a hardware architecture: it's fairly fixed, but can be extended from time to time in a backward-compatible fashion (after considerable thought and examination of benchmark data, just as happens for h/w CPUs). The core parts of the Xen CPU API have been fixed for quite a while (there have been some changes to the para-virtualized IO driver APIs, but these are not addressed by VMI at all).

One attractive aspect of the VMI approach is that it's possible to have one kernel that works on native (at reduced performance) or on potentially multiple hypervisors. However, the real cost to Linux distros and ISVs of having multiple Linux kernels is the fact that they need to do all the s/w qualification on each one. The VMI approach doesn't change this at all: they will still have to do qualification tests on native, Xen, VMware etc. just as they do today[*]. Although it would be nice to be able to move a running kernel between different hypervisors at run time, I really can't see how VMI would make this feasible. There's far too much hidden state in the ROM and the hypervisor itself.

At an implementation level their design could be improved. Using function pointers to provide hook points causes unnecessary overhead -- it's better to insert 5-byte NOPs that can be easily patched.

In summary: the cleanup part of their patch is useful, but I think the VMI "ROM" approach is going to be messy and very troublesome to get right.

Chris Wright, Martin Bligh et al are currently making good progress refactoring the Xen patch to get it into a form that should be more palatable. [See http://lists.xensource.com/archives/html/xen-merge/ ] It wouldn't be a big deal to add VMI-like hooks to the Xen sub-arch if VMware want to go down that route (though we'd prefer to do it with NOP padding rather than by adding an unnecessary indirection).

Cheers,
Ian

[*] Having a single kernel image that works native and on a hypervisor is quite convenient from a user POV. We've looked into addressing this problem in a different way: building multiple kernels and then using a tool that does a function-by-function 'union' operation, merging the duplicates and creating a rewrite table that can be used to patch the kernel from native to Xen. This approach has no run-time overhead, and is entirely 'mechanical', rather than having to be done at source level, which can be both tricky and messy.
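For what it's worth, here's a rough userspace-only sketch of what such a rewrite table might look like. It is purely illustrative: the function names are invented, a real implementation would patch the kernel image at early boot rather than calling mprotect() on its own text, and it assumes x86-64 Linux compiled without optimization (cc -O0) so the direct calls aren't folded away. It uses the same 5-byte rel32 patching trick as the NOP-padding suggestion above:

/* Illustrative sketch of a "rewrite table": (native, replacement) function
 * pairs whose entry points get patched with a 5-byte "jmp rel32" once we
 * discover we're running on a hypervisor.  Userspace demo only; all names
 * are invented.  Assumes x86-64 Linux, compiled without optimization. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

__attribute__((noinline)) static long read_cr3_native(void) { return 1; }
__attribute__((noinline)) static long read_cr3_xen(void)    { return 2; }

struct rewrite { void (*native)(void); void (*repl)(void); };

/* The table a build-time 'union' tool would emit. */
static const struct rewrite table[] = {
    { (void (*)(void))read_cr3_native, (void (*)(void))read_cr3_xen },
};

static void apply_rewrites(void)
{
    long page = sysconf(_SC_PAGESIZE);

    for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
        uint8_t *dst = (uint8_t *)(uintptr_t)table[i].native;
        uint8_t *src = (uint8_t *)(uintptr_t)table[i].repl;
        uint8_t *aligned = (uint8_t *)((uintptr_t)dst & ~(uintptr_t)(page - 1));
        size_t len = (size_t)(dst + 5 - aligned);
        int32_t rel = (int32_t)(src - (dst + 5));

        /* Make the code temporarily writable, then write "jmp rel32" (0xE9). */
        if (mprotect(aligned, len, PROT_READ | PROT_WRITE | PROT_EXEC)) {
            perror("mprotect");
            return;
        }
        dst[0] = 0xE9;
        memcpy(dst + 1, &rel, sizeof(rel));
        mprotect(aligned, len, PROT_READ | PROT_EXEC);
    }
}

int main(void)
{
    printf("before patching: %ld\n", read_cr3_native());  /* 1 */
    apply_rewrites();               /* pretend we found a hypervisor */
    printf("after patching:  %ld\n", read_cr3_native());  /* 2 */
    return 0;
}

On native the table is simply never applied, so the patched sites cost nothing at run time.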