Re: [Xen-devel] [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
Disclaimer: this email got a bit lengthy - so make sure you have a cup of coffee when you read it.

> On an unrelated note I think if we do go down the route of having the
> guest kernel punch the holes itself and such we should do so iff
> XENMEM_memory_map returns either ENOSYS or nr_entries == 1 to leave open

When would that actually happen? Is that what gets returned when the hypervisor does not implement it (and in what version was it implemented)?

> the possibility of cunning tricks on the tools side in the future.

<shudders>

I think we have a few options in regards to this RFC patch I posted:

1). Continue with this and have the toolstack punch the PCI hole. It would fill the PCI hole area with INVALID_MFN. The toolstack determines where the PCI hole starts.

2). Do this in the guest, where the guest calls both XENMEM_machine_memory_map and XENMEM_memory_map to get an idea of the host-side PCI hole and sets it up itself. This requires changes in the hypervisor to allow a non-privileged PV guest to make the XENMEM_machine_memory_map call. The Linux kernel decides where the PCI hole starts, and the PCI hole is filled with INVALID_MFN.

3). Unconditionally make a PCI hole, starting at 3GB, filled with INVALID_MFN.

4). Another one I didn't think of?

In all of those cases, when devices show up we populate the P2M array with the MFNs on demand. For the first two proposals the BARs we read off the PCI devices are going to be written to the P2M array as identity mappings (so mfn_list[0xc0000] == 0xc0000). That code has not been written yet.

For the third proposal we would have non-identity mappings in the P2M array, as during migration we could move from a device with a BAR at 0xc0000 to one at 0x20000, so mfn_list[0xc0000] = 0x20000. But for the third case I am unsure how we would get the "real" MFNs. We initially get the BARs via 0xcf8 accesses, and if we don't filter them, the value reaches the ioremap function. Say the host-side BAR is at 0x20000 and our PCI hole starts at 0xc0000. ioremap gets called with 0x20000, and in the guest's E820 that region is 'System RAM':

	last_pfn = last_addr >> PAGE_SHIFT;
	for (pfn = phys_addr >> PAGE_SHIFT; pfn <= last_pfn; pfn++) {
		int is_ram = page_is_ram(pfn);

		if (is_ram && pfn_valid(pfn) &&
		    !PageReserved(pfn_to_page(pfn)))
			return NULL;
		WARN_ON_ONCE(is_ram);
	}

Ugh - and it will think (correctly) that the address falls within RAM. If we filter the 0xcf8 accesses, which we can do in the Xen PCI backend case, we can provide BARs that always start at 0xc0000. But that does not help the PV guest learn the "real" MFNs it needs to program into the P2M array - so the Xen PCI frontend would have to do this, which it could, though it adds complexity.

We also need to make sure all of this works with domain zero, and here 1) or 2) can easily be used, as the Xen hypervisor has given dom0 an E820 nicely peppered with holes. (I wonder - what happens if dom0 makes a XENMEM_memory_map call - does it get anything?)

If we then go with 3), we would need to instrument the code that reads the BARs so that it can filter them properly. That would be the low-level Linux pci_conf_read, and that is not going to happen - so we would have to make the Xen hypervisor aware of this and have it provide new BAR values starting at 0xc0000 when it traps the in/out accesses. I am not comfortable maintaining this filter/keep-state code in both the Xen hypervisor and the Xen PCI frontend module, so I think 3) would not work that well, unless there are better ways that I have missed?

Back to 1) and 2).
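To make 1) a bit more concrete, here is a rough sketch of what the toolstack side could look like. This is not the actual RFC patch: punch_pci_hole() is a made-up name, error handling is omitted, and I am assuming the libxc wrapper xc_domain_set_memory_map() (the XENMEM_set_memory_map hypercall) plus the libxc-internal e820 definitions:

	#include <stdint.h>
	#include <xenctrl.h>
	#include "xc_e820.h"    /* struct e820entry, E820_RAM (libxc-internal) */

	/*
	 * Sketch only: give the guest a two-entry E820 that is clipped at
	 * hole_start, so everything from hole_start up to 4GB is a hole.
	 * RAM displaced by the hole reappears above 4GB.  The P2M entries
	 * covering the hole would be filled with INVALID_MFN when the
	 * domain is built.
	 */
	static int punch_pci_hole(xc_interface *xch, uint32_t domid,
	                          uint64_t mem_bytes, uint64_t hole_start)
	{
	    struct e820entry map[2];
	    uint32_t nr = 0;

	    /* RAM from 0 up to the start of the PCI hole. */
	    map[nr].addr = 0;
	    map[nr].size = mem_bytes < hole_start ? mem_bytes : hole_start;
	    map[nr].type = E820_RAM;
	    nr++;

	    if (mem_bytes > hole_start) {
	        /* The rest of the guest's RAM goes above 4GB. */
	        map[nr].addr = 1ULL << 32;
	        map[nr].size = mem_bytes - hole_start;
	        map[nr].type = E820_RAM;
	        nr++;
	    }

	    return xc_domain_set_memory_map(xch, domid, map, nr);
	}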
Migration would work if we unplug the PCI devices before suspend and plug them back in on resume - otherwise the PCI BARs might have changed between migrations. When the guest gets recreated, how does it iterate over the E820 to create the P2M list (a sketch of what such a walk might look like follows at the end of this mail)? Or is that something that is not done, and we just save the P2M list and restore it as-is on the other side? Naturally, since we would unplug the PCI device, the entries in the E820 gaps would be INVALID_MFN...

If we consult the E820 during resume, I think doing the PCI hole in the toolstack has merit - simply because the user can set the PCI hole to an arbitrary address that is low enough (0x2000, say) to cover all of the machines he/she would migrate to, while if we do it in the Linux kernel we do not have that information. Even if we don't consult the E820, the toolstack still has merit, as the PCI hole start address might differ between the migration machines: we might have started on a box with the PCI hole way up (3.9GB) while the other machines might have it at 1.2GB.

The other thing I don't know is how all of this works with 32-bit kernels?

P.S. I've done the testing of 1) with 64-bit w/ and w/o ballooning and it worked fine.
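And the E820 walk mentioned above could look roughly like this on the Linux side. Again a hypothetical sketch, not code from the patch: I am assuming set_phys_to_machine() and INVALID_P2M_ENTRY from asm/xen/page.h, that the guest still has its E820 map around at resume time, and that the map is sorted; where exactly this would hook into the resume path is the open question above:

	#include <asm/e820.h>       /* struct e820entry */
	#include <asm/xen/page.h>   /* set_phys_to_machine(), INVALID_P2M_ENTRY */

	/*
	 * Sketch only: walk the (sorted) guest E820 and invalidate every
	 * PFN that falls in a gap between entries, so stale BAR mappings
	 * from before the migration do not linger in the P2M.  Non-RAM
	 * entries (E820_RESERVED etc.) are left alone for simplicity.
	 */
	static void xen_invalidate_e820_holes(const struct e820entry *map,
	                                      int nr_entries)
	{
	    unsigned long pfn, start, prev_end = 0;
	    int i;

	    for (i = 0; i < nr_entries; i++) {
	        start = map[i].addr >> PAGE_SHIFT;

	        /* Everything between the previous entry and this
	         * one is a hole - mark it invalid in the P2M. */
	        for (pfn = prev_end; pfn < start; pfn++)
	            set_phys_to_machine(pfn, INVALID_P2M_ENTRY);

	        prev_end = (map[i].addr + map[i].size) >> PAGE_SHIFT;
	    }
	}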