Hi all,
these are notes for various design discussions wrapped into one. I will also link these from
https://docs.google.com/document/d/1fWQMuiblTmiNkWGGNbz20AQtbbnvtohxnUjhX2pA2jU/edit and
https://wiki.xenproject.org/wiki/Design_Sessions
== NVDIMM ==
The attached "Virtual NVDIMM Discussion note_Xensubmit.pdf", in text form:
Virtual NVDIMM Status in XEN
## Discussion I
Obstacles to exposing NVDIMM to DomU
George: Added a Xen-specific label:
Todo: init/activate/map/unmap the NVDIMM pmem address space.
We will build the frame table when activating/promoting the pmem to a DomU.
Trust that dom0 has unmapped it before activation.
George: Where to put the frame table?
Option a: A super block with a single namespace?
Option b: Inside each namespace?
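A rough sketch of what such a Xen-specific label could look like, just to illustrate the two placement options; none of these names or fields are agreed, they are purely hypothetical:

    #include <stdint.h>

    /* Hypothetical layout only -- not a proposed format. */
    struct xen_pmem_label {
        uint64_t magic;           /* identifies a Xen-managed pmem region      */
        uint64_t frametable_off;  /* offset of the frame table within the pmem */
        uint64_t frametable_len;  /* size of the frame table, in bytes         */
        uint64_t data_off;        /* start of the usable data area             */
        uint64_t data_len;        /* size of the usable data area              */
    };

    /*
     * Option (a): one such label in a super block describing a single
     *             namespace.
     * Option (b): one label embedded at the start of each namespace, so
     *             every namespace carries its own frame table.
     */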
## Discussion II
Xen Dom0 doesn’t have huge page support.
Yu: The size of an NVDIMM is tremendous; the paging structures that map the NVDIMM, and even
its struct page entries, occupy a huge amount of RAM. Since the access latency of NVDIMM is clearly
higher than that of DRAM, these paging structures should be kept in DRAM, which is much
smaller than the NVDIMM. The Linux driver team solved this by using huge pages in DRAM to map the
NVDIMM and its management information (such as struct page). But in Xen this cannot be
achieved without drastic changes to the PV MMU logic, which exposes no PSE to
dom0.
George: We could try to use PVH Dom0 instead.
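To make the RAM-overhead point above concrete, here is a back-of-the-envelope calculation. It is only a sketch: the 1 TiB NVDIMM size is an assumed example, and 64 bytes per struct page is the usual Linux figure.

    /* Back-of-the-envelope overhead for a 1 TiB NVDIMM (illustrative only). */
    #include <stdio.h>

    int main(void)
    {
        unsigned long long nvdimm = 1ULL << 40;       /* 1 TiB, assumed size   */
        unsigned long long pages_4k = nvdimm >> 12;   /* number of 4 KiB pages */

        /* struct page bookkeeping: ~64 bytes per 4 KiB page in Linux. */
        unsigned long long struct_page = pages_4k * 64;

        /* Last-level page-table entries, 8 bytes each. */
        unsigned long long pt_4k = pages_4k * 8;        /* 4 KiB mappings */
        unsigned long long pt_2m = (nvdimm >> 21) * 8;  /* 2 MiB mappings */

        printf("struct page:       %llu MiB\n", struct_page >> 20); /* 16384 MiB */
        printf("page tables (4K):  %llu MiB\n", pt_4k >> 20);       /*  2048 MiB */
        printf("page tables (2M):  %llu MiB\n", pt_2m >> 20);       /*     4 MiB */
        return 0;
    }

The last two lines are why mapping the NVDIMM with huge pages in DRAM matters so much.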
## Discussion III
fsdax vs devdax:
Yi: A file on fsdax is easier to manage, but devdax gives better
performance. We also have a filesystem rearrangement issue when using an fsdax file as the
backend.
Andrew: for the fs rearrangement issue, we can try LVM on the DAX device.
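For reference, the difference between the two backends at the application level looks roughly like this. This is only a sketch: the device and file paths are placeholder examples, and requesting MAP_SYNC on the fsdax path is where the block-rearrangement problem surfaces, since the mmap() fails if the filesystem cannot guarantee stable block mappings.

    /* Sketch: mapping pmem via devdax vs. an fsdax file (paths are examples). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>

    #ifndef MAP_SYNC
    #define MAP_SYNC            0x80000
    #endif
    #ifndef MAP_SHARED_VALIDATE
    #define MAP_SHARED_VALIDATE 0x03
    #endif

    #define LEN (16UL << 20) /* 16 MiB, arbitrary */

    int main(void)
    {
        /* devdax: a character device, no filesystem in the way. */
        int dfd = open("/dev/dax0.0", O_RDWR);
        void *dev = (dfd >= 0) ?
            mmap(NULL, LEN, PROT_READ | PROT_WRITE, MAP_SHARED, dfd, 0) :
            MAP_FAILED;

        /* fsdax: a regular file on a DAX-mounted filesystem; MAP_SYNC is
         * refused if the fs cannot keep the block mapping stable. */
        int ffd = open("/mnt/pmem/backing-file", O_RDWR);
        void *fs = (ffd >= 0) ?
            mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                 MAP_SHARED_VALIDATE | MAP_SYNC, ffd, 0) :
            MAP_FAILED;

        printf("devdax map: %s, fsdax map: %s\n",
               dev == MAP_FAILED ? "failed" : "ok",
               fs  == MAP_FAILED ? "failed" : "ok");
        return 0;
    }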
== Processor Trace ==
Background slides: see
https://www.slideshare.net/xen_com_mgr/xpdds18-eptbased-subpage-write-protection-on-xenc-yi-zhang-intel
> We don't need to pass through the Intel PT MSRs at any time; all the MSR reads/writes
can be trapped.
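A minimal sketch of what that trapping could look like. The handler and the per-vCPU shadow state are hypothetical (this is not the proposed patch set); only the IA32_RTIT_* MSR indices are architectural.

    /* Hypothetical MSR-intercept sketch; not actual Xen code. */
    #include <stdbool.h>
    #include <stdint.h>

    /* Architectural Intel PT MSR indices (Intel SDM). */
    #define MSR_IA32_RTIT_OUTPUT_BASE      0x00000560
    #define MSR_IA32_RTIT_OUTPUT_MASK_PTRS 0x00000561
    #define MSR_IA32_RTIT_CTL              0x00000570
    #define MSR_IA32_RTIT_STATUS           0x00000571

    struct pt_vcpu_state {            /* hypothetical per-vCPU shadow state */
        uint64_t ctl, status, output_base, output_mask;
    };

    /* Return true if the write was trapped and shadowed here. */
    static bool pt_wrmsr_intercept(struct pt_vcpu_state *pt,
                                   uint32_t msr, uint64_t val)
    {
        switch ( msr )
        {
        case MSR_IA32_RTIT_CTL:              pt->ctl = val;         return true;
        case MSR_IA32_RTIT_STATUS:           pt->status = val;      return true;
        case MSR_IA32_RTIT_OUTPUT_BASE:      pt->output_base = val; return true;
        case MSR_IA32_RTIT_OUTPUT_MASK_PTRS: pt->output_mask = val; return true;
        }
        return false; /* not a PT MSR: fall through to the normal path */
    }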
> Implement "SYSTEM" mode in the XEN hypervisor:
Andrew: Add a new PV interface for Dom0 to set the Intel PT buffer for the XEN
hypervisor. Dom0 can be trusted to set this up correctly; a PV domU cannot be trusted;
> There is a potential risk with nesting (using system mode in an L1 guest, with EPT-on-EPT)
Andrew: Feel free to ignore nested support.
> About introspection:
Andrew: For the first version we can ignore VM introspection, but we will need to fabricate PIP packets for CR3
changes, FUP packets for interrupts, packet generation for enabling
& disabling, and mode packets.
> VM-exit due to Intel PT output:
Illegal PT output buffer address from the guest:
Andrew: we cannot prevent the guest from setting up an illegal address; detect it via EPT violation => OK to crash the guest;
Setting an MMIO address in the guest as the Intel PT buffer:
Andrew: OK to crash the guest;
A page with write protection:
Andrew: more complicated; it is basically entirely reasonable to still crash the guest, but we would need to know the reason and distinguish the cases individually (see the sketch below);
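A sketch of the resulting policy. The fault classification and the handler are made up for illustration; domain_crash() is the existing Xen primitive (declared in xen/sched.h in-tree), with a stand-in declaration here so the fragment is self-contained.

    /* Illustrative policy only; the classification and helper are made up. */
    struct domain;
    void domain_crash(struct domain *d);   /* stand-in for the Xen primitive */

    enum pt_fault_kind {
        PT_FAULT_ILLEGAL_ADDR,    /* output base outside guest RAM        */
        PT_FAULT_MMIO_TARGET,     /* output base points at emulated MMIO  */
        PT_FAULT_WRITE_PROTECTED, /* output page is write-protected       */
    };

    static void pt_handle_output_fault(struct domain *d, enum pt_fault_kind kind)
    {
        switch ( kind )
        {
        case PT_FAULT_ILLEGAL_ADDR:
        case PT_FAULT_MMIO_TARGET:
            /* The guest programmed a nonsensical buffer: crashing it is fine. */
            domain_crash(d);
            break;

        case PT_FAULT_WRITE_PROTECTED:
            /*
             * Still reasonable to crash, but record the reason and
             * distinguish the individual cases before doing so.
             */
            domain_crash(d);
            break;
        }
    }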
> Intel PT has a number of sub-features, and the available set differs between hardware platforms.
Can we expose all sub-features to the guest?
George: make sure the common PT sub-features are used;
Andrew: Yes, we want to, but we don't need to expose all of them by default (that may break live migration);
Andrew: It is acceptable that the first version doesn't turn on PT VMX for guests by default, but only turns it on when
needed in a guest, and then prevents live migration for that guest;
Andrew: the current toolstack has no mechanism to prevent an incorrect migration between CPU feature sets; a future implementation should provide one to
detect whether live migration can be done;
> Add more description of this feature in the next patch set version.
Andrew/Lars: We need to add a patch in the next version describing what Intel PT is, how to use this feature,
its current limitations, and so on.
== More vCPUs in an HVM guest ==
About support for multiple IOREQ pages:
Paul: ioreq_t is a stable ABI, hence QEMU can calculate how many IOREQ pages are needed for a given number of vCPUs.
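A sketch of that calculation, assuming sizeof(ioreq_t) == 32 bytes as in the current public ABI, so one 4 KiB IOREQ page covers 128 vCPUs:

    /* How many IOREQ pages a device model needs for a given vCPU count.
     * Assumes sizeof(ioreq_t) == 32 as in the current public ABI. */
    #include <stdio.h>

    #define PAGE_SIZE    4096u
    #define IOREQ_T_SIZE 32u   /* sizeof(ioreq_t), stable ABI */

    static unsigned int ioreq_pages(unsigned int nr_vcpus)
    {
        unsigned int per_page = PAGE_SIZE / IOREQ_T_SIZE;    /* 128 */
        return (nr_vcpus + per_page - 1) / per_page;         /* round up */
    }

    int main(void)
    {
        printf("128 vCPUs -> %u page(s)\n", ioreq_pages(128));  /* 1 */
        printf("288 vCPUs -> %u page(s)\n", ioreq_pages(288));  /* 3 */
        return 0;
    }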
When to switch from xAPIC mode to x2APIC mode:
Andrew: hvmloader isn’t a good place. We always have a misunderstanding that hvmloader is the virtual BIOS. Actually it isn’t. We shouldn’t rely on a program running in the guest to do this switch.
Chao: Roger suggested doing this switch in a common path shared by PVH and HVM. Is the handler of xc_domain_max_vcpus() a good place?
It may run into problems, e.g. the CPUID policy hasn't been set in Xen at that point. An alternative is to switch to x2APIC mode when the CPUID policy is applied.
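Wherever the switch ends up, the underlying rule is simple: xAPIC IDs are 8 bits wide and 0xFF is the broadcast ID, so any topology that needs APIC IDs above 254 must use x2APIC. A trivial sketch of that rule; the helper name and the APIC-ID stride parameter are illustrative, not Xen's actual ID assignment.

    /* Illustrative only: does this topology require x2APIC mode? */
    #include <stdbool.h>

    static bool needs_x2apic(unsigned int nr_vcpus, unsigned int apic_id_stride)
    {
        unsigned int max_apic_id = (nr_vcpus - 1) * apic_id_stride;
        return max_apic_id > 254;   /* beyond what 8-bit xAPIC IDs can address */
    }

With a stride of 1 this triggers above 255 vCPUs; topologies that space APIC IDs more widely hit the limit at correspondingly lower vCPU counts.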
Other suggestions:
Andrew: "288 vCPUs" is a naming issue, because even more vCPUs can be supported; the series title could be changed to "support more than 255 vCPUs".
George & Andrew: some scheduler changes are needed. If there are multiple big VMs, we may want to use the virtual core as the scheduling entity to achieve the best performance.
Andrew: this series should be divided into several small, independent series.
Focus on the vIOMMU part first, and split out the other parts that can be separated.