
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen



On Wed, 3 Feb 2016, Haozhong Zhang wrote:
> On 02/02/16 17:11, Stefano Stabellini wrote:
> > Haozhong, thanks for your work!
> > 
> > On Mon, 1 Feb 2016, Haozhong Zhang wrote:
> > > 3.2 Address Mapping
> > > 
> > > 3.2.1 My Design
> > > 
> > >  The overview of this design is shown in the following figure.
> > > 
> > >                  Dom0                         |               DomU
> > >                                               |
> > >                                               |
> > >  QEMU                                         |
> > >      +...+--------------------+...+-----+     |
> > >   VA |   | Label Storage Area |   | buf |     |
> > >      +...+--------------------+...+-----+     |
> > >                      ^            ^     ^     |
> > >                      |            |     |     |
> > >                      V            |     |     |
> > >      +-------+   +-------+        mmap(2)     |
> > >      | vACPI |   | v_DSM |        |     |     |        +----+------------+
> > >      +-------+   +-------+        |     |     |   SPA  |    | /dev/pmem0 |
> > >          ^           ^     +------+     |     |        +----+------------+
> > >  --------|-----------|-----|------------|--   |             ^            ^
> > >          |           |     |            |     |             |            |
> > >          |    +------+     +------------~-----~-------------+            |
> > >          |    |            |            |     |   XEN_DOMCTL_memory_mapping
> > >          |    |            |            +-----~--------------------------+
> > >          |    |            |            |     |
> > >          |    |       +----+------------+     |
> > >  Linux   |    |   SPA |    | /dev/pmem0 |     |     +------+   +------+
> > >          |    |       +----+------------+     |     | ACPI |   | _DSM |
> > >          |    |                   ^           |     +------+   +------+
> > >          |    |                   |           |         |          |
> > >          |    |               Dom0 Driver     |   hvmloader/xl     |
> > >  --------|----|-------------------|---------------------|----------|---------------
> > >          |    +-------------------~---------------------~----------+
> > >  Xen     |                        |                     |
> > >          +------------------------~---------------------+
> > >  ---------------------------------|------------------------------------------------
> > >                                   +----------------+
> > >                                                    |
> > >                                             +-------------+
> > >  HW                                         |    NVDIMM   |
> > >                                             +-------------+
> > > 
> > > 
> > >  This design treats host NVDIMM devices as ordinary MMIO devices:
> > >  (1) The Dom0 Linux NVDIMM driver is responsible for detecting
> > >      (through the NFIT) and driving host NVDIMM devices (it
> > >      implements a block device interface). Namespaces and file
> > >      systems on host NVDIMM devices are handled by Dom0 Linux as
> > >      well.
> > > 
> > >  (2) QEMU mmap(2)s the pmem NVDIMM device (/dev/pmem0) into its
> > >      virtual address space (buf).
> > > 
> > >  (3) QEMU gets the host physical address of buf, i.e. the host
> > >      system physical address occupied by /dev/pmem0, and calls the
> > >      Xen hypercall XEN_DOMCTL_memory_mapping to map it into a DomU.
> > 
> > How is this going to work from a security perspective? Is it going to
> > require running QEMU as root in Dom0, which will prevent NVDIMM from
> > working by default on Xen? If so, what's the plan?
> >
> 
> Oh, I forgot to address the non-root qemu issues in this design ...
> 
> The default user:group of /dev/pmem0 is root:disk, and its permission
> bits are rw-rw----. We could extend the "others" permission to rw so
> that a non-root QEMU can mmap /dev/pmem0, but that looks too risky.

Yep, too risky.


> Or, we can make a file system on /dev/pmem0, create files on it, set
> the owner of those files to xen-qemuuser-domid$domid, and then pass
> those files to QEMU. In this way, non-root QEMU should be able to
> mmap those files.

Maybe that would work. It's worth adding to the design; I would like to
read more details on it.

Also note that QEMU initially runs as root but drops privileges to
xen-qemuuser-domid$domid before the guest is started. QEMU *could* mmap
/dev/pmem0 while it is still running as root, but that wouldn't work
for any devices that need to be mmap'ed at run time (the hotplug
scenario).


> > >  (ACPI part is described in Section 3.3 later)
> > > 
> > >  Steps (1) and (2) above are already done in current QEMU; only (3)
> > >  needs to be implemented in QEMU. No change is needed in Xen for
> > >  address mapping in this design.
> > > 
> > >  Open: It seems that no system call/ioctl is provided by the Linux
> > >        kernel to get the physical address corresponding to a virtual
> > >        address. /proc/<qemu_pid>/pagemap provides the VA-to-PA
> > >        mapping information. Is it an acceptable solution to let QEMU
> > >        parse this file to get the physical address?
> > 
> > Does it work in a non-root scenario?
> >
> 
> Seemingly no, according to Documentation/vm/pagemap.txt in Linux kernel:
> | Since Linux 4.0 only users with the CAP_SYS_ADMIN capability can get PFNs.
> | In 4.0 and 4.1 opens by unprivileged fail with -EPERM.  Starting from
> | 4.2 the PFN field is zeroed if the user does not have CAP_SYS_ADMIN.
> | Reason: information about PFNs helps in exploiting Rowhammer vulnerability.
>
> A possible alternative is to add a new hypercall similar to
> XEN_DOMCTL_memory_mapping but receiving a virtual address as the
> address parameter and translating it to a machine address in the
> hypervisor.

That might work.


> > >  Open: For a large pmem device, mmap(2) may well not map the entire
> > >        SPA range occupied by pmem at the beginning, i.e. QEMU may
> > >        not be able to get all SPAs of pmem from buf (in its virtual
> > >        address space) when calling XEN_DOMCTL_memory_mapping.
> > >        Can the mmap flag MAP_LOCKED or mlock(2) be used to force the
> > >        entire pmem device to be mapped?
> > 
> > Ditto
> >
> 
> No. But if I take the above alternative for the first open, the new
> hypercall could inject page faults into Dom0 for the unmapped virtual
> addresses, forcing Dom0 Linux to create the page mappings.

Otherwise you need to use something like the mapcache in QEMU
(xen-mapcache.c), which admittedly, given its complexity, would be best
to avoid.


> > > 3.2.2 Alternative Design
> > > 
> > >  Jan Beulich's comments [7] on my question "why must pmem resource
> > >  management and partition be done in hypervisor":
> > >  | Because that's where memory management belongs. And PMEM,
> > >  | other than PBLK, is just another form of RAM.
> > >  | ...
> > >  | The main issue is that this would imo be a layering violation
> > > 
> > >  George Dunlap's comments [8]:
> > >  | This is not the case for PMEM.  The whole point of PMEM (correct me if
> > >    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ used as fungible ram
> > >  | I'm wrong) is to be used for long-term storage that survives over
> > >  | reboot.  It matters very much that a guest be given the same PRAM
> > >  | after the host is rebooted that it was given before.  It doesn't make
> > >  | any sense to manage it the way Xen currently manages RAM (i.e., that
> > >  | you request a page and get whatever Xen happens to give you).
> > >  |
> > >  | So if Xen is going to use PMEM, it will have to invent an entirely new
> > >  | interface for guests, and it will have to keep track of those
> > >  | resources across host reboots.  In other words, it will have to
> > >  | duplicate all the work that Linux already does.  What do we gain from
> > >  | that duplication?  Why not just leverage what's already implemented in
> > >  | dom0?
> > >  and [9]:
> > >  | Oh, right -- yes, if the usage model of PRAM is just "cheap slow RAM",
> > >  | then you're right -- it is just another form of RAM, that should be
> > >  | treated no differently than say, lowmem: a fungible resource that can be
> > >  | requested by setting a flag.
> > > 
> > >  However, pmem is used more as persistent storage than as fungible
> > >  RAM, and my design is for the former usage. I would like to leave
> > >  the detection, driving and partitioning (through either namespaces
> > >  or file systems) of NVDIMM to the Dom0 Linux kernel.
> > > 
> > >  I notice that the current XEN_DOMCTL_memory_mapping does no sanity
> > >  check on the physical address and size passed by the caller
> > >  (QEMU). Can QEMU always be trusted? If not, we would need to make
> > >  Xen aware of the SPA ranges of pmem so that it can refuse to map
> > >  physical addresses that lie in neither normal RAM nor pmem.
> > 
> > Indeed
> > 
> > 
> [...]
> > > 
> > > 
> > > 3.3 Guest ACPI Emulation
> > > 
> > > 3.3.1 My Design
> > > 
> > >  Guest ACPI emulation is composed of two parts: building guest NFIT
> > >  and SSDT that defines ACPI namespace devices for NVDIMM, and
> > >  emulating guest _DSM.
> > > 
> > >  (1) Building Guest ACPI Tables
> > > 
> > >   This design reuses and extends hvmloader's existing mechanism that
> > >   loads passthrough ACPI tables from binary files to load NFIT and
> > >   SSDT tables built by QEMU:
> > >   1) Because current QEMU does not build any ACPI tables when it
> > >      runs as the Xen device model, this design needs to patch QEMU
> > >      to build the NFIT and SSDT (so far only these two) in that
> > >      case.
> > > 
> > >   2) QEMU copies NFIT and SSDT to the end of guest memory below
> > >      4G. The guest address and size of those tables are written into
> > >      xenstore (/local/domain/domid/hvmloader/dm-acpi/{address,length}).
> > > 
> > >   3) hvmloader is patched to probe and load device model passthrough
> > >      ACPI tables from the above xenstore keys. The detected ACPI
> > >      tables are then appended to the end of the existing guest ACPI
> > >      tables, just like current construct_passthrough_tables() does.
> > > 
> > >   Reasons for this design are listed below:
> > >   - The NFIT and SSDT in question are quite self-contained, i.e.
> > >     they do not refer to other ACPI tables and do not conflict with
> > >     the existing guest ACPI tables in Xen. Therefore, it is safe to
> > >     copy them from QEMU and append them to the existing guest ACPI
> > >     tables.
> > > 
> > >   - A primary portion of the current and future vNVDIMM
> > >     implementation is about building ACPI tables. This design also
> > >     leaves the emulation of _DSM to QEMU, which needs to stay
> > >     consistent with the NFIT and SSDT it builds. Therefore, reusing
> > >     the NFIT and SSDT from QEMU eases maintenance.
> > > 
> > >   - Anthony's work to pass ACPI tables from the toolstack to
> > >     hvmloader does not move the building of the SSDT (and NFIT) to
> > >     the toolstack, so this design can still put them in hvmloader.
> > 
> > If we start asking QEMU to build ACPI tables, why should we stop at NFIT
> > and SSDT?
> 
For easing my development of vNVDIMM support in Xen ... I mean the
NFIT and SSDT are the only two tables needed for this purpose, and I'm
afraid of breaking existing guests if I completely switch to QEMU for
the guest ACPI tables.

I realize that my words have been a bit confusing. Not /all/ ACPI
tables, just all the tables regarding devices for which QEMU is in
charge (the PCI bus and all devices behind it). Anything related to cpus
and memory (FADT, MADT, etc) would still be left to hvmloader.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

