
Re: [Xen-devel] PVH CPU hotplug design document



On Thu, Jan 12, 2017 at 07:00:57PM +0000, Andrew Cooper wrote:
> On 12/01/17 12:13, Roger Pau Monné wrote:
[...]
> > ## QEMU CPU hotplug using ACPI
> >
> > The ACPI tables provided to HVM guests contain processor objects, as
> > created by libacpi. The number of processor objects in the ACPI
> > namespace matches the maximum number of processors supported by HVM
> > guests (up to 128 at the time of writing). Processors currently disabled
> > are marked as such in the MADT and in their \_MAT and \_STA methods.
> >
> > A PRST operation region in I/O space is also defined, with a size of
> > 128 bits, which is used as a bitmap of enabled vCPUs on the system. A
> > PRSC method is provided in order to check for updates to the PRST region
> > and trigger notifications on the affected processor objects. The PRSC
> > method is executed from a GPE event handler. The OSPM then checks the
> > value returned by \_STA for the ACPI\_STA\_DEVICE\_PRESENT flag in order
> > to determine whether the vCPU has been enabled.
> 
> Is it worth describing the toolstack side of hotplug? It is equally
> relevant IMO.

By toolstack I assume you mean the hypercalls or xenstore writes that are
performed in order to notify QEMU or Xen of new vCPUs?

I haven't looked much into this, but I guess Boris could fill in some of the
details.
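
For completeness, my understanding is that the xenstore side boils down to
toggling the per-vCPU availability nodes, which the device model watches in
order to raise the ACPI GPE into the guest. A minimal sketch (path layout
assumed from libxl; newer QEMUs might be notified differently):

#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <xenstore.h>

/*
 * Sketch only: set a vCPU's availability node in xenstore.  The path
 * layout is assumed from libxl; the device model watches these nodes and
 * injects the ACPI GPE into the guest in response.
 */
static int set_vcpu_availability(struct xs_handle *xs, unsigned int domid,
                                 unsigned int vcpu, bool online)
{
    const char *val = online ? "online" : "offline";
    char path[64];

    snprintf(path, sizeof(path),
             "/local/domain/%u/cpu/%u/availability", domid, vcpu);
    return xs_write(xs, XBT_NULL, path, val, strlen(val)) ? 0 : -1;
}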

> >
> > ## Native CPU hotplug
> >
> > OSPM waits for a notification from ACPI on the processor object, and
> > when an event is received the return value of \_STA is checked in order
> > to see whether ACPI\_STA\_DEVICE\_PRESENT is set. This notification is
> > triggered from the method of a GPE block.
> >
> > # PVH CPU hotplug
> >
> > The aim, as stated in the introduction, is to use a method as similar
> > as possible to bare metal CPU hotplug for PVH. This is feasible for
> > unprivileged domains, since the ACPI tables can be created by the
> > toolstack and provided to the guest. Then a minimal I/O or memory
> > handler will be added to Xen in order to report the bitmap of enabled
> > vCPUs. There's already a [series][0] posted to xen-devel that implements
> > this functionality for unprivileged PVH guests.
> >
> > This however has proven to be quite difficult to implement for the
> > hardware domain, since it has to manage both pCPUs and vCPUs. The
> > hardware domain should be able to notify Xen of the addition of new
> > pCPUs, so that they can be used by the hypervisor, and also be able to
> > hotplug new vCPUs for its own usage. Since Xen cannot access the dynamic
> > (AML) ACPI tables, because it lacks an AML parser, it is the duty of the
> > hardware domain to parse those tables and notify Xen of relevant events.
> >
> > There are several related issues here that prevent a straightforward
> > solution:
> >
> >  * Xen cannot parse AML tables, and thus cannot get notifications from
> >    ACPI events. And even if Xen could parse those tables, there can only
> >    be one OSPM registered with ACPI.
> 
> There can indeed only be one OSPM, which is the entity that executes AML
> methods and receives external interrupts from ACPI-related things.
> 
> However, dom0 being OSPM does not prohibit Xen from reading and parsing
> the AML (should we choose to include that functionality in the
> hypervisor).  Xen is fine to do anything it wants in terms of reading
> and interpreting the tables, so long as it doesn't start executing AML
> bytecode.

I would like to see this too, since it would allow Xen to see the CPU power
states and shut down the hardware without using the complicated mess that we
currently have in order to perform ACPI shutdown.

> >  * Xen can provide a valid MADT table to the hardware domain that
> >    describes the environment in which the hardware domain is running,
> >    but it cannot prevent the hardware domain from seeing the real
> >    processor devices in the ACPI namespace, neither Xen can provide
> >    the hardware domain with processor
> 
> ", nor can Xen provide the..."
> 
> >    devices that match the vCPUs at the moment.
> >
> > [0]: https://lists.xenproject.org/archives/html/xen-devel/2017-01/msg00060.html
> >
> > ## Proposed solution using the STAO
> >
> > The general idea of this method is to use the STAO in order to hide the
> > pCPUs from the hardware domain, and provide processor objects for vCPUs
> > in an extra SSDT table.
> >
> > This method requires one change to the STAO, in order to be able to
> > notify the hardware domain of which processors found in ACPI tables are
> > pCPUs. The description of the new STAO field is as follows:
> >
> >  |   Field            | Byte Length | Byte Offset |     Description          |
> >  |--------------------|:-----------:|:-----------:|--------------------------|
> >  | Processor List [n] |      -      |      -      | A list of ACPI numbers,  |
> >  |                    |             |             | where each number is the |
> >  |                    |             |             | Processor UID of a       |
> >  |                    |             |             | physical CPU, and should |
> >  |                    |             |             | be treated specially by  |
> >  |                    |             |             | the OSPM                 |
> >
> > The list of UIDs in this new field would be matched against the ACPI
> > Processor UID field found in local/x2 APIC MADT structs and Processor
> > objects in the ACPI namespace, and the OSPM should either ignore those
> > objects, or in case it implements pCPU hotplug, it should notify Xen of
> > changes to these objects.
> >
> > The contents of the MADT provided to the hardware domain are also going
> > to be different from the contents of the MADT as found in native ACPI.
> > The local/x2 APIC entries for all the pCPUs are going to be marked as
> > disabled.
> >
> > Extra entries are going to be added for each vCPU available to the
> > hardware domain, up to the maximum number of supported vCPUs. Note that
> > the number of supported vCPUs might differ from the number of enabled
> > vCPUs, so it's possible that some of these entries are also going to be
> > marked as disabled. The entries for vCPUs in the MADT are going to use a
> > processor local x2 APIC structure, and the ACPI processor ID of the
> > first vCPU is going to be UINT32_MAX - HVM_MAX_VCPUS, in order to avoid
> > clashes with the IDs of pCPUs.
> 
> This is slightly problematic.  There is no restriction (so far as I am
> aware) on which ACPI IDs the firmware picks for its objects.  They need
> not be consecutive, logical, or start from 0.
> 
> If STAO is being extended to list the IDs of the physical processor
> objects, we should go one step further and explicitly list the IDs of
> the virtual processor objects.  This leaves us flexibility if we have to
> avoid awkward firmware ID layouts.
> 
> It is also worth stating that this puts an upper limit on nr_pcpus +
> nr_dom0_vcpus (but 4 billion processors really ought to be enough for
> anyone...)

Right, I think that I will change that to instead use dynamic ACPI processor
UIDs, and have Xen replace them in the AML.

> > In order to be able to perform vCPU hotplug, the vCPUs must have an ACPI
> > processor object in the ACPI namespace, so that the OSPM can request
> > notifications and get the value of the \_STA and \_MAT methods. This can be
> > problematic because Xen doesn't know the ACPI name of the other processor
> > objects, so blindly adding new ones can create namespace clashes.
> >
> > This can be solved by using a different ACPI name in order to describe
> > vCPUs in the ACPI namespace. Most hardware vendors tend to use CPU or PR
> > prefixes for the processor objects, so using a 'VP' (ie: Virtual
> > Processor) prefix should prevent clashes.
> 
> One system I have to hand (with more than 255 pcpus) uses Cxxx
> 
> To avoid namespace collisions, I can't see any option but to parse the
> DSDT/SSDTs to at least confirm that VPxx is available to use.

Hm, what about defining a new bus for Xen, so the SSDT would look like:

Device ( \_SB.XEN ) {
    Name ( _HID, "ACPI0004" ) /* ACPI Module Device */
}
Scope ( \_SB.XEN ) {
    OperationRegion ( ... )
    Processor ( VP00, 0, 0x0000b010, 0x06 ) {
        ...
    }
    Processor ( VP01, 1, 0x0000b010, 0x06 ) {
        [...]
    }
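    /* PRST: bitmap of enabled vCPUs as exposed by Xen, one bit per vCPU. */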
    OperationRegion ( PRST, SystemIO, 0xaf00, 1 )
    Field ( PRST, ByteAcc, NoLock, Preserve ) {
        PRS, 2
    }
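    /* Compare PRS against each vCPU's cached FLG value and Notify on changes. */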
    Method ( PRSC, 0 ) {
        Store ( ToBuffer(PRS), Local0 )
        Store ( DerefOf(Index(Local0, 0)), Local1 )
        And ( Local1, 1, Local2 )
        If ( LNotEqual(Local2, \_SB.XEN.VP00.FLG) ) {
            Store ( Local2, \_SB.XEN.VP00.FLG )
            If ( LEqual(Local2, 1) ) {
                Notify ( VP00, 1 )
                Subtract ( \_SB.XEN.MSU, 1, \_SB.XEN.MSU )
            }
            Else {
                Notify ( VP00, 3 )
                Add ( \_SB.XEN.MSU, 1, \_SB.XEN.MSU )
            }
        }
        [...]
        Return ( One )
    }
}
Device ( \_SB.XEN.GPEX ) {
    Name ( _HID, "ACPI0006" )
    Name ( _UID, "XENGPE" )
    Name ( _CRS, ResourceTemplate() {
        IO ( Decode16, 0xafe0, 0xafe0, 0x00, 0x4 )
    } )
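    /* _E02: handler for GPE bit 2, used here to signal vCPU bitmap changes. */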
    Method ( _E02 ) {
        \_SB.XEN.PRSC ()
    }
}

With this I think we should be able to prevent any ACPI namespace clash; TBH,
I don't think vendors will ever use a "XEN" bus. Is there any way we could
reserve such a namespace with ACPI (_SB.XEN*)?

> This also has a chance of collision, both with the system ACPI
> controller, and also with PCIe devices advertising IO-BARs.  (All
> graphics cards ever have IO-BARs, because Windows refuses to bind a
> graphics driver to a PCI graphics device if the PCI device doesn't have
> at least one IO-BAR.  Because PCIe requires 4k alignment on the upstream
> bridge IO-windows, there is a surprisingly low limit on the number of
> graphics cards you can put in a server and have functioning to Windows'
> satisfaction.)

Yes, I'm thinking about using SystemMemory instead of SystemIO; this way we
can use a guest RAM region that surely will not clash with anything else.
This will require transforming the I/O handler into a memory handler, but it
doesn't look that complicated (and we will also waste a full memory page for
it, but alas).
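
The delta over the snippet above would be minimal; roughly the following
(using the same 0xdeadbeef address placeholder that Xen would later patch, as
described below, and one page as the region size):

OperationRegion ( PRST, SystemMemory, 0xdeadbeef, 0x1000 )
Field ( PRST, ByteAcc, NoLock, Preserve ) {
    PRS, 2
}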

> As with the other risks of collisions, Xen is going to have to search
> the system to find a free area to use.
> 
> >             Field ( PRST, ByteAcc, NoLock, Preserve ) {
> >                 PRS, 2
> >             }
> >             Method ( PRSC, 0 ) {
> >                 Store ( ToBuffer(PRS), Local0 )
> >                 Store ( DerefOf(Index(Local0, 0)), Local1 )
> >                 And ( Local1, 1, Local2 )
> >                 If ( LNotEqual(Local2, \_SB.VP00.FLG) ) {
> >                     Store ( Local2, \_SB.VP00.FLG )
> >                     If ( LEqual(Local2, 1) ) {
> >                         Notify ( VP00, 1 )
> >                         Subtract ( \_SB.MSU, 1, \_SB.MSU )
> >                     }
> >                     Else {
> >                         Notify ( VP00, 3 )
> >                         Add ( \_SB.MSU, 1, \_SB.MSU )
> >                     }
> >                 }
> >                 ShiftRight ( Local1, 1, Local1 )
> >                 And ( Local1, 1, Local2 )
> >                 If ( LNotEqual(Local2, \_SB.VP01.FLG) ) {
> >                     Store ( Local2, \_SB.VP01.FLG )
> >                     If ( LEqual(Local2, 1) ) {
> >                         Notify ( VP01, 1 )
> >                         Subtract ( \_SB.MSU, 1, \_SB.MSU )
> >                     }
> >                     Else {
> >                         Notify ( VP01, 3 )
> >                         Add ( \_SB.MSU, 1, \_SB.MSU )
> >                     }
> >                 }
> >                 Return ( One )
> >             }
> >         }
> >         Device ( \_SB.GPEX ) {
> >             Name ( _HID, "ACPI0006" )
> >             Name ( _UID, "XENGPE" )
> >             Name ( _CRS, ResourceTemplate() {
> >                 IO (Decode16, 0xafe0 , 0xafe0, 0x00, 0x4)
> >             } )
> >             Method ( _E02 ) {
> >                 \_SB.PRSC ()
> >             }
> >         }
> >     }
> >
> > Since the position of the XEN data memory area is not known, the
> > hypervisor will have to replace the address 0xdeadbeef with the actual
> > memory address where this structure has been copied. This will involve a
> > memory search of the AML code resulting from the compilation of the
> > above ASL snippet.
> 
> This is also slightly risky.  If we need to do this, can we get a
> relocation list from the compiled table from iasl?

I will look into ways to do this relocation. Jan's suggestion to compare two
different AML outputs seems feasible. I will also check whether iasl supports
something similar to relocations (although I think it doesn't).
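
If iasl turns out not to help, the brute-force search doesn't look too bad
either. A rough sketch of the patching (assuming the placeholder is a
DWordConst, i.e. the byte pattern 0x0c 0xef 0xbe 0xad 0xde, appearing exactly
once in the table; note the table checksum also needs recomputing):

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/*
 * Sketch only: replace the 0xdeadbeef placeholder in a compiled SSDT with
 * the real address.  In AML a DWordConst is encoded as the DWordPrefix
 * byte (0x0c) followed by the value in little-endian order.
 */
static int patch_ssdt_address(uint8_t *table, size_t len, uint32_t addr)
{
    static const uint8_t match[] = { 0x0c, 0xef, 0xbe, 0xad, 0xde };
    uint8_t sum = 0;
    size_t i;

    for ( i = 0; i + sizeof(match) <= len; i++ )
        if ( memcmp(&table[i], match, sizeof(match)) == 0 )
            break;
    if ( i + sizeof(match) > len )
        return -1; /* placeholder not found */

    /* Overwrite the 4-byte operand of the DWordConst, little-endian. */
    table[i + 1] = addr & 0xff;
    table[i + 2] = (addr >> 8) & 0xff;
    table[i + 3] = (addr >> 16) & 0xff;
    table[i + 4] = (addr >> 24) & 0xff;

    /* Recompute the checksum (byte 9 of the ACPI table header): the
     * whole table must sum to zero modulo 256. */
    table[9] = 0;
    for ( i = 0; i < len; i++ )
        sum += table[i];
    table[9] = (uint8_t)-sum;

    return 0;
}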

Thanks, Roger.
