[PATCH 3/3] Sphinx docs: Design Document for NUMA node affine claim sets
This design extends Xen's memory claim handling to support claim sets spanning multiple NUMA nodes. Roger Pau Monné described it as: Ideally, we would need to introduce a new hypercall that allows making claims from multiple nodes in a single locked region, as to ensure success or failure in an atomic way. This design documents this model in detail and is integrated into the Sphinx site below Hypervisor Guide -> Design Documents -> NUMA Claims. Signed-off-by: Bernhard Kaindl <bernhard.kaindl@xxxxxxxxxx> --- docs/designs/claims/accounting.rst | 331 +++++++++++++++ docs/designs/claims/design.rst | 243 +++++++++++ docs/designs/claims/development.rst | 197 +++++++++ docs/designs/claims/implementation.rst | 393 ++++++++++++++++++ docs/designs/claims/index.rst | 48 +++ docs/designs/claims/installation.rst | 70 ++++ docs/designs/claims/invariants.mmd | 35 ++ docs/designs/claims/performance.rst | 33 ++ docs/designs/claims/protection.rst | 200 +++++++++ docs/designs/claims/redeeming.rst | 71 ++++ docs/designs/claims/terminology.rst | 138 ++++++ docs/designs/claims/use-cases.rst | 39 ++ docs/designs/index.rst | 1 + docs/glossary.rst | 12 +- .../dom/DOMCTL_claim_memory-data.mmd | 43 ++ .../dom/DOMCTL_claim_memory-seqdia.mmd | 23 + .../dom/DOMCTL_claim_memory-workflow.mmd | 23 + docs/guest-guide/dom/DOMCTL_claim_memory.rst | 221 ++++++++++ docs/guest-guide/dom/index.rst | 14 + docs/guest-guide/index.rst | 23 + docs/guest-guide/mem/XENMEM_claim_pages.rst | 102 +++++ docs/guest-guide/mem/index.rst | 12 + 22 files changed, 2269 insertions(+), 3 deletions(-) create mode 100644 docs/designs/claims/accounting.rst create mode 100644 docs/designs/claims/design.rst create mode 100644 docs/designs/claims/development.rst create mode 100644 docs/designs/claims/implementation.rst create mode 100644 docs/designs/claims/index.rst create mode 100644 docs/designs/claims/installation.rst create mode 100644 docs/designs/claims/invariants.mmd create mode 100644 docs/designs/claims/performance.rst create mode 100644 docs/designs/claims/protection.rst create mode 100644 docs/designs/claims/redeeming.rst create mode 100644 docs/designs/claims/terminology.rst create mode 100644 docs/designs/claims/use-cases.rst create mode 100644 docs/guest-guide/dom/DOMCTL_claim_memory-data.mmd create mode 100644 docs/guest-guide/dom/DOMCTL_claim_memory-seqdia.mmd create mode 100644 docs/guest-guide/dom/DOMCTL_claim_memory-workflow.mmd create mode 100644 docs/guest-guide/dom/DOMCTL_claim_memory.rst create mode 100644 docs/guest-guide/dom/index.rst create mode 100644 docs/guest-guide/mem/XENMEM_claim_pages.rst create mode 100644 docs/guest-guide/mem/index.rst diff --git a/docs/designs/claims/accounting.rst b/docs/designs/claims/accounting.rst new file mode 100644 index 000000000000..cf0aad56a0a8 --- /dev/null +++ b/docs/designs/claims/accounting.rst @@ -0,0 +1,331 @@ +.. SPDX-License-Identifier: CC-BY-4.0 + +Accounting +########## + +.. contents:: Table of Contents + :local: + +.. note:: + Claims accounting state is only updated while holding the :c:var:`heap_lock`. + See :ref:`designs/claims/accounting:Locking of the claims state` for details + on the locks used to protect the claims accounting state. + +This section formalises the internal state and invariants that Xen must +maintain to ensure correctness. + + +For readers following the design in order, the preceding sections are: + +1. :doc:`/designs/claims/design` introduces the overall model and goals. +2. :doc:`/designs/claims/installation` explains how claim sets are installed. +3. 
:doc:`/designs/claims/protection` describes how claimed memory is protected + during allocation. +4. :doc:`/designs/claims/redeeming` explains how claims are redeemed when + allocations succeed. + +Overview +^^^^^^^^ + +.. table:: Table 1: Claims accounting: All accesses, Aggregate state, + and invariants protected by :c:var:`heap_lock`. + :widths: auto + + ============ =========================================== ======================= + Level Claims must be lower or equal to the available memory + ============ =========================================== ======================= + Total :c:var:`outstanding_claims` = :c:var:`total_avail_pages` = + + = Aggregate state: + SUM() over all domains: Aggregate state: + SUM(:c:member:`domain.outstanding_pages`) SUM(:c:var:`node_avail_pages`) + + Also, it is the sum of claims + over all nodes: + + = Aggregate state: + SUM(:c:expr:`node_outstanding_claims[*]`) + Node :c:expr:`node_outstanding_claims[node]` :c:expr:`node_avail_pages[node]` + + Aggregate state over all domains: Aggregate of the free + SUM(:c:expr:`domain.claims[node]`) lists of all zones on node + Dom per-node :c:member:`domain.node_claims` = + SUM(:c:expr:`domain.claims[node]`) :c:expr:`node_avail_pages[node]` + Total claims :c:member:`domain.outstanding_pages` :c:var:`total_avail_pages` + Memory limit :c:member:`domain.outstanding_pages` Invariant: must be + + :c:func:`domain_tot_pages` lower or equal to + :c:member:`domain.max_pages` + ============ =========================================== ======================= + + +Total claims and available memory +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + These variables tracking the total claims and available memory in the system + are aggregates of the actual per-node and per-domain values. + + + They are only maintained for efficient checks in the allocator hot paths, to + quickly determine if an allocation can be satisfied from unclaimed memory or + if further checks are needed to determine if the claims of the domain can be + used to free up memory for the allocation. This also ensures that the sum of + all claims never exceeds the total free memory in the system. + + + The number of unclaimed pages across all nodes in the system is derived as + :c:var:`total_avail_pages` minus :c:var:`outstanding_claims`. + This number is then used to: + + - Permit allocation requests if they can be satisfied from unclaimed pages. + - Ensure that the sum of all claims never exceeds the total free memory. + + .. c:var:: unsigned long total_avail_pages + + Total available pages in the system across all NUMA nodes. + It is the aggregate of the per-node available pages: + :c:var:`total_avail_pages` = SUM(:c:expr:`node_avail_pages[MAX_NUMNODES]`) + + .. c:var:: unsigned long outstanding_claims + + The total sum of all claims across all domains. + :c:var:`outstanding_claims` = + SUM(:c:var:`domain.outstanding_pages`) + +Per-node claims and available memory +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + .. c:var:: unsigned long node_avail_pages[MAX_NUMNODES] + + Available pages for each NUMA node, including both free and claimed pages. + This is used for validating that node claims do not exceed the available + memory on the respective NUMA node. + + .. c:var:: unsigned long node_outstanding_claims[MAX_NUMNODES] + + The total claims across all domains for each NUMA node, indexed by node + ID. This is maintained for efficient checks in the allocator hot paths. 
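+
+The derivation of unclaimed pages described above, together with the
+invariants these variables must satisfy, can be expressed as a short
+sketch. The helper names are hypothetical and for illustration only; the
+actual code performs these checks inline while holding :c:var:`heap_lock`:
+
+.. code:: C
+
+    /* Sketch: unclaimed pages per node and host-wide, with the
+     * corresponding invariants. Callers must hold heap_lock. */
+    static unsigned long node_unclaimed_pages(nodeid_t node)
+    {
+        /* Invariant: claims on a node never exceed its available pages. */
+        ASSERT(node_outstanding_claims[node] <= node_avail_pages[node]);
+        return node_avail_pages[node] - node_outstanding_claims[node];
+    }
+
+    static unsigned long total_unclaimed_pages(void)
+    {
+        /* Invariant: total claims never exceed total available pages. */
+        ASSERT(outstanding_claims <= total_avail_pages);
+        return total_avail_pages - outstanding_claims;
+    }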
+ +This diagram illustrates the claims accounting state and the invariants: + +Accounting diagram +^^^^^^^^^^^^^^^^^^ + + .. mermaid:: invariants.mmd + :caption: Diagram: Claims accounting state and invariants + +Claims accounting state for each domain +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + .. c:struct:: domain + + The main structure representing a domain in Xen. It includes the + claims accounting state for the domain, including both host-wide + and node-specific claims, as well as the maximum page limits for the + domain and the lock protecting the domain's page allocation counts. + + While the domain's page counts are currently `unsigned int`, work is + underway to change them to `unsigned long` to support larger page counts + beyond 16 TB. The code is already designed to anticipate this change and + work with either `unsigned int` or `unsigned long` page counts equally well. + + .. c:member:: unsigned int outstanding_pages + + The domain's total claim, representing the number of pages claimed + for the domain. + + .. c:member:: unsigned int node_claims + + The total of the domain's node-affine claims, maintained for efficient + checks in the allocator hot paths without needing to sum over the + per-node claims each time. It is equal to the sum of + :c:expr:`claims[MAX_NUMNODES]` for all nodes. + + .. c:member:: unsigned int claims[MAX_NUMNODES] + + The domain's claims for each :term:`NUMA node`, indexed by node ID. + + As the storage for ``struct`` :c:struct:`domain` is allocated using a + dedicated page for each domain, this array allows for efficient and + fast storage with direct indexing, without consuming any additional + memory for an extra allocation. + + + The claims for each node are used for NUMA-affine domains to specify + the amount of memory claimed for each node, to ensure that the domain's + claims for each node do not exceed the available memory on that node, + and to allow the allocator to redeem claims from the appropriate nodes + when allocating memory for the domain. + + .. literalinclude:: ../../../xen/common/domain.c + :language: C + :caption: Allocation of the domain structure in ``xen/common/domain.c`` + :start-at: alloc_domain_struct + :end-at: } + :emphasize-lines: 7, 12, 14 + :linenos: + :lineno-match: + + The page allocated for ``struct`` :c:struct:`domain` is large enough + to accommodate this array several times, even beyond the current + :c:macro:`MAX_NUMNODES` limit of 64. It should be sufficient even for + future expansion of the maximum number of supported NUMA nodes if + needed. The allocation has a build-time assertion for safety to ensure + that ``struct`` :c:struct:`domain` fits within the allocated page. + + + The sum of these claims is stored in :c:member:`domain.node_claims` + for efficient checks in the allocator hot paths which need to know + the total number of node claims for the :term:`domain`. + + .. c:member:: unsigned int max_pages + + The maximum number of pages the domain is allowed to claim, set at + domain creation time. + + .. c:member:: rspinlock_t page_alloc_lock + + Lock for checking :c:func:`domain_tot_pages` on top of new claims + against :c:member:`domain.max_pages` when installing these new claims. + This is a recursive spinlock to allow for nested calls into the allocator + while holding it, such as when redeeming claims during page allocation. 
+ It is taken before :c:var:`heap_lock` when installing claims to ensure a + consistent locking order and must not be taken while holding + :c:var:`heap_lock` to avoid deadlocks. + + .. c:member:: nodemask_t node_affinity + + A :c:type:`nodemask_t` representing the set of NUMA nodes the domain + is affine to. This is used for efficient checks in the allocator hot + paths to quickly get the set of nodes a domain is affine to for + memory allocation decisions. + +Claims accounting invariants +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + Xen must maintain the following invariants at all times to ensure correctness + of claims accounting: + + - For all claims, including node-affine and host-wide claims: + :c:var:`outstanding_claims` :math:`\le` :c:var:`total_avail_pages` + + - For node-specific claims: + :c:expr:`node_outstanding_claims[alloc_node]` :math:`\le` + :c:expr:`node_avail_pages[alloc_node]` + + - For a domain's overall claims: + :c:var:`domain.outstanding_pages` + + :c:var:`domain_tot_pages` :math:`\le` :c:var:`domain.max_pages` + + See :doc:`redeeming` for more information on this invariant. + +Constants +^^^^^^^^^ + + .. c:macro:: MAX_NUMNODES + + The maximum number of NUMA nodes supported by Xen. Used for validating + node IDs in the :c:type:`memory_claim_t` entries of claim sets. + When Xen is built without NUMA support, it is 1. + + The default on x86_64 is 64 which is sufficient for current hardware and + allows for efficient storage of e.g. the :c:var:`node_online_map` for + online nodes and :c:member:`domain.node_affinity` in a single 64-bit value, + and in the :c:expr:`domain.claims[MAX_NUMNODES]` array. + + ``xen/arch/Kconfig`` limits the maximum number of NUMA nodes to 64. While + Xen can be compiled for up to 254 nodes, configuring machines to split + the installed memory into more than 64 nodes would be unusual. + For example, dual-socket servers, even when using multiple chips per CPU + package should typically be configured for 2 NUMA nodes by default. + + .. c:var:: nodemask_t node_online_map + + A bitmap representing which NUMA nodes are currently online in the system. + This is used for validating that claims are only made for online nodes and + for efficient checks in the allocator hot paths to quickly determine which + nodes are online. Currently, Xen does not support hotplug of NUMA nodes, + so this is set at boot time based on the platform firmware configuration + and does not change at runtime. + +Types +^^^^^ + + .. c:type:: uint8_t nodeid_t + + Type for :term:`NUMA node` IDs. It is passed to Xenctrl using the + :c:var:`mem_flags` argument of :c:func:`xc_domain_populate_physmap()` + and passed to Xen in this form. + + It allocates 8 bits in the flags for the node ID, which limits the + theoretical maximum value of :c:macro:`CONFIG_NR_NUMA_NODES` at 254 + (255 is :c:macro:`NUMA_NO_NODE`), which is far beyond the current + maximum of 64 supported by Xen and should be sufficient for all + practical purposes. This also allows for efficient storage of NUMA + nodes in arrays indexed by node ID and in :c:type:`nodemask_t` bitmaps + :c:var:`node_online_map` and :c:member:`domain.node_affinity` for + efficient checks in the allocator hot paths. + + .. c:type:: nodemask_t + + A bitmap representing a set of NUMA nodes, used for status information + like :c:var:`node_online_map` and the :c:member:`domain.node_affinity`, + and to track which nodes are online and which nodes are in a domain's + node affinity. + +Memflags +^^^^^^^^ + + .. 
c:type:: memflags + + Flags for memory allocation requests that can affect the allocation + behaviour, such as node preference and whether the request is for an + exact node. + + .. c:macro:: MEMF_no_owner + + Flag for memory allocation requests to indicate that the allocation + shall not be owned by a domain, and as part of that, + :c:macro:`MEMF_no_refcount` is also set. + + .. c:macro:: MEMF_no_refcount + + Flag for memory allocation requests to indicate that the request is not + reference-counted to a domain's memory allocation state, and as part of + that, claims of a domain cannot be used to protect and redeem the + allocation using claims. This is used for requests which are not for + domains or which explicitly bypass reference-counting for other reasons. + + .. c:macro:: MEMF_no_scrub + + Flag for memory allocation requests to indicate that the allocated memory + should not be scrubbed (zeroed) before being used. This is used for + performance reasons for certain types of allocations where the caller + guarantees that the memory will be properly initialized before use. + +Locking of the claims state +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + .. :c:member:: domain.page_alloc_lock + + If :c:var:`domain.page_alloc_lock` is needed, e.g. to check + :c:func:`domain_tot_pages` on top of new claims against + :c:var:`domain.max_pages` for the domain, it needs to be taken before + :c:var:`heap_lock` for consistent locking order to avoid deadlocks. + + .. c:var:: spinlock_t heap_lock + + Lock for all heap operations including claims. It protects the claims + state and invariants from concurrent updates and ensures that checks + in the allocator hot paths see a consistent view of the claims state. + +Helper functions +^^^^^^^^^^^^^^^^ + + .. c:function:: inline unsigned int domain_tot_pages(struct domain *d) + + :param d: The domain for which to calculate the total pages. + :type d: struct domain * + :returns: The total pages allocated to the domain. + + This function is used for validating that an allocation and the domain's + claims do not exceed :c:member:`domain.max_pages`. diff --git a/docs/designs/claims/design.rst b/docs/designs/claims/design.rst new file mode 100644 index 000000000000..882dc3c5c1f1 --- /dev/null +++ b/docs/designs/claims/design.rst @@ -0,0 +1,243 @@ +.. SPDX-License-Identifier: CC-BY-4.0 + +############# +Claims Design +############# + +.. contents:: Table of Contents + :backlinks: entry + :local: + +************ +Introduction +************ + +Xen's page allocator supports a :term:`claims` API that allows privileged +:term:`domain builders` to reserve a quantity of available memory before +:term:`populating` the :term:`guest physical memory` of new :term:`domains` +they are creating, configuring, and building. + +These reservations are called :term:`claims`. They ensure that the claimed +memory remains available for the :term:`domains` when allocating it, even if +other :term:`domains` are allocating memory at the same time. + +:term:`Installing claims` is a privileged operation performed by +:term:`domain builders` before they populate the :term:`guest physical memory`. +This prevents other :term:`domains` from allocating memory earmarked for +:term:`domains` under construction. Xen maintains the per-domain claim state +for pages that are claimed but not yet allocated. + +When claim installation succeeds, Xen updates the claim state to reflect the +new targets and protects the claimed memory until it is allocated or the claim +is released. 
As Xen allocates pages for the domain, claims are redeemed by
+reducing the claim state by the size of each allocation.
+
+************
+Design Goals
+************
+
+The design's primary goals are:
+
+1. Allow :term:`domain builders` to claim memory
+   on multiple :term:`NUMA nodes` using a :term:`claim set` atomically.
+
+2. Preserve the existing :c:macro:`XENMEM_claim_pages` hypercall command
+   for compatibility with existing :term:`domain builders` and its legacy
+   semantics, while introducing a new, unrestricted hypercall command for
+   new use cases such as NUMA-aware claim sets.
+
+3. Support host-wide claims, both for compatibility with existing
+   :term:`domain builders` and for use cases where a flexible claim that
+   exists at the level of the host is desirable.
+
+   This means the host's global outstanding claims count is not obsolete:
+   it must still be maintained to account for such host-wide claims.
+
+4. Use fast allocation-time claims protection in the allocator's hot paths
+   to protect claimed memory from parallel allocations by other domain
+   builders during parallel domain builds, and from all other allocations
+   as well.
+
+***************
+Design Overview
+***************
+
+The legacy :c:macro:`XENMEM_claim_pages` hypercall is superseded by
+:c:macro:`XEN_DOMCTL_claim_memory`. This hypercall installs a :term:`claim set`:
+an array of :c:type:`memory_claim_t` entries, where each entry specifies
+a page count and a target, either a specific NUMA node ID or a selector.
+
+Like legacy claims, claim sets are validated and installed under
+:c:member:`domain.page_alloc_lock` and :c:var:`heap_lock`: either the entire
+set is accepted, or the request fails with no side effects. Repeated calls
+to install claims replace any existing claims for the domain rather than
+accumulating.
+
+As installing claim sets after allocations is not a supported use case,
+the legacy behaviour of subtracting existing allocations from installed
+claims is surprising and counterintuitive, and page exchanges make
+incremental tracking of already-allocated pages on a per-node basis
+difficult. Therefore, claim sets do not retain the legacy behaviour of
+subtracting existing allocations, optionally per node, from the installed
+claims across the individual claim set entries.
+
+Summary:
+
+- Legacy domain builders can continue to use the previous (now deprecated)
+  :c:expr:`XENMEM_claim_pages` hypercall command to install legacy claims.
+
+- Updated domain builders can take advantage of claim sets to install
+  NUMA-aware :term:`claims` on multiple :term:`NUMA nodes` and/or claims
+  that are not bound to specific nodes. Claim sets have more intuitive
+  semantics that do not subtract existing allocations from the installed
+  claims. Such semantics are also simpler to understand and maintain, and
+  are not affected by the complexity of tracking existing allocations on
+  a per-node basis across page exchanges happening concurrently with claim
+  installation for new domains under construction.
+
+For readers following the design in order, the next sections cover the
+following topics:
+
+1. :doc:`/designs/claims/installation` explains how claim sets are installed.
+2. :doc:`/designs/claims/protection` describes how claimed memory is
+   protected during allocation.
+3. :doc:`/designs/claims/redeeming` explains how claims are redeemed as
+   allocations succeed.
+4. 
:doc:`/designs/claims/accounting` describes the accounting model that + underpins those steps. + +******************** +Key design decisions +******************** + +.. glossary:: + + :c:expr:`node_outstanding_claims[MAX_NUMNODES]` + Tracks the sum of all claims on a node. :c:func:`get_free_buddy()` checks + it before scanning zones on a node, so claimed memory is protected from + other allocations. + + :c:expr:`redeem_claims_for_allocation()` + When allocating memory for a domain, the page allocator redeems the matching + claims for this allocation, ensuring the domain's total memory allocation as + :c:func:`domain_tot_pages` plus :c:member:`domain.outstanding_pages` remain + within the domain's limits, defined by :c:member:`domain.max_pages`. + See :doc:`redeeming` for details on redeeming claims. + + :c:expr:`domain.outstanding_pages` + It remains the authoritative source for the total outstanding claims of a + domain, and is updated on claim installation and redemption. It includes + both host-wide claims and node-specific claims. + Support for :term:`host-wide claims` is maintained for two reasons: first, + for compatibility with existing domain builders, and second, for use cases + where a flexible claim that can be satisfied from any node is desirable. + + When the preferred NUMA node(s) for a domain do not have sufficient free + memory to satisfy the domain's memory requirements, host-wide claims provide + a flexible fallback for the memory shortfall from the preferred node(s) that + can be satisfied from any available node. + + In this case, :term:`domain builders` can use a combination of passing + the preferred node to :c:func:`xc_domain_populate_physmap()` and + :term:`NUMA node affinity` to steer allocations towards the preferred + NUMA node(s), while letting host-wide claims ensure that the shortfall + is available. + + This allows the domain builder to define a set of desired NUMA nodes to + allocate from and even specify which nodes to prefer for an allocation, + but the claim for the shortfall is flexible, not specific to any node. + +********* +Non-goals +********* + +Using per-node allocator data +============================= + +Some data structures could be moved into the per-node allocator data +allocated by `init_node_heap()` to avoid bouncing those data structures +between nodes. Those can be moved to the per-node allocator data in the +future, but that is not a priority. While that would reduce this bouncing, +it would not eliminate the need to take the global :c:var:`heap_lock`, +which is still needed to protect the allocator's state during allocation +and freeing of pages. + +The synchronisation point for taking the global :c:var:`heap_lock` is +the main point of contention during allocation, freeing and scrubbing +pages. The overhead of accessing the per-node claims accounting data +is expected to be minimal. + +Avoiding the :c:var:`heap_lock` would be difficult to achieve as it +would require updating the page allocator to maintain atomic updates +of a new ``total_unclaimed_pages`` counter, which would be decremented +on allocation and claims installation and incremented on freeing of +pages and claims, and to check that counter in the hot path of the +allocator to protect claimed memory from other allocations. + +However, we aim to move that data into the per-node allocator data in the +future to reduce the need to bounce those data structures between nodes. 
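+
+As a sketch of that rejected lock-free alternative (hypothetical, using C11
+atomics for illustration; Xen's actual accounting stays under
+:c:var:`heap_lock`):
+
+.. code:: C
+
+    #include <stdatomic.h>
+    #include <stdbool.h>
+
+    /* Hypothetical: a single atomic counter of unclaimed pages, checked
+     * and decremented in the allocation hot path without heap_lock. */
+    static _Atomic unsigned long total_unclaimed_pages;
+
+    static bool reserve_unclaimed(unsigned long pages)
+    {
+        unsigned long old = atomic_load(&total_unclaimed_pages);
+
+        /* CAS loop: claim 'pages' from the unclaimed pool, or fail. */
+        while (old >= pages)
+            if (atomic_compare_exchange_weak(&total_unclaimed_pages,
+                                             &old, old - pages))
+                return true;
+        return false;
+    }
+
+Even with such a counter, the per-node and per-domain invariants would still
+need :c:var:`heap_lock`, which is one reason this approach was not pursued.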
+ +Legacy behaviours +================= + +Installing claims is a privileged operation performed by domain builders +before they populate guest memory. As such, tracking previous allocations +is not in scope for claims. + +For the following reasons, claim sets do not retain the legacy behaviour +of subtracting existing allocations from installed claims: + +- Xen does not currently maintain a ``d->node_tot_pages[node]`` count, + and the hypercall to exchange extents of memory with new memory makes + such accounting relatively complicated. + +- The legacy behaviour is somewhat surprising and counterintuitive. + Because installing claims after allocations is not a supported use case, + subtracting existing allocations at installation time is unnecessary. + +- Claim sets are a new API and can provide more intuitive semantics + without subtracting existing allocations from installed claims. This + also simplifies the implementation and makes it easier to maintain. + +Versioned hypercall +=================== + +The :term:`domain builders` using the :c:macro:`XEN_DOMCTL_claim_memory` +hypercall also need to use other version-controlled hypercalls which +are wrapped through the :term:`libxenctrl` library. + +Wrapping this call in :term:`libxenctrl` is therefore a practical approach; +otherwise, we would have a mix of version-controlled and unversioned +:term:`hypercalls`, which could be confusing for API users and for future +maintenance. + +From the domain builders' viewpoint, it is more consistent to expose +the claims :term:`hypercalls` in the same way as the other calls they use. + +Stable interfaces also have drawbacks: with stable syscalls, Linux needs +to maintain the old interface indefinitely, which can be a maintenance burden +and can limit the ability to make improvements or changes to the interface +in the future. Linux carries many system call successor families, e.g., +``oldstat``, ``stat``, ``newstat``, ``stat64``, ``fstatat``, ``statx``, +with similar examples including ``openat``, ``openat2``, ``clone3``, ``dup3``, +``waitid``, ``mmap2``, ``epoll_create1``, ``pselect6`` and many more. +Glibc hides that complexity from users by providing a consistent API, but it +still needs to maintain the old system calls for compatibility. + +In contrast, versioned :term:`hypercalls` allow for more flexibility and +evolution of the API while still providing a clear path to adopt new features. +The reserved fields and reserved bits in the structures of this hypercall +allow for many future extensions without breaking existing callers. + +***************** +Future extensions +***************** + +The reserved fields and bits in the structures of this +hypercall allow for many future extensions without breaking existing callers. + +Future extensions could include support for claims on superpages, claims for +requests with :c:macro:`MEMF_no_refcount`, which allocate P2M, HAP and so on. + +See :ref:`designs/claims/protection:Callers using MEMF_no_refcount` +for more information. diff --git a/docs/designs/claims/development.rst b/docs/designs/claims/development.rst new file mode 100644 index 000000000000..c4805b2e080d --- /dev/null +++ b/docs/designs/claims/development.rst @@ -0,0 +1,197 @@ +.. SPDX-License-Identifier: CC-BY-4.0 + +Development +########### + +.. 
+.. note::
+
+   This section provides historical context on the development
+   of NUMA-aware claims, including previous implementations and
+   feedback received, to give a better understanding of the
+   design decisions made in the current implementation.
+
+Version history
+---------------
+
+The initial `implementation of single-node claims <v1_>`_ by Alejandro Vallejo
+used the legacy claims hypercall :c:macro:`XENMEM_claim_pages` and passed a
+NUMA node in the existing NUMA node bits of
+:c:expr:`xen_memory_reservation.mem_flags`. This added the member
+``d->claim_node`` to ``struct`` :c:struct:`domain`, which defined the target
+node for the domain's claims.
+
+.. epigraph::
+
+   Roger Pau Monné reviewed it and proposed an `initial multi-node claim-sets
+   specification <v1m_>`_ that inspired this design:
+
+   The interface here seems to be focused on domains only being allowed to
+   allocate from a single node, or otherwise you must first allocate memory
+   from a node before moving to the next one (which defeats the purpose of
+   claims?).
+
+   I think we want to instead convert ``d->outstanding_pages``
+   into a per-node array, so that a domain can have outstanding
+   claims for multiple NUMA nodes?
+
+   The hypercall interface becomes a bit awkward then, as the toolstack has
+   to perform a different hypercall for each memory claim from a different
+   node (and rollback in case of failure). Ideally we would need to introduce
+   a new hypercall that allows making claims from multiple nodes in a single
+   locked region, as to ensure success or failure in an atomic way.
+
+   -- Roger Pau Monné
+
+This led to the `v2 <v2_>`_ and `v3 <v3_>`_ series, adding a new hypercall
+API designed around passing an array of claims. This allowed for a more
+flexible claim set design targeting multiple NUMA nodes and host-wide claims,
+but only supported a single claim per domain at that time.
+
+.. sidebar:: Feedback and suggestions for multi-node claim sets
+
+   The initial implementations of single-node claims received
+   feedback from the community, with multiple suggestions to
+   extend the API to support `multi-node claim sets <v1m_>`_.
+   This feedback highlighted the need for a more flexible
+   design that could accommodate claims on multiple NUMA nodes.
+
+Between v3 and v4, `Roger Pau Monné and Andrew Cooper developed and merged
+several critical fixes <fix1_>`_ for Xen's overall claims implementation.
+These fixes also allowed Roger to improve the implementation for redeeming
+claims during domain memory allocation. With a further suggestion by
+Bernhard Kaindl, this enabled a fully working implementation that protected
+claimed memory against parallel allocations by other domain builders.
+
+v4 series
+   With the `v4 series <v4_>`_, we submitted the combined work that completed
+   the fixes for protecting claimed memory on NUMA nodes. The review process
+   indicated that supporting multiple claim sets would require a
+   `redesign <v4-03_>`_ of claim installation and management, which led to
+   this design document.
+
+v5 series
+   The `v5 series <v5_>`_ implemented the `Claim Sets Design Version 1 <d1_>`_
+   with support for multiple claim records per domain. It introduced the
+   terminology of "consuming claims" for redeeming claims during domain
+   memory allocation, and "retiring claims" for the low-level action of
+   reducing the number of claimed pages: when redeeming claims for an
+   allocation, when destroying a domain, or when claims must be recalled
+   because memory is offlined while all memory is claimed. Recalling claims
+   is needed to maintain the invariant that claimed memory can never be
+   larger than free memory.
+
+v6 series
+   The `v6 series <v6_>`_ implemented the `Claim Sets Design Version 2 <d2_>`_.
+   The only difference between the two versions is that with design version 2,
+   the initial term `"consuming claims"` was changed to `"redeeming claims"`
+   and the term `"retiring claims"` was changed to `"deducting claims"`.
+
+v7 series
+---------
+
+The v7 series will implement the `Claim Sets Design Version 3 <d3_>`_ or newer
+with further improvements to the design and implementation:
+
+1. As the code often needs the total sum of claims of a domain, this update
+   keeps :c:member:`domain.outstanding_pages` as the total sum of outstanding
+   claims of a domain. This obsoletes the former ``d->global_claims``, which
+   only tracked the unbound claims that were not affine to a NUMA node.
+
+2. Avoid code duplication: replace :c:func:`domain_set_outstanding_pages()`,
+   which handled the legacy claims hypercall :c:macro:`XENMEM_claim_pages`.
+   The new claim sets hypercall handler :c:func:`domain_install_claim_set()`
+   integrates installing legacy claims for backwards compatibility.
+   The former :c:func:`domain_set_outstanding_pages()` is removed to
+   avoid duplicating the logic of installing claims in both places.
+
+3. Improve the clarity of function and variable names. For example, the new
+   hypercall handler :c:func:`domain_install_claim_set()` is more descriptive
+   of its purpose than the former :c:func:`domain_set_outstanding_pages()`:
+
+   ================================== =======================================
+   Former function name               New function name
+   ================================== =======================================
+   ``domain_set_outstanding_pages()`` :func:`domain_set_claim_entries()`,
+                                      :func:`domain_get_claim_entries()`
+   ``deduct_global_claims()``         :func:`domain_release_host_claims()`
+   ``deduct_node_claims()``           :func:`domain_release_node_claims()`
+   ================================== =======================================
+
+Testing
+-------
+
+The basis of the `v4 series <v4_>`_ is included in the XenServer XS9 preview
+release. Besides functional product testing, it has been validated to meet
+customers' performance expectations for improved NUMA placement.
+
+With the `v6 series <v6_>`_, a comprehensive set of functional system tests
+was added to the submission. Also, `a separate host-side integration test
+suite <tv2_>`_ for validating the `v6 series <v6_>`_ was posted.
+
+Further development
+-------------------
+
+Based on review feedback, there is a desire to normalise the page counts of
+the page allocator to ``unsigned long``. A `first patch <u1_>`_ in this
+direction was posted to normalise the types of :c:var:`total_avail_pages` and
+:c:var:`outstanding_claims` to ``unsigned long`` in the page allocator.
+ +Acknowledgements +---------------- + +The claim sets design builds on the single-node claims implementation +described above and the feedback it generated. The following people +should be acknowledged for their contributions: + +- *Edwin Török* for developing the `initial best-effort NUMA placement + feature in the XAPI toolstack <xapi_>`_, which inspired the initial + implementation of NUMA-aware claims, and his work in productizing and + validating the integration of NUMA claims with the XAPI toolstack. + +- *Alejandro Vallejo* for starting the development of the NUMA claims series. + +- *Jan Beulich* for providing review suggestions that led to many improvements. + +- *Roger Pau Monné* for reviewing the initial implementation, `proposing + the initial multi-node claim-sets specification <_v1>`_, developing and + merging `critical fixes <fix1_>`_ upstream that enabled product-quality + support for single-node claims which is the basis of the multi-node + claim sets implementation. + +- *Andrew Cooper* for integrating and validating the work internally, + helping to stabilise and productise the single-node implementation. + +- *Bernhard Kaindl* for collaborating on the single-node implementation, + developing the claim sets hypercall since version 2, designing and + implementing the multi-node claim sets design, the functional system-level + test suite and the host-side integration test suite for validating the + claim sets implementation. + +- *Marcus Granado* for leading the development effort inside XenServer for + productising the single-node claims implementation, for providing feedback + and suggestions for improving the design and implementation. This included + coordinating the work of multiple contributors and stakeholders, integrating + the work into XenServer products and ensuring it meets customer requirements. + +.. _xapi: https://xapi-project.github.io/new-docs/toolstack/features/NUMA +.. _fix1: + https://lists.xenproject.org/archives/html/xen-devel/2026-01/msg00164.html +.. _v1: + https://patchew.org/Xen/20250314172502.53498-1-alejandro.vallejo@xxxxxxxxx/ +.. _v1m: + https://lists.xenproject.org/archives/html/xen-devel/2025-06/msg00484.html +.. _v2: https://lists.xen.org/archives/html/xen-devel/2025-08/msg01076.html +.. _v3: https://patchew.org/Xen/cover.1757261045.git.bernhard.kaindl@xxxxxxxxx/ +.. _v4: + https://lists.xenproject.org/archives/html/xen-devel/2026-02/msg01387.html +.. _v4-03: https://patchwork.kernel.org/project/xen-devel/ + patch/6927e45bf7c2ce56b8849c16a2024edb86034358.1772098423 + .git.bernhard.kaindl@xxxxxxxxxx/ +.. _d1: + https://bernhard-xen.readthedocs.io/en/claim-sets-v1-design/designs/claims +.. _d2: + https://bernhard-xen.readthedocs.io/en/claim-sets-v2-design/designs/claims +.. _v5: https://patchwork.kernel.org/project/xen-devel/list/?series=1078053 +.. _v6: https://patchwork.kernel.org/project/xen-devel/list/?series=1081139 +.. _tv2: https://patchwork.kernel.org/project/xen-devel/list/?series=1083329 +.. _u1: https://patchwork.kernel.org/project/xen-devel/list/?series=1084344 diff --git a/docs/designs/claims/implementation.rst b/docs/designs/claims/implementation.rst new file mode 100644 index 000000000000..ed8ed82877af --- /dev/null +++ b/docs/designs/claims/implementation.rst @@ -0,0 +1,393 @@ +.. SPDX-License-Identifier: CC-BY-4.0 + +Implementation +############## + +.. contents:: Table of Contents + :backlinks: entry + :local: + +.. 
note:: This part describes implementation details of claims and their + interaction with memory allocation in Xen. It covers the functions and + data structures involved in :term:`installing claims` and allocating memory + with :term:`claims`. + +Functions related to the implementation of claims and their interaction +with memory allocation. + +********************** +Installation of claims +********************** + +This section describes the functions and data structures involved in +:term:`installing claims` for domains, and the internal functions for +validating and installing claim sets. + + .. c:function:: int domain_set_outstanding_pages(domain, pages) + + This function is replaced by :c:func:`domain_set_claim_entries()`. + + .. c:function:: int domain_set_claim_entries(domain, nr_entries, claim_set) + + :param domain: The domain for which to set the node claims + :param nr_entries: The number of claims in the claim set + :param claim_set: The claim set to install for the domain + :type domain: struct domain * + :type nr_entries: unsigned int + :type claim_set: memory_claim_t * + :returns: 0 on success, or a negative error code on failure. + + Handles :term:`installing claim sets`. It performs validation of the + :term:`claim set` and updates the domain's claims accordingly. + + The function works in four phases: + + 1. Validate claim entries and check node-specific claims availability + 2. Validate the host-wide request against the remaining availability + 3. Reset any current claims of the domain + 4. Install the claim set as the domain's claiming state + + Phase 1 checks claim entries for validity and memory availability: + + 5. Target must be :c:macro:`XEN_DOMCTL_CLAIM_MEMORY_TOTAL` or a node. + 6. Each target node may only appear once in the claim set. + 7. For node-specific claims, requested pages must not exceed the + available memory on that node after accounting for existing claims. + 8. The explicit padding field must be zero for forward compatibility. + + Phase 2 checks: + + 9. The total sum of the requested pages must not exceed the total + unclaimed memory of the host after accounting for existing claims. + 10. The claims must not exceed the :c:member:`domain.max_pages` limit. + See :doc:`accounting` and :doc:`redeeming` for the accounting + checks that enforce the domain's :c:member:`domain.max_pages` limit. + + .. versionadded:: claims-v5 + + .. c:function:: int domain_get_claim_entries(domain, nr_entries, claim_set) + + :param domain: The domain for which to retrieve a claim set + :param nr_entries: The number of claims in the claim set + :param claim_set: The preallocated buffer for up to nr_entries claim entries + :type domain: struct domain * + :type nr_entries: unsigned int * + :type claim_set: memory_claim_t * + :returns: 0 on success with nr_entries updated to the number of claims + written to the buffer, or a negative error code on failure. + + Retrieves a claim set for the current claims of the domain and writes + it to the provided buffer. The number of claims written to the buffer + is stored in the variable pointed to by ``nr_entries``. + + ``nr_entries`` specifies the size of the provided buffer for claim + entries, and the function writes up to that many claim entries to + the buffer. If the buffer is too small to hold all claim entries, + the function returns -:c:macro:`ERANGE` and updates ``nr_entries`` + to the number of entries needed to hold all claim entries. + + .. 
+      .. versionadded:: claims-v7
+
+************************************
+Helper functions for managing claims
+************************************
+
+   .. c:function:: unsigned long domain_release_host_claims(domain, release)
+
+      :param domain: The domain for which to release host-wide claims
+      :param release: The number of pages to release
+      :type domain: struct domain *
+      :type release: unsigned long
+      :returns: The number of host-wide pages actually deducted from the domain.
+
+      This function releases the specified number of host-wide claims.
+      It limits the release to the number of host-wide claims actually held by
+      the domain and updates the overall claim state accordingly.
+
+      .. versionadded:: claims-v4
+
+   .. c:function:: unsigned long domain_release_node_claims(domain, node, release)
+
+      :param domain: The domain for which to release the node claims
+      :param node: The node for which to release the claim
+      :param release: The number of pages to release from the claim
+      :type domain: struct domain *
+      :type node: nodeid_t
+      :type release: unsigned long
+      :returns: The number of pages actually deducted from the domain's claim.
+
+      This function deducts a specified number of pages from a domain's
+      claim on a specific node. It limits the release to the number of
+      pages actually claimed by the domain on that node, and updates the
+      domain's node-local claims as well as the host-wide and node-specific
+      claim state accordingly.
+
+      .. versionadded:: claims-v5
+
+   .. c:function:: void domain_recall_node_claims(domain, recall)
+
+      :param domain: The domain for which to recall node claims
+      :param recall: The number of node-specific pages to recall
+      :type domain: struct domain *
+      :type recall: unsigned long
+
+      This function recalls the specified number of node-specific claims
+      from the domain and updates the overall claim state accordingly.
+
+      It iterates over the domain's node-specific claims, calling
+      :c:func:`domain_release_node_claims()` to release pages from the node
+      claims until the specified number of pages has been recalled or all
+      node-specific claims have been exhausted.
+
+      This function is used to recall node-specific claims from a domain when
+      offlining memory or when pages for a domain are allocated on nodes
+      other than the claimed node.
+
+      .. versionadded:: claims-v5
+
+**********************
+Allocation with claims
+**********************
+
+The functions below play a key role in allocating memory for domains.
+
+   .. c:function:: int xc_domain_populate_physmap(xch, domid, extents, order, \
+                   mem_flags, extent_start)
+
+      :param xch: The :term:`libxenctrl` interface
+      :param domid: The ID of the domain
+      :param extents: Number of extents
+      :param order: Order of the extents
+      :param mem_flags: Allocation flags
+      :param extent_start: Starting PFN
+      :type xch: xc_interface *
+      :type domid: uint32_t
+      :type extents: unsigned long
+      :type order: unsigned int
+      :type mem_flags: unsigned int
+      :type extent_start: xen_pfn_t *
+      :returns: 0 on success, or a negative error code on failure.
+
+      This function is a wrapper for the ``XENMEM_populate_physmap`` hypercall,
+      which is handled by the :c:func:`populate_physmap()` function in the
+      hypervisor. It is used by :term:`libxenguest` for populating the
+      :term:`guest physical memory`. 
:term:`domain builders` can + set the :term:`NUMA node affinity` and pass the preferred node to this + function to steer allocations towards the preferred NUMA node(s) and let + :term:`claims` ensure that the memory will be available even in cases + of :term:`parallel domain builds` where multiple domains are being built + at the same time. + +The :term:`meminit` API calls :c:func:`xc_domain_populate_physmap()` +for populating the :term:`guest physical memory`. It invokes the restartable +``XENMEM_populate_physmap`` hypercall implemented by +:c:func:`populate_physmap()`. + +.. c:function:: void populate_physmap(struct memop_args *a) + + :param a: Provides status and hypercall restart info + :type a: struct memop_args * + + Allocates memory for building a domain and uses it for populating the + :term:`physmap`. For allocation, it uses + :c:func:`alloc_domheap_pages()`, which forwards the request to + :c:func:`alloc_heap_pages()`. + + During domain creation, it adds the :c:macro:`MEMF_no_scrub` flag to the request + for populating the :term:`physmap` to optimise domain startup by allowing + the use of unscrubbed pages. + + When that happens, it scrubs the pages as needed using hypercall + continuation to avoid long hypercall latency and watchdog timeouts. + + Domain builders can optimise on-demand scrubbing by running + :term:`physmap` population pinned to the domain's NUMA node, + keeping scrubbing local and avoiding cross-node traffic. + +.. c:function:: struct page_info *alloc_heap_pages(unsigned int zone_lo, \ + unsigned int zone_hi, \ + unsigned int order, \ + unsigned int memflags, \ + struct domain *d) + + :param zone_lo: The lowest zone index to consider for allocation + :param zone_hi: The highest zone index to consider for allocation + :param order: The order of the pages to allocate (2^order pages) + :param memflags: Memory allocation flags that may affect the allocation + :param d: The domain for which to allocate memory or NULL + :type zone_lo: unsigned int + :type zone_hi: unsigned int + :type order: unsigned int + :type memflags: unsigned int + :type d: struct domain * + :returns: The allocated page_info structure, or NULL on failure + + This function allocates a contiguous block of pages from the heap. + It checks claims and available memory before attempting the + allocation. On success, it updates relevant counters and redeems + claims as necessary. + + It first checks whether the request can be satisfied given the domain's + claims and available memory using :c:func:`claims_permit_request()`. + If claims and availability permit the request, it calls + :c:func:`get_free_buddy()` to find a suitable block of free pages + while respecting node and zone constraints. + + Simplified pseudocode of its logic: + +.. code:: C + + struct page_info *alloc_heap_pages(unsigned int zone_lo, + unsigned int zone_hi, + unsigned int order, + unsigned int memflags, + struct domain *d) { + /* D's claims and available memory need to permit the request. */ + if (!claims_permit_request(1UL << order, total_avail_pages, memflags, + NUMA_NO_NODE, d, outstanding_claims)) + return NULL; + + /* Find a suitable buddy block. Pass the zone range, order and + * memflags so the helper can apply node and zone selection. */ + pg = get_free_buddy(zone_lo, zone_hi, order, memflags, d); + if (!pg) + return NULL; + + redeem_claims_for_allocation(d, 1UL << order, node_of(pg)); + update_counters_and_stats(d, order); + if (pg_has_dirty_pages(pg)) + scrub_dirty_pages(pg); + return pg; + } + +.. 
+.. c:function:: struct page_info *get_free_buddy(zone_lo, zone_hi, order, \
+                memflags, domain)
+
+   :param zone_lo: The lowest zone index to consider for allocation
+   :param zone_hi: The highest zone index to consider for allocation
+   :param order: The order of the pages to allocate (2^order pages)
+   :param memflags: Flags for conducting the allocation
+   :param domain: The domain to allocate memory for, or NULL
+   :type zone_lo: unsigned int
+   :type zone_hi: unsigned int
+   :type order: unsigned int
+   :type memflags: unsigned int
+   :type domain: struct domain *
+   :returns: The allocated page_info structure, or NULL on failure
+
+   This function finds a suitable block of free pages in the buddy
+   allocator while respecting claims and node-level available memory.
+
+   Called by :c:func:`alloc_heap_pages()` after verifying the request is
+   permissible, it iterates over nodes and zones to find a buddy block
+   that satisfies the request. It checks node-local claims before
+   attempting allocation from a node.
+
+   Using :c:func:`claims_permit_request()`, it checks whether the node
+   has enough unclaimed memory to satisfy the request or whether the
+   domain's claims can permit the request on that node after accounting
+   for outstanding claims.
+
+   If the node can satisfy the request, it searches for a suitable block
+   in the specified zones. If found, it returns the block; otherwise it
+   tries the next node until all online nodes are exhausted.
+
+   Simplified pseudocode of its logic:
+
+.. code:: C
+
+    /*
+     * preferred_node_or_next_node() represents the policy to first try the
+     * preferred/requested node then fall back to other online nodes.
+     */
+    struct page_info *get_free_buddy(unsigned int zone_lo,
+                                     unsigned int zone_hi,
+                                     unsigned int order,
+                                     unsigned int memflags,
+                                     const struct domain *d) {
+        nodeid_t request_node = MEMF_get_node(memflags);
+        struct page_info *pg;
+
+        /*
+         * Iterate over candidate nodes: start with the preferred node (if
+         * any), then try other online nodes according to the normal
+         * placement policy.
+         */
+        while (there are more nodes to try) {
+            nodeid_t node = preferred_node_or_next_node(request_node);
+            unsigned long avail_pages = node_avail_pages[node] -
+                                        node_outstanding_claims[node] +
+                                        ((d && !(memflags & MEMF_no_refcount))
+                                         ? d->claims[node] : 0);
+
+            /* Ensure the target node and the claims permit this allocation */
+            if ( avail_pages < (1UL << order) )
+                goto next_node;
+
+            /* Find a zone on this node with a suitable buddy */
+            for (int zone = highest_zone; zone >= lowest_zone; zone--)
+                for (int j = order; j <= MAX_ORDER; j++)
+                    if ((pg = remove_head(&heap(node, zone, j))) != NULL)
+                        return pg;
+        next_node:
+            if (request_node != NUMA_NO_NODE && (memflags & MEMF_exact_node))
+                return NULL;
+            /* Fall back to the next node and repeat. */
+        }
+        return NULL;
+    }
+
+.. note:: The actual implementation includes additional details,
+   but the pseudocode captures the core logic of checking claims
+   and available memory while searching for a suitable buddy.
+
+******************************************
+Offlining memory in the presence of claims
+******************************************
+
+When offlining pages, Xen must ensure that the available memory on a node
+and the total number of free pages do not fall below their respective
+outstanding claims. If they do, Xen recalls claims from domains until the
+accounting is valid again.
+
+This is triggered by privileged domains via the
+``XEN_SYSCTL_page_offline_op`` sysctl or by machine-check memory errors.
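+
+A sketch of this recall flow is shown below; the individual checks it
+performs are described in detail in the following paragraphs. The function
+name and the domain-lookup helpers are hypothetical and for illustration
+only:
+
+.. code:: C
+
+    /* Sketch: recall claims after offlining pages on 'node' until the
+     * node-level and host-level accounting invariants hold again. */
+    static void recall_claims_after_offline(nodeid_t node)
+    {
+        struct domain *d;
+
+        /* Node invariant: claims on the node <= available pages on it. */
+        while ( node_outstanding_claims[node] > node_avail_pages[node] &&
+                (d = find_domain_with_node_claims(node)) != NULL )
+            domain_release_node_claims(d, node,
+                                       node_outstanding_claims[node] -
+                                       node_avail_pages[node]);
+
+        /* Host invariant: total claims <= total available pages. */
+        while ( outstanding_claims > total_avail_pages &&
+                (d = find_domain_with_host_claims()) != NULL )
+            domain_release_host_claims(d,
+                                       outstanding_claims -
+                                       total_avail_pages);
+    }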
+
+Offlining currently allocated pages cannot remove those in-use pages from
+circulation. They are marked for offlining and are offlined when freed back
+to the allocator. However, when already free pages are directly offlined,
+the free-memory accounting and the outstanding claims may need to be
+adjusted directly, too.
+
+:c:func:`reserve_offlined_page()` needs to check whether offlining the page
+causes :c:var:`total_avail_pages` to fall below :c:var:`outstanding_claims` or
+:c:expr:`node_avail_pages[page->node]` to fall below
+:c:expr:`node_outstanding_claims[page->node]`. If so,
+:c:func:`reserve_offlined_page()` must look for domains with relevant claims
+and recall those claims until the claim accounting is valid again.
+
+- When
+  :c:expr:`node_outstanding_claims[page->node]` exceeds
+  :c:expr:`node_avail_pages[page->node]` for the offlined page,
+  :c:func:`reserve_offlined_page()` should call
+  :c:func:`domain_release_node_claims()`
+  to recall claims on that node from domains with claims on the node of the
+  offlined buddy until the claim accounting of the node is valid again.
+
+- When the total :c:var:`outstanding_claims` exceeds :c:var:`total_avail_pages`,
+  :c:func:`reserve_offlined_page()` calls
+  :c:func:`domain_release_host_claims()` to recall host-wide claims
+  from domains until the overall claims accounting is valid again.
+
+This can violate claim guarantees, but it is necessary to maintain system
+stability when memory must be offlined.
+
+.. c:function:: int reserve_offlined_page(struct page_info *head)
+
+   :param head: The page being offlined
+   :type head: struct page_info *
+   :returns: 0 on success, or a negative error code on failure.
+
+   This function is called during the offlining process to take the pages
+   of an offlined buddy out of the free lists.
+
+   If offlining a page causes available memory to fall below outstanding
+   claims, it checks the node-specific and host-wide claim accounting
+   and recalls claims from domains as necessary to ensure the accounting
+   invariants hold after a buddy is offlined.
diff --git a/docs/designs/claims/index.rst b/docs/designs/claims/index.rst
new file mode 100644
index 000000000000..218632c6e22f
--- /dev/null
+++ b/docs/designs/claims/index.rst
@@ -0,0 +1,48 @@
+.. SPDX-License-Identifier: CC-BY-4.0
+
+NUMA Claims
+===========
+
+Design and implementation of NUMA-aware claim sets.
+
+Status: Draft for review
+
+This design first introduces the external behaviour of claim sets: how claims
+are installed, how they protect allocations, and how they are redeemed.
+It then covers the underlying accounting model and implementation details.
+
+For readers following the design in order, the next sections cover these
+topics:
+
+1. :doc:`/designs/claims/use-cases` describes the use cases for claim sets.
+2. :doc:`/designs/claims/performance` describes the performance test results.
+3. :doc:`/designs/claims/development` provides the development history and future work.
+4. :doc:`/designs/claims/design` introduces the overall model and goals.
+5. :doc:`/designs/claims/installation` describes how claim sets are installed.
+6. :doc:`/designs/claims/protection` describes how claimed memory is
+   protected during allocation.
+7. :doc:`/designs/claims/redeeming` explains how claims are redeemed when
+   allocations succeed.
+8. :doc:`/designs/claims/accounting` describes the accounting model that
+   underpins those steps.
+9. :doc:`/designs/claims/implementation` documents the functions used for the
+   implementation.
+10. :doc:`/designs/claims/terminology` defines the terms used in this design.
+
+.. toctree::
+   :caption: Contents
+   :maxdepth: 2
+
+   use-cases
+   performance
+   development
+   design
+   installation
+   protection
+   redeeming
+   accounting
+   implementation
+   terminology
+
+.. contents::
+   :backlinks: entry
+   :local:
diff --git a/docs/designs/claims/installation.rst b/docs/designs/claims/installation.rst
new file mode 100644
index 000000000000..2073da2c33ee
--- /dev/null
+++ b/docs/designs/claims/installation.rst
@@ -0,0 +1,70 @@
+.. SPDX-License-Identifier: CC-BY-4.0
+
+Installation
+############
+
+**********
+Claim sets
+**********
+
+A claim set is an array of :c:type:`memory_claim_t` entries.
+
+.. c:type:: memory_claim_t
+
+   The ``typedef`` for :c:type:`xen_memory_claim`, used for
+   passing an array of claim set entries to the hypervisor.
+
+.. c:struct:: xen_memory_claim
+
+   Underlying structure for passing claim sets to the hypervisor.
+
+   This structure represents an individual claim entry in a claim set.
+   It specifies the number of pages claimed and the target of the claim,
+   which can be a specific NUMA node or a special value for host-wide claims.
+
+   The structure includes padding for future expansion. It is important to
+   zero-initialise it or to use designated initialisers to ensure forward
+   compatibility. Members are as follows:
+
+   .. c:member:: uint64_aligned_t pages
+
+      Number of pages for this claim entry.
+
+   .. c:member:: uint32_t cmd
+
+      Command field reserved for future use. It must be initialised to 0
+      for forward compatibility.
+
+   .. c:member:: uint32_t target
+
+      The target of the claim entry. It can be a special selector, which could
+      in the future include flags and additional information, or simply a NUMA
+      node ID.
+
+      See :ref:`guest-guide/dom/DOMCTL_claim_memory:Hypercall API`
+      for the defined special selectors and their semantics.
+
+.. c:type:: uint64_aligned_t
+
+   64-bit unsigned integer type with alignment requirements suitable for
+   representing page counts in the claim structure.
+
+**********************
+Claim set installation
+**********************
+
+Claim set installation is invoked via :c:macro:`XEN_DOMCTL_claim_memory`, and
+:c:func:`domain_install_claim_set()` implements the claim set installation
+logic.
+
+See :doc:`accounting` for details on the claims accounting state.
+
+*************************
+Legacy claim installation
+*************************
+
+Legacy claims are set via the :c:macro:`XENMEM_claim_pages` hypercall command.
+
+.. note:: The legacy path is deprecated.
+   Use :c:macro:`XEN_DOMCTL_claim_memory` for new code.
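+
+As an illustration, a :term:`domain builder` might build a claim set with
+designated initialisers as sketched below. The page counts and the variable
+name are examples only; only the structure fields and the
+:c:macro:`XEN_DOMCTL_CLAIM_MEMORY_TOTAL` selector are part of the interface
+described here:
+
+.. code:: C
+
+    /* Illustrative claim set: two node-specific claims plus a host-wide
+     * claim for the shortfall. Designated initialisers leave the reserved
+     * cmd field zeroed, as required for forward compatibility. */
+    memory_claim_t claims[] = {
+        { .pages = 1UL << 18, .target = 0 },  /* 1 GiB in 4 KiB pages, node 0 */
+        { .pages = 1UL << 18, .target = 1 },  /* 1 GiB in 4 KiB pages, node 1 */
+        { .pages = 1UL << 16,                 /* 256 MiB from any node */
+          .target = XEN_DOMCTL_CLAIM_MEMORY_TOTAL },
+    };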
diff --git a/docs/designs/claims/invariants.mmd b/docs/designs/claims/invariants.mmd
new file mode 100644
index 000000000000..317c51536ed3
--- /dev/null
+++ b/docs/designs/claims/invariants.mmd
@@ -0,0 +1,35 @@
%% SPDX-License-Identifier: CC-BY-4.0
%% Claim variables and their invariants
flowchart TD

subgraph "Access under the <tt><b>heap_lock</b></tt> only:"
    direction TB
    Memory_of_Nodes --" Contribute to "--> Overall_Memory
    Overall_Memory --" Available to "--> Memory_of_Domains
end

subgraph Memory_of_Nodes["Per-node claims and available memory"]
    direction LR
    per_node_claims -->|" less than or equal to "| node_avail_pages
    per_node_claims["Claims on the node:
    <tt>node_outstanding_claims[n]"]
    node_avail_pages["Available pages on the node:
    <tt>node_avail_pages[n]"]
end

subgraph Overall_Memory["Overall claims and available memory"]
    direction LR
    outstanding -->|" less than or equal to "| avail_pages
    outstanding["Total claims on the host:
    <tt>outstanding_claims"]
    avail_pages["Available pages on the host:
    <tt>total_avail_pages"]
end

subgraph Memory_of_Domains["Per-domain claims and available memory"]
    direction LR
    claims -->|" less than or equal to "| available_memory_for_domains
    claims["Claims of the domain:<br><tt>d->outstanding_pages"]
    available_memory_for_domains["Available pages:<br><tt>node_avail_pages[n] + total_avail_pages"]
end
diff --git a/docs/designs/claims/performance.rst b/docs/designs/claims/performance.rst
new file mode 100644
index 000000000000..694c97ca3321
--- /dev/null
+++ b/docs/designs/claims/performance.rst
@@ -0,0 +1,33 @@
.. SPDX-License-Identifier: CC-BY-4.0

Performance
***********

The single-node claims implementation, which is the basis of the NUMA
claims v4 series and of the multi-node claim sets design, forms the
groundwork for the NUMA design and implementation in XenServer 9.

An early version of it is available as the XenServer XS9 preview
release: https://www.xenserver.com/downloads/xs9-preview.
The performance of this release has been tested in real
customer environments with customer workloads.

On dual-socket Intel servers, the **average aggregate CPU usage across
all VMs at peak times** (peak user load) was **~16% lower** than with
`XenServer 8.4`, the previous release (the overall average across all
times was **~8.5% lower**). This is a significant improvement in CPU
efficiency for memory-intensive workloads and is attributed to the
improved NUMA placement enabled by `NUMA-aware claims`.

The customer's application response time metric, which is the key
measure the customer uses for end-user observed performance, showed an
~8% improvement, matching the improvement in average CPU usage.

These numbers were observed using `Intel dual-socket servers`.
Judging by preliminary tests, the performance benefits on AMD servers
are expected to be considerably higher than the results with dual-socket
Intel servers.

The multi-node claim sets design is expected to extend these benefits
to configurations that require claiming memory from multiple NUMA nodes
adjacent to each other for optimal performance.
diff --git a/docs/designs/claims/protection.rst b/docs/designs/claims/protection.rst
new file mode 100644
index 000000000000..c7eec95b99e4
--- /dev/null
+++ b/docs/designs/claims/protection.rst
@@ -0,0 +1,200 @@
.. SPDX-License-Identifier: CC-BY-4.0

Protection
##########
.. contents:: Table of Contents
   :backlinks: entry
   :local:

Claimed memory must be protected from allocations without applicable claims
while remaining available to allocations with applicable claims.

Claims exist as long as they are outstanding, which is from the moment they
are installed until they are redeemed by allocations.

During this time, they are a commitment of memory to a domain, and the
hypervisor must ensure that this commitment is respected by protecting
claimed memory from being allocated without redeeming applicable claims.

Redeeming claims is the process of applying a portion of a domain's claims
to an allocation: the claim is exchanged for the allocated memory, the
allocation proceeds using the claimed memory, and the portion of the claim
used for the allocation is no longer outstanding.

For example, if a domain has an outstanding claim of 100 pages on a node
and redeems 20 pages of that claim for an allocation, the allocation is
satisfied using the claimed memory and 80 pages of the claim remain
outstanding.

To protect claims, the allocator performs checks to ensure that claimed
memory is not allocated without redeeming applicable claims, while still
allowing the claiming domain to allocate claimed memory by redeeming claims.

When the system is not under heavy memory pressure and not fully claimed,
the allocator can satisfy allocation requests using unclaimed memory.

However, when the system is under heavy memory pressure or nearly fully
claimed, the checks for protecting claims become critical to ensure that
claimed memory is not allocated without redeeming applicable claims.

*********************************
Reference-counting of allocations
*********************************

Claims protection distinguishes between two kinds of allocation requests.

Reference-counted requests
==========================

The request is made on behalf of a domain and the :c:expr:`memflags`
of the request do not include :c:macro:`MEMF_no_refcount`.

In this case, the request is reference-counted against the domain's
total memory allocation, and the domain's claims can be used
to protect the allocation and be redeemed for it.

For example, the allocation requests by :term:`domain builders` for the
:term:`guest physical memory` of domains are always reference-counted,
and as such can be protected and redeemed by claims to the extent
the claims are applicable and sufficient for the allocation.

Not reference-counted requests
==============================

The request is not for a domain, or the :c:expr:`memflags`
of the request include :c:macro:`MEMF_no_refcount`.

In this case, the request is not reference-counted against a domain's
memory allocation state, so the claims of a domain cannot be used to
protect the allocation or be redeemed for it.

The allocator therefore does not consider claims when checking whether
the request can be satisfied: such requests can only be satisfied using
unclaimed memory.
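As a minimal illustrative sketch (not the actual allocator code), this
distinction can be expressed as a single predicate. The function name is
hypothetical; :c:macro:`MEMF_no_refcount` and the ``struct domain`` pointer
follow the definitions above.

.. code-block:: C

    /* Whether a domain's claims may protect and be redeemed for a request. */
    static bool claims_apply(const struct domain *d, unsigned int memflags)
    {
        /* Only reference-counted requests made for a domain qualify. */
        return d != NULL && !(memflags & MEMF_no_refcount);
    }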

Callers using MEMF_no_refcount
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Example callers which use :c:macro:`MEMF_no_refcount` when allocating memory,
or use :c:macro:`MEMF_no_owner` (which also sets :c:macro:`MEMF_no_refcount`)
in the context of domains, include:

- ``p2m_alloc_page()`` for allocating pages for the physical-to-machine (p2m)
  mapping.
- ``hap_set_allocation()`` for allocating memory for hardware-assisted paging.
- ``vmx_alloc_vlapic_mapping()`` for allocating the vLAPIC page for an HVM
  guest.
- ``vmtrace_alloc_buffer()`` for allocating the buffer for VM tracing.
- ``ioreq_server_alloc_mfn()`` for allocating memory for I/O requests.

Example actions happening at runtime on the request of running domains
which use :c:macro:`MEMF_no_refcount` or :c:macro:`MEMF_no_owner` to
bypass reference-counting include:

- ``memory_exchange()`` for exchanging memory pages of a domain.
- ``gnttab_transfer()`` for transferring pages between domains.

***********************
Claim protection checks
***********************

Unless the request is an exact-node request for a node-specific claim,
the allocator performs two protective checks to protect claimed memory
from being allocated to other domains while still allowing the claiming
domain to allocate it.

Before starting, the allocator takes the global :c:var:`heap_lock`.
This ensures that any previous changes to the state of the system's
unclaimed memory and the domain's total outstanding claims are complete
and visible, and that no concurrent changes to those values can happen.

Protection of host-wide claims
==============================

The first check [1]_ the allocator performs protects host-wide claims,
which are part of the total pool of claims on the entire host:

1. Get the total amount of unclaimed memory available in the system.
   It is the sum of the free pages on all NUMA nodes
   (:c:var:`total_avail_pages`) minus the total amount of claimed
   memory across all domains (:c:var:`outstanding_claims`). This
   includes all host-wide claims and all node-specific claims.

2. Check whether the request can be satisfied by the unclaimed memory itself.

   If so, the allocation calls :c:func:`get_free_buddy()` to perform the
   node-specific checks and find free pages on the appropriate node(s)
   to satisfy the request.

   This is the common case, especially for smaller allocations and when the
   host is not under heavy memory pressure and not fully claimed.

If the request cannot proceed based on the unclaimed memory alone, the
system is under heavy memory pressure as unclaimed memory is very low,
which is where the protection of claims becomes critical.

In this situation, the allocator needs to ensure that the domain has
enough claims to redeem for this request, otherwise the request has
to fail:

1. If the request is not for a domain, or the request disables
   reference-counting (:c:macro:`MEMF_no_refcount`), the request fails.

2. If the total claims of the domain (:c:member:`domain.outstanding_pages`)
   cover the amount of claims needed to satisfy the request,
   the allocation can proceed further. Otherwise, the request fails.

Protection of node-specific claims
==================================

This check protects claimed memory on a specific node from being allocated
without sufficient claims.

After passing the host-wide claims protection check, the allocator calls
:c:func:`get_free_buddy()` to pick nodes for allocation and check each
node's suitability [2]_ for this request (illustrated by the sketch after
this list):

1. Get the amount of unclaimed memory available on that node: the
   free pages on that NUMA node (``node_avail_pages[node]``) minus the
   total amount of claimed memory across all domains for that node
   (``node_outstanding_claims[node]``).

2. If the request can be satisfied by the sum of the unclaimed memory
   on that node and the claims of the domain for that node, the allocation
   can proceed on that node; otherwise, this node cannot satisfy the request.

3. If the allocation is an exact-node request, or the allocator
   has no further nodes to consider, the allocation fails.

4. Otherwise, if the allocator has further nodes to consider for this
   request, it repeats the same process for the next node.
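A sketch of this per-node suitability check (not the actual
:c:func:`get_free_buddy()` code); the helper name is hypothetical, and
``d->claims[node]`` follows the per-domain claim array used elsewhere in
this design:

.. code-block:: C

    /* Can 'request' pages be allocated from 'node' without violating the
     * claims of other domains? Called with the heap_lock held. */
    static bool node_can_satisfy(const struct domain *d, nodeid_t node,
                                 unsigned long request)
    {
        /* Memory on this node that no domain has claimed. */
        unsigned long unclaimed = node_avail_pages[node] -
                                  node_outstanding_claims[node];

        /* The requesting domain may additionally redeem its own claims
         * on this node. */
        unsigned long redeemable = d ? d->claims[node] : 0;

        return request <= unclaimed + redeemable;
    }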
.. rubric:: Footnotes

.. [1] In principle, the host-wide check for the protection of host-wide
   claims could be skipped for node-exact requests that are
   reference-counted and covered by the claims of the domain for that node.
   This additional check would add complexity to the code, and as long as
   Xen must track global memory counters, those counters would still need
   to be accessed for all requests, so the added code could only delay the
   access to those global counters while adding more checks to all other
   requests. Therefore, it is not considered beneficial for now.

   However, if we want to replace the global :c:var:`heap_lock` serving
   as a global synchronisation point for all memory allocations with
   finer-grained (per-node) locks in the future, then this check could be
   added to allow more concurrency for node-exact allocations (and all
   free_page() calls) while still protecting claims, but that would be a
   future project, requiring significant changes to the code.

.. [2] If the request is reference-counted and covered by the claims of
   the domain for that node, the request could proceed. But that would add
   complexity to the code, and as long as Xen must track per-node memory
   counters, those counters would still need to be updated for all
   allocations from this node, so the added code could only delay the
   access to those per-node counters while adding more checks to all
   other requests. Therefore, it is not considered beneficial for now.
diff --git a/docs/designs/claims/redeeming.rst b/docs/designs/claims/redeeming.rst
new file mode 100644
index 000000000000..a5eb045c1bce
--- /dev/null
+++ b/docs/designs/claims/redeeming.rst
@@ -0,0 +1,71 @@
.. SPDX-License-Identifier: CC-BY-4.0

Redeeming
#########

.. contents:: Table of Contents
   :backlinks: entry
   :local:

After the buddy allocator has returned the pages for the allocation,
:c:func:`redeem_claims_for_allocation()` redeems claims up to the size of
the allocation in the same critical region that updates the free-page
counters.

The function performs the following steps to redeem the matching claims
for this allocation (see the sketch after the list). It ensures that the
domain's total memory allocation (:c:func:`domain_tot_pages`) plus its
outstanding claims (:c:member:`domain.outstanding_pages`) remain within
the domain's limit, defined by :c:member:`domain.max_pages`:

Steps to redeem claims for an allocation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Step 1:
   Redeem claims from :c:expr:`domain.claims[alloc_node]` on the allocation
   node, up to the size of that claim.
Step 2:
   If the allocation exceeds :c:expr:`domain.claims[alloc_node]`, redeem the
   remaining pages from the host-wide claims
   (:c:member:`domain.outstanding_pages` - :c:member:`domain.node_claims`),
   up to the size of the host-wide claims.
Step 3:
   If the allocation exceeds the combination of those claims, redeem the
   remaining pages from other per-node claims so that the domain's total
   allocation plus claims remain within the domain's
   :c:member:`domain.max_pages` limit.
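A condensed sketch of this redeeming order (not the actual
:c:func:`redeem_claims_for_allocation()` implementation); the matching
updates to the global and per-node counters are omitted for brevity, and
the field names follow this design:

.. code-block:: C

    /* Redeem up to 'pages' of the domain's claims for an allocation on
     * 'node'. Called with the heap_lock held. */
    static void redeem_for_allocation(struct domain *d, nodeid_t node,
                                      unsigned long pages)
    {
        /* Step 1: redeem the claim on the allocation node first. */
        unsigned long take = min(pages, d->claims[node]);

        d->claims[node] -= take;
        d->node_claims -= take;
        d->outstanding_pages -= take;
        pages -= take;

        /* Step 2: redeem host-wide claims, which are not bound to a node. */
        take = min(pages, d->outstanding_pages - d->node_claims);
        d->outstanding_pages -= take;
        pages -= take;

        /* Step 3: redeem any remainder from the other per-node claims. */
        for ( nodeid_t n = 0; pages && n < MAX_NUMNODES; n++ )
        {
            take = min(pages, d->claims[n]);
            d->claims[n] -= take;
            d->node_claims -= take;
            d->outstanding_pages -= take;
            pages -= take;
        }
    }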
Enforcing the domain's max_pages limit
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:c:func:`domain_tot_pages` + :c:member:`domain.outstanding_pages`
must not exceed the :c:member:`domain.max_pages` limit, otherwise
the domain could exceed its memory entitlement.

At claim installation time, :c:func:`domain_install_claim_set()` performs
this check.

.. sidebar:: Locking

   See :ref:`designs/claims/accounting:Locking of claims accounting`
   for the locks used to protect claims accounting state and invariants.

At memory allocation time
   If (unexpectedly) a domain builder ends up allocating memory from
   different nodes than it claimed from, the domain's total allocation
   plus claims could exceed the domain's :c:member:`domain.max_pages`
   limit, unless the page allocator redeems claims from other nodes
   to ensure the sum of the domain's claims and populated pages
   remains within the :c:member:`domain.max_pages` limit.

   :c:func:`redeem_claims_for_allocation()`
   cannot reliably check :c:member:`domain.max_pages` race-free because
   :c:member:`domain.max_pages` is not protected by the :c:var:`heap_lock`
   taken by the page allocator during allocation.

   To check the domain's limits, it would have to take the
   :c:member:`domain.page_alloc_lock` to inspect the domain's
   limits and its current allocation. However, taking that lock
   while holding the :c:var:`heap_lock` would invert the locking
   order and could lead to deadlocks.

   Therefore, :c:func:`redeem_claims_for_allocation()`
   redeems the remaining allocation from other-node claims in Step 3.
diff --git a/docs/designs/claims/terminology.rst b/docs/designs/claims/terminology.rst
new file mode 100644
index 000000000000..62bc32ae93b5
--- /dev/null
+++ b/docs/designs/claims/terminology.rst
@@ -0,0 +1,138 @@
.. SPDX-License-Identifier: CC-BY-4.0

Terminology
###########

.. Terms should appear in alphabetical order by their initial synonym.

.. glossary::

   claims
      Reservations of memory for :term:`domains` that are installed by
      :term:`domain builders` before :term:`populating` the domain's memory.
      Claims ensure that the reserved memory remains available for the
      :term:`domains` when allocating it, even if other :term:`domains` are
      allocating memory at the same time.

   claim set
      An array of :c:type:`memory_claim_t` entries, each specifying a page
      count and a target (either a NUMA node ID or a special value for
      host-wide claims), that can be installed atomically for a domain to
      reserve memory on multiple NUMA nodes. The chapter on
      :ref:`designs/claims/installation:claim sets` provides further
      information on the structure and semantics of claim sets.

   claim set installation
   installing claim sets
   installing claims
      The process of validating and installing a claim set for a domain under
      :c:member:`domain.page_alloc_lock` and :c:var:`heap_lock`, ensuring that
      either the entire set is accepted and installed, or the request fails
      with no side effects.
+ The chapter on :ref:`designs/claims/installation:claim set installation` + provides further information on the structure and semantics of claim sets. + + domain builders + Privileged entities (such as :term:`toolstacks` in management :term:`domains`) + responsible for constructing and configuring :term:`domains`, including + installing :term:`claims`, :term:`populating` memory, and setting up other + resources before the :term:`domains` are started. + + host-wide claims + :term:`claims` that can be satisfied from any NUMA node, required for + compatibility with existing domain builders and for use cases where + strict node-local placement is not required or not possible, such as on + UMA machines or as a fallback for memory that comes available on any node. + + libxenctrl + A low-level C API library to interact with the Xen hypervisor, to make + :term:`hypercalls`. If hypercalls are to Xen what system calls are to the + Linux kernel, then :term:`libxenctrl` is the universal, low-level system C + runtime library that provides the interface for making those hypercalls. + + libxenguest + A higher-level library, layered on top of :term:`libxenctrl`, + specifically designed for :term:`domain builders` to build and + configure :term:`domains`, including installing :term:`claims` + and :term:`populating` :term:`guest physical memory`. It provides + a more convenient and domain-builder-friendly interface for these + operations, abstracting away details of creating the architecture-specific + memory map expected by guest operating systems which were initially + written to run on the bare metal (on full hardware) and not in a + virtualized environment. + + meminit + The phase of a domain build where the guest's physical memory is populated, + which involves allocating and mapping physical memory for the domain's guest + :term:`physmap`. This should be performed after installing :term:`claims` + to protect the process against parallel allocations of other domain builder + processes in case of parallel domain builds. + + It is implemented in :term:`libxenguest` and optionally installs + :term:`claims` to ensure the claimed memory is reserved before populating + the :term:`physmap` using calls to :c:func:`xc_domain_populate_physmap()`. + + nodemask + A bitmap representing a set of NUMA nodes, used for status information + like :c:var:`node_online_map` and the :c:member:`domain.node_affinity`. + + node + NUMA node + NUMA nodes + A grouping of CPUs and memory in a NUMA architecture. NUMA nodes have + varying access latencies to memory, and NUMA-aware claims allow + :term:`domain builders` to reserve memory on specific NUMA nodes + for performance reasons. Platform firmware configures what constitutes + a NUMA node, and Xen relies on that configuration for NUMA-related features. + + When this design refers to NUMA nodes, it is referring to the NUMA nodes + as defined by the platform firmware and exposed to Xen, initialized at boot + time and not changing at runtime (so far). + + The NUMA node ID is a numeric identifier for a NUMA node, used whenever code + specifies a NUMA node, such as the target of a claim or indexing into arrays + related to NUMA nodes. + + NUMA node IDs start at 0 and are less than :c:macro:`MAX_NUMNODES`. + + Some NUMA nodes may be offline, and the :c:var:`node_online_map` is used + to track which nodes are online. 
      Currently, Xen does not support hotplug of NUMA nodes, so the set of
      online NUMA nodes is determined at boot time based on the platform
      firmware configuration and does not change at runtime.

   NUMA node affinity
      The preference of a :term:`domain` for a set of NUMA nodes, which can
      be set up by :term:`domain builders` to make :c:func:`get_free_buddy`
      (which selects the NUMA node to allocate from) prefer specific NUMA
      nodes for performance reasons.

      It is represented by :c:member:`domain.node_affinity`, which is a
      bitmap of NUMA nodes indicating the preferred NUMA nodes for the
      domain. By default, domains have NUMA node auto-affinity, which means
      their NUMA node affinity is determined automatically by the hypervisor
      based on the CPU affinity of their vCPUs, but it can be disabled and
      configured manually by domain builders.

   guest physical memory
   physmap
      The mapping of a domain's guest physical memory to the host's
      machine address space. The :term:`physmap` defines how the guest's
      physical memory corresponds to the actual memory locations on the host.

   populating
      The process of allocating and mapping physical memory for a domain's
      guest :term:`physmap`, performed by the :term:`domain builders`,
      preferably after installing :term:`claims` to protect the process
      against parallel allocations of other domain builder processes in case
      of parallel domain builds.

   toolstacks
      Privileged entities (running in privileged :term:`domains`) responsible
      for managing :term:`domains`, including building, configuring, and
      controlling their lifecycle using :term:`domain builders`. One
      toolstack may run multiple :term:`domain builders` in parallel to build
      multiple :term:`domains` at the same time.

   Xenctrl
      An OCaml library provided by Xen for :term:`domain builders` running
      in privileged :term:`domains` to interact with the hypervisor,
      including making hypercalls to install :term:`claims` and to populate
      :term:`guest physical memory`.
\ No newline at end of file
diff --git a/docs/designs/claims/use-cases.rst b/docs/designs/claims/use-cases.rst
new file mode 100644
index 000000000000..5a618f0d0280
--- /dev/null
+++ b/docs/designs/claims/use-cases.rst
@@ -0,0 +1,39 @@
.. SPDX-License-Identifier: CC-BY-4.0

#########
Use Cases
#########

.. glossary::

   Parallel domain builds
      When many domains need to be created and built, many
      :term:`domain builders` compete for the same pools of memory, which
      can lead to inefficient NUMA placement of :term:`guest physical memory`
      and thus suboptimal performance for the domains.

      NUMA-aware claims can help solve this problem and ensure that memory
      is available on the appropriate NUMA nodes.

   Domain builds
      The process of constructing and configuring :term:`domains` by
      :term:`domain builders`, which includes installing :term:`claims`,
      :term:`populating` memory, and setting up other resources before the
      :term:`domains` are started. When multiple :term:`domain builders` can
      run in parallel, this is referred to as parallel domain builds, which
      can benefit from NUMA-aware claims because the domain builders are
      competing for the same pools of memory on the NUMA nodes.

   Boot storms
      It is common for many domains to be booted at the same time, such as
      during system startup or when large numbers of domains need to be
      started.
+ + Parallel migrations + + Similar to :term:`boot storms`, except that the domains are being migrated + instead of booted, which can happen when other hosts are being drained + for maintenance (host evacuation) or when workloads are being rebalanced + across hosts. diff --git a/docs/designs/index.rst b/docs/designs/index.rst index 54d45c2bc321..1f4537957827 100644 --- a/docs/designs/index.rst +++ b/docs/designs/index.rst @@ -14,3 +14,4 @@ and for those interested in the internal workings of Xen. launch/index cache-coloring + claims/index diff --git a/docs/glossary.rst b/docs/glossary.rst index 5c3229a8c4fd..f73de9b85cf6 100644 --- a/docs/glossary.rst +++ b/docs/glossary.rst @@ -12,6 +12,7 @@ Glossary to create and manage other domains on the system. domain + domains A domain is Xen's unit of resource ownership, and generally has at the minimum some RAM and virtual CPUs. @@ -58,13 +59,18 @@ Glossary In the code, "guest context" and "guest state" is considered in terms of the CPU architecture, and contrasted against hypervisor context/state. - In this case, it refers to all code running lower privilege privilege - level the hypervisor. As such, it covers all domains, including ones + In this case, it refers to all code running lower privilege level than + the hypervisor. As such, it covers all domains, including ones providing system services. hardware domain A :term:`domain`, commonly dom0, which shares responsibility with Xen about the system as a whole. - By default it gets all devices, including all disks and network cards, so + By default, it gets all devices, including all disks and network cards, and is responsible for multiplexing guest I/O. + + hypercall + hypercalls + A mechanism for a :term:`guest` to request services from the hypervisor. + Hypercalls are analogous to system calls in a traditional operating system. 
diff --git a/docs/guest-guide/dom/DOMCTL_claim_memory-data.mmd b/docs/guest-guide/dom/DOMCTL_claim_memory-data.mmd
new file mode 100644
index 000000000000..8d45322ba939
--- /dev/null
+++ b/docs/guest-guide/dom/DOMCTL_claim_memory-data.mmd
@@ -0,0 +1,43 @@
%% SPDX-License-Identifier: CC-BY-4.0
classDiagram
class do_domctl["Args passed to <tt>do_domctl()</tt>"] {
    +uint32_t cmd: XEN_DOMCTL_claim_memory
    +uint32_t domain: Domain ID
    +xen_domctl_claim_memory: Claim set
}
class xen_domctl_claim_memory["Claim set passed to <tt>do_domctl()</tt>"] {
    +memory_claim_t* claim_set: Claim entries
    +uint32_t nr_entries: Number of claim entries
    +uint32_t mode: set or get the claim set
}
class memory_claim_t["Claim set: Array of claim entries"] {
    +pages: Pages to claim
    +target: Claim selector or node
    +cmd: always 0 for future use
}
class xc_domain_claim_memory["xc_domain_claim_memory()"] {
    +xc_interface* xch
    +uint32_t domid
    +uint32_t mode
    +uint32_t* nr_entries
    +memory_claim_t* claims
}
class outstanding_pages["Total claims of domains"] {
    global free = total_avail_pages - outstanding_claims
    node free = node_avail_pages[node] - node_outstanding_claims[node]
}
class claim["XEN_DOMCTL_claim_memory"] {
    +domain_set_outstanding_pages()
    +domain_set_node_claims()
}
class domain["Claim fields in struct domain"] {
    +outstanding_pages - Total outstanding claims of the domain
    +node_claims - Sum of claims on all nodes of the domain
    +claims[] - Array of claims on specific nodes
}
xen_domctl_claim_memory o--> memory_claim_t
do_domctl o--> xen_domctl_claim_memory
xc_domain_claim_memory ..> do_domctl: passes<br> <tt>Claim set</tt>
xc_domain_claim_memory ..> claim : calls <tt>do_domctl()</tt>
claim ..> xen_domctl_claim_memory : reads
claim ..> domain : sets
domain ..> outstanding_pages : updates outstanding claims
diff --git a/docs/guest-guide/dom/DOMCTL_claim_memory-seqdia.mmd b/docs/guest-guide/dom/DOMCTL_claim_memory-seqdia.mmd
new file mode 100644
index 000000000000..10ed8f4aa094
--- /dev/null
+++ b/docs/guest-guide/dom/DOMCTL_claim_memory-seqdia.mmd
@@ -0,0 +1,23 @@
%% SPDX-License-Identifier: CC-BY-4.0
sequenceDiagram

actor DomainBuilder
participant OcamlStub as OCaml stub for<br>xc_domain<br>claim_memory
participant Libxc as xc_domain<br>claim_memory
participant Domctl as XEN_DOMCTL<br>claim_memory
%% participant DomainLogic as claim_memory
participant Alloc as domain<br>set<br>outstanding_pages

DomainBuilder->>OcamlStub: claims
OcamlStub->>OcamlStub: marshal claims -----> OCaml to C
OcamlStub->>Libxc: claims

Libxc->>Domctl: do_domctl

Domctl->>Domctl: copy_from_guest(claim)
Domctl->>Domctl: validate claim
Domctl->>Alloc: set<br>outstanding_pages
Alloc-->>Domctl: result
Domctl-->>Libxc: rc
Libxc-->>OcamlStub: rc
OcamlStub-->>DomainBuilder: claim_result
diff --git a/docs/guest-guide/dom/DOMCTL_claim_memory-workflow.mmd b/docs/guest-guide/dom/DOMCTL_claim_memory-workflow.mmd
new file mode 100644
index 000000000000..372f2bb7a616
--- /dev/null
+++ b/docs/guest-guide/dom/DOMCTL_claim_memory-workflow.mmd
@@ -0,0 +1,23 @@
%% SPDX-License-Identifier: CC-BY-4.0
sequenceDiagram

participant Toolstack
participant Xen
participant NUMA Node memory

Toolstack->>Xen: XEN_DOMCTL_createdomain
Toolstack->>Xen: XEN_DOMCTL_max_mem(max_pages)

Toolstack->>Xen: XEN_DOMCTL_claim_memory(pages, node)
Xen->>NUMA Node memory: Claim pages on node
Xen-->>Toolstack: Claim granted

Toolstack->>Xen: XEN_DOMCTL_set_nodeaffinity(node)

loop Populate domain memory
    Toolstack->>Xen: XENMEM_populate_physmap(memflags:node)
    Xen->>NUMA Node memory: alloc from claimed node
end

Toolstack->>Xen: XEN_DOMCTL_claim_memory(0, NO_NODE)
Xen-->>Toolstack: Remaining claims released
diff --git a/docs/guest-guide/dom/DOMCTL_claim_memory.rst b/docs/guest-guide/dom/DOMCTL_claim_memory.rst
new file mode 100644
index 000000000000..c0d0070a0c58
--- /dev/null
+++ b/docs/guest-guide/dom/DOMCTL_claim_memory.rst
@@ -0,0 +1,221 @@
.. SPDX-License-Identifier: CC-BY-4.0

claim_memory
************

 .. c:macro:: XEN_DOMCTL_claim_memory

    Hypercall command for installing claim sets for a domain.

    This command allows :term:`domain builders` to install a
    :term:`claim set` for a domain, which the Xen hypervisor tracks and
    enforces during memory allocation.

    The claimed memory is protected from other allocations, so the domain's
    memory requirements can be met even when other domain builders are
    allocating memory for other domains in parallel.

    :ref:`designs/claims/installation:Claim set installation` describes how
    the hypervisor processes the claim sets installed via this hypercall
    command.

Hypercall API
-------------

See :ref:`designs/claims/installation:Claim sets`
for more details on the claim set data structure.

Definitions
^^^^^^^^^^^

Mode
~~~~
 .. c:macro:: XEN_DOMCTL_CLAIM_MEMORY_SET

    Install the given claim set for the domain.

 .. c:macro:: XEN_DOMCTL_CLAIM_MEMORY_GET

    Retrieve the domain's currently outstanding claims as a claim set.

Target selectors
~~~~~~~~~~~~~~~~
 .. c:macro:: XEN_DOMCTL_CLAIM_MEMORY_HOST

    Special target selector for host-wide claims,
    which can be satisfied from any NUMA node.

 .. c:macro:: XEN_DOMCTL_CLAIM_MEMORY_LEGACY

    Special target selector for legacy claims, which is interpreted as the
    total memory target for the domain, with existing allocations subtracted
    from it to determine the domain's new total host-wide outstanding claim.
    It is provided for compatibility with existing :term:`domain builders`
    and can only be used in a single-entry claim set.

domctl.h structure
^^^^^^^^^^^^^^^^^^

 .. code-block:: C

    struct xen_memory_claim {
        uint64_aligned_t pages; /* Number of pages to claim */
        uint32_t target; /* NUMA node or claim type like legacy or host-wide */
        uint32_t cmd;    /* Command reserved for future use, initialise to 0 */
    };
    typedef struct xen_memory_claim memory_claim_t;
    DEFINE_XEN_GUEST_HANDLE(memory_claim_t);

    /* Special claim targets for the target field of memory_claim_t */
    #define XEN_DOMCTL_CLAIM_MEMORY_HOST   0x80000000U /* Host-wide claims */
    #define XEN_DOMCTL_CLAIM_MEMORY_LEGACY 0x40000000U /* Legacy semantics */

    /*
     * XEN_DOMCTL_claim_memory
     *
     * Install a claim set to claim memory for a guest domain. Claims work
     * like tickets that are later exchanged for memory allocated to the
     * domain.
     */
    struct xen_domctl_claim_memory {
        /* IN/OUT: Array of struct xen_memory_claim */
        XEN_GUEST_HANDLE_64(memory_claim_t) claim_set;
        /* IN/OUT: Number of records in the claim_set array handle. */
        uint32_t nr_entries;
        uint32_t mode;
    #define XEN_DOMCTL_CLAIM_MEMORY_GET 0U /* Get a claim set for the domain. */
    #define XEN_DOMCTL_CLAIM_MEMORY_SET 1U /* Set a claim set for the domain. */
    };

C API by libxenctrl
-------------------
 .. c:function:: int xc_domain_claim_memory(xch, domid, mode, nr_entries, \
                 claim_set)

    :param xch: The :term:`libxenctrl` interface to use for the hypercall
    :param domid: The ID of the domain for which to install the claim set
    :param mode: The mode for the claim set installation
    :param nr_entries: Pointer to the number of entries in the claim set
    :param claim_set: The claim set to install for the domain
    :type xch: xc_interface *
    :type domid: uint32_t
    :type mode: uint32_t
    :type nr_entries: uint32_t *
    :type claim_set: memory_claim_t *
    :returns: 0 on success, or a negative error code on failure.

    C API function for installing or retrieving claim sets for a domain
    using the :c:macro:`XEN_DOMCTL_claim_memory` hypercall command.

    This function allows :term:`domain builders` to install a
    :term:`claim set` for a domain, which the Xen hypervisor
    tracks and enforces during memory allocation. It can also
    be used to retrieve the current claim set of a domain.

    When mode is :c:macro:`XEN_DOMCTL_CLAIM_MEMORY_SET`, the function
    validates and installs the given claim set. ``nr_entries`` points to
    the number of entries in the ``claim_set`` array, and ``claim_set``
    points to the array of :c:type:`memory_claim_t` entries.

    When mode is :c:macro:`XEN_DOMCTL_CLAIM_MEMORY_GET`, the function
    retrieves the current claim set into the memory pointed to by
    ``claim_set``. The number of claims retrieved is stored in the variable
    pointed to by ``nr_entries``.

    This function is part of the :term:`libxenctrl` library.

    Corresponding OCaml bindings are also available for this function in the
    :term:`Xenctrl` OCaml library, providing a convenient interface for OCaml
    :term:`domain builders` to install claim sets for a domain.

C API Usage example
^^^^^^^^^^^^^^^^^^^

 The example below shows how a domain builder can install a claim set and
 later replace or clear it. :c:type:`memory_claim_t` contains an additional
 field for future expansion; zero-initialise the structure or use designated
 initialisers to ensure forward compatibility.

 .. code-block:: C

    #include <xenctrl.h>

    void install_example_claims(xc_interface *xch, uint32_t domid)
    {
        /*
         * Claim 1024 pages on node 0, 1024 pages on node 1, and an
         * additional host-wide claim of 1024 pages which is never bound
         * to any specific node (3072 pages in total).
         */
        memory_claim_t claims[] = {
            {.pages = 1024, .target = 0},
            {.pages = 1024, .target = 1},
            {.pages = 1024, .target = XEN_DOMCTL_CLAIM_MEMORY_HOST},
        };
        uint32_t nr = ARRAY_SIZE(claims);

        xc_domain_claim_memory(xch, domid, XEN_DOMCTL_CLAIM_MEMORY_SET,
                               &nr, claims);

        /* Replace the claim set with claims on nodes 1, 2, and 3 */
        memory_claim_t claims2[] = {
            {.pages = 1024, .target = 1},
            {.pages = 1024, .target = 2},
            {.pages = 1024, .target = 3},
        };
        nr = ARRAY_SIZE(claims2);
        xc_domain_claim_memory(xch, domid, XEN_DOMCTL_CLAIM_MEMORY_SET,
                               &nr, claims2);

        /* Release all remaining claims once the domain is built */
        memory_claim_t clear[] = {
            {.pages = 0, .target = XEN_DOMCTL_CLAIM_MEMORY_HOST}
        };
        nr = ARRAY_SIZE(clear);
        xc_domain_claim_memory(xch, domid, XEN_DOMCTL_CLAIM_MEMORY_SET,
                               &nr, clear);
    }
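 A complementary sketch for the retrieval direction
 (:c:macro:`XEN_DOMCTL_CLAIM_MEMORY_GET`): the capacity of eight entries and
 passing the array capacity in via ``nr_entries`` are assumptions for
 illustration.

 .. code-block:: C

    #include <inttypes.h>
    #include <stdio.h>
    #include <xenctrl.h>

    int print_claims(xc_interface *xch, uint32_t domid)
    {
        memory_claim_t claims[8];          /* assumed illustrative capacity */
        uint32_t nr = ARRAY_SIZE(claims);  /* out: number of entries */
        int rc = xc_domain_claim_memory(xch, domid,
                                        XEN_DOMCTL_CLAIM_MEMORY_GET,
                                        &nr, claims);

        if ( rc )
            return rc;

        /* Print each outstanding claim entry of the domain. */
        for ( uint32_t i = 0; i < nr; i++ )
            printf("target %#" PRIx32 ": %" PRIu64 " pages\n",
                   claims[i].target, (uint64_t)claims[i].pages);

        return 0;
    }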
Using the Xenctrl OCaml bindings
--------------------------------

 The OCaml bindings for libxenctrl also provide an interface for installing
 claim sets using the :c:macro:`XEN_DOMCTL_claim_memory` hypercall command.

 The example below shows how to install a claim set and later release it
 using the OCaml bindings.

 .. code-block:: OCaml

    (* XEN_DOMCTL_CLAIM_MEMORY_HOST from domctl.h selects host-wide claims. *)
    let claim_memory_host = 0x8000_0000l

    let install_example_claims xch domid =
      let claims = [|
        { Xenctrl.pages = 1024L; node = 0l };
        { Xenctrl.pages = 1024L; node = 1l };
        (* Host-wide claim, not bound to a specific node. *)
        { Xenctrl.pages = 1024L; node = claim_memory_host };
      |] in
      Xenctrl.domain_claim_memory xch domid claims

    let release_all_claims xch domid =
      let clear = [|
        { Xenctrl.pages = 0L; node = claim_memory_host };
      |] in
      Xenctrl.domain_claim_memory xch domid clear

Call sequence diagram
---------------------

 The following sequence diagram illustrates the call flow for claiming memory
 for a domain using this hypercall command from an OCaml domain builder:

 .. mermaid:: DOMCTL_claim_memory-seqdia.mmd
    :caption: Sequence diagram: Call flow for claiming memory for a domain

Claim workflow
--------------

 This diagram illustrates a workflow for claiming and populating memory:

 .. mermaid:: DOMCTL_claim_memory-workflow.mmd
    :caption: Workflow diagram: Claiming and populating memory for a domain

Used functions & data structures
--------------------------------

 This diagram illustrates the key functions and data structures involved in
 installing claims via the :c:macro:`XEN_DOMCTL_claim_memory` hypercall
 command:

 .. mermaid:: DOMCTL_claim_memory-data.mmd
    :caption: Diagram: Function and data relationships for installing claims
diff --git a/docs/guest-guide/dom/index.rst b/docs/guest-guide/dom/index.rst
new file mode 100644
index 000000000000..cb33a230eb5d
--- /dev/null
+++ b/docs/guest-guide/dom/index.rst
@@ -0,0 +1,14 @@
.. SPDX-License-Identifier: CC-BY-4.0

DOMCTL Hypercalls
=================

Through DOMCTL :term:`hypercalls`, :term:`toolstacks` in privileged domains
can perform operations related to domain management. This includes operations
such as creating, destroying, and modifying domains, as well as querying
domain information.

.. toctree::
   :maxdepth: 2

   DOMCTL_claim_memory
diff --git a/docs/guest-guide/index.rst b/docs/guest-guide/index.rst
index 5455c67479cf..d9611cd7504d 100644
--- a/docs/guest-guide/index.rst
+++ b/docs/guest-guide/index.rst
@@ -3,6 +3,29 @@
 Guest documentation
 ===================

Xen exposes a set of hypercalls that allow domains and toolstacks in
privileged contexts (such as Dom0) to request services from the hypervisor.

Through these hypercalls, privileged domains can perform privileged operations
such as querying system information, memory and domain management,
and enabling inter-domain communication via shared memory and event channels.

These hypercalls are documented in the following sections, grouped by their
functionality. Each section provides an overview of the hypercalls, their
parameters, and examples of how to use them.

Hypercall API documentation
---------------------------

.. toctree::
   :maxdepth: 2

   dom/index
   mem/index

Hypercall ABI documentation
---------------------------

 .. toctree::
    :maxdepth: 2
diff --git a/docs/guest-guide/mem/XENMEM_claim_pages.rst b/docs/guest-guide/mem/XENMEM_claim_pages.rst
new file mode 100644
index 000000000000..5128317cb821
--- /dev/null
+++ b/docs/guest-guide/mem/XENMEM_claim_pages.rst
@@ -0,0 +1,102 @@
.. SPDX-License-Identifier: CC-BY-4.0
.. _XENMEM_claim_pages:

claim_pages
***********

 .. note:: This API is deprecated;
    use :c:macro:`XEN_DOMCTL_claim_memory` for new code.

 .. c:macro:: XENMEM_claim_pages

    Hypercall command for installing legacy claims.

    :ref:`designs/claims/installation:Legacy claim installation` describes
    the API for installing legacy claims via this hypercall command.

    It passes a single claim entry to the hypervisor via a
    :c:struct:`xen_memory_reservation` structure, with the page count in the
    :c:member:`xen_memory_reservation.nr_extents` field and the domain ID in
    the :c:member:`xen_memory_reservation.domid` field. The claim entry's
    target is implicitly global, and the legacy claim path is invoked in the
    hypervisor to process the claim.

Data structure for the hypercall command for installing legacy claims:

 .. c:struct:: xen_memory_reservation

    Structure for passing claim requests to the hypervisor via
    :c:macro:`XENMEM_claim_pages` and other memory :term:`hypercalls`.

    .. code-block:: C

        struct xen_memory_reservation {
            xen_pfn_t *extent_start;   /* not used for XENMEM_claim_pages */
            xen_ulong_t nr_extents;    /* page count to claim */
            unsigned int extent_order; /* must be 0 */
            unsigned int mem_flags;    /* XENMEMF flags */
            domid_t domid;             /* domain to apply the claim to */
        };
        typedef struct xen_memory_reservation xen_memory_reservation_t;

    .. c:member:: xen_ulong_t nr_extents

       For :c:macro:`XENMEM_claim_pages`, the page count to claim.

    .. c:member:: domid_t domid

       Domain ID for the claim.

    .. c:member:: unsigned int mem_flags

       Not used for :c:macro:`XENMEM_claim_pages` (must be 0).

       In principle, it supports all the :c:expr:`XENMEMF_*` flags, including
       the possibility of passing a single NUMA node ID, but using it to pass
       a NUMA node ID is not currently supported by the legacy claim path.

       An earlier revision of the NUMA extension of the legacy claim path
       used it, but reviewers requested a new hypercall instead, which is now
       :c:macro:`XEN_DOMCTL_claim_memory` with support for claim sets.

    .. c:member:: unsigned int extent_order
    .. c:member:: xen_pfn_t *extent_start

       Neither is used for :c:macro:`XENMEM_claim_pages`, but both are used
       for other memory :term:`hypercalls`.

See :ref:`designs/claims/installation:Legacy claim installation` for details.

API example using libxenctrl
----------------------------

 The example below claims pages, populates the domain,
 and then clears the claim.

 .. code-block:: C

    #include <xenctrl.h>

    int build_with_claims(xc_interface *xch, uint32_t domid,
                          unsigned long nr_pages)
    {
        int ret;

        /* Claim pages for the domain build. */
        ret = xc_domain_claim_pages(xch, domid, nr_pages);
        if ( ret < 0 )
            return ret;

        /* Populate the domain's physmap. */
        ret = xc_domain_populate_physmap(xch, domid, /* ... */);
        if ( ret < 0 )
            return ret;

        /* Release any remaining claim after populating the domain memory. */
        ret = xc_domain_claim_pages(xch, domid, 0);
        if ( ret < 0 )
            return ret;

        /* Unpause the domain to allow it to run. */
        return xc_domain_unpause(xch, domid);
    }
diff --git a/docs/guest-guide/mem/index.rst b/docs/guest-guide/mem/index.rst
new file mode 100644
index 000000000000..042fb88bfbeb
--- /dev/null
+++ b/docs/guest-guide/mem/index.rst
@@ -0,0 +1,12 @@
.. SPDX-License-Identifier: CC-BY-4.0

XENMEM Hypercalls
-----------------

The XENMEM hypercall interface allows guests to perform various control
operations related to memory management.

.. toctree::
   :maxdepth: 2

   XENMEM_claim_pages
--
2.39.5