[Xen-devel] [PATCH 1/3] xen: remove tmem from hypervisor
This patch removes all tmem related code and CONFIG_TMEM from the hypervisor. Also remove tmem hypercalls from the default XSM policy. It is written as if tmem is disabled and tmem freeable pages is 0. Signed-off-by: Wei Liu <wei.liu2@xxxxxxxxxx> --- docs/misc/tmem-internals.html | 789 ---------- tools/flask/policy/modules/dom0.te | 4 +- tools/flask/policy/modules/guest_features.te | 3 - xen/arch/arm/configs/tiny64.conf | 1 - xen/arch/x86/configs/pvshim_defconfig | 1 - xen/arch/x86/hvm/hypercall.c | 3 - xen/arch/x86/pv/hypercall.c | 3 - xen/arch/x86/setup.c | 8 - xen/common/Kconfig | 13 - xen/common/Makefile | 4 - xen/common/compat/tmem_xen.c | 23 - xen/common/domain.c | 3 - xen/common/memory.c | 4 +- xen/common/page_alloc.c | 40 +- xen/common/sysctl.c | 5 - xen/common/tmem.c | 2095 -------------------------- xen/common/tmem_control.c | 560 ------- xen/common/tmem_xen.c | 277 ---- xen/include/Makefile | 1 - xen/include/public/sysctl.h | 108 +- xen/include/public/tmem.h | 124 -- xen/include/xen/hypercall.h | 7 - xen/include/xen/sched.h | 3 - xen/include/xen/tmem.h | 45 - xen/include/xen/tmem_control.h | 39 - xen/include/xen/tmem_xen.h | 343 ----- xen/include/xlat.lst | 2 - 27 files changed, 7 insertions(+), 4501 deletions(-) delete mode 100644 docs/misc/tmem-internals.html delete mode 100644 xen/common/compat/tmem_xen.c delete mode 100644 xen/common/tmem.c delete mode 100644 xen/common/tmem_control.c delete mode 100644 xen/common/tmem_xen.c delete mode 100644 xen/include/public/tmem.h delete mode 100644 xen/include/xen/tmem.h delete mode 100644 xen/include/xen/tmem_control.h delete mode 100644 xen/include/xen/tmem_xen.h diff --git a/docs/misc/tmem-internals.html b/docs/misc/tmem-internals.html deleted file mode 100644 index 9b7e70e650..0000000000 --- a/docs/misc/tmem-internals.html +++ /dev/null @@ -1,789 +0,0 @@ -<h1>Transcendent Memory Internals in Xen</h1> -<P> -by Dan Magenheimer, Oracle Corp.</p> -<P> -Draft 0.1 -- Updated: 20100324 -<h2>Overview</h2> -<P> -This document focuses on the internal implementation of -Transcendent Memory (tmem) on Xen. It assumes -that the reader has a basic knowledge of the terminology, objectives, and -functionality of tmem and also has access to the Xen source code. -It corresponds to the Xen 4.0 release, with -patch added to support page deduplication (V2). -<P> -The primary responsibilities of the tmem implementation are to: -<ul> -<li>manage a potentially huge and extremely dynamic -number of memory pages from a potentially large number of clients (domains) -with low memory overhead and proper isolation -<li>provide quick and efficient access to these -pages with as much concurrency as possible -<li>enable efficient reclamation and <i>eviction</i> of pages (e.g. when -memory is fully utilized) -<li>optionally, increase page density through compression and/or -deduplication -<li>where necessary, properly assign and account for -memory belonging to guests to avoid malicious and/or accidental unfairness -and/or denial-of-service -<li>record utilization statistics and make them available to management tools -</ul> -<h2>Source Code Organization</h2> - -<P> -The source code in Xen that provides the tmem functionality -is divided up into four files: tmem.c, tmem.h, tmem_xen.c, and tmem_xen.h. -The files tmem.c and tmem.h are intended to -be implementation- (and hypervisor-) independent and the other two files -provide the Xen-specific code. 
This -division is intended to make it easier to port tmem functionality to other -hypervisors, though at this time porting to other hypervisors has not been -attempted. Together, these four files -total less than 4000 lines of C code. -<P> -Even ignoring the implementation-specific functionality, the -implementation-independent part of tmem has several dependencies on -library functionality (Xen source filenames in parentheses): -<ul> -<li> -a good fast general-purpose dynamic memory -allocator with bounded response time and efficient use of memory for a very -large number of sub-page allocations. To -achieve this in Xen, the bad old memory allocator was replaced with a -slightly-modified version of TLSF (xmalloc_tlsf.c), first ported to Linux by -Nitin Gupta for compcache. -<li> -good tree data structure libraries, specifically -<i>red-black</i> trees (rbtree.c) and <i>radix</i> trees (radix-tree.c). -Code for these was borrowed for Linux and adapted for tmem and Xen. -<li> -good locking and list code. Both of these existed in Xen and required -little or no change. -<li> -optionally, a good fast lossless compression -library. The Xen implementation added to -support tmem uses LZO1X (lzo.c), also ported for Linux by Nitin Gupta. -</ul> -<P> -More information about the specific functionality of these -libraries can easily be found through a search engine, via wikipedia, or in the -Xen or Linux source logs so we will not elaborate further here. - -<h2>Prefixes/Abbreviations/Glossary</h2> - -<P> -The tmem code uses several prefixes and abbreviations. -Knowledge of these will improve code readability: -<ul> -<li> -<i>tmh</i> == -transcendent memory host. Functions or -data structures that are defined by the implementation-specific code, i.e. the -Xen host code -<li> -<i>tmemc</i> -== transcendent memory control. -Functions or data structures that provide management tool functionality, -rather than core tmem operations. -<li> -<i>cli </i>or -<i>client</i> == client. -The tmem generic term for a domain or a guest OS. -</ul> -<P> -When used in prose, common tmem operations are indicated -with a different font, such as <big><kbd>put</kbd></big> -and <big><kbd>get</kbd></big>. - -<h2>Key Data Structures</h2> - -<P> -To manage a huge number of pages, efficient data structures -must be carefully selected. -<P> -Recall that a tmem-enabled guest OS may create one or more -pools with different attributes. It then -<kbd>put</kbd></big>s and <kbd>get</kbd></big>s -pages to/from this pool, identifying the page -with a <i>handle</i> that consists of a <i>pool_id</i>, an <i> -object_id</i>, and a <i>page_id </i>(sometimes -called an <i>index</i>). -This suggests a few obvious core data -structures: -<ul> -<li> -When a guest OS first calls tmem, a <i>client_t</i> is created to contain -and track all uses of tmem by that guest OS. Among -other things, a <i>client_t</i> keeps pointers -to a fixed number of pools (16 in the current Xen implementation). -<li> -When a guest OS requests a new pool, a <i>pool_t</i> is created. -Some pools are shared and are kept in a -sharelist (<i>sharelist_t</i>) which points -to all the clients that are sharing the pool. -Since an <i>object_id</i> is 64-bits, -a <i>pool_t</i> must be able to keep track -of a potentially very large number of objects. -To do so, it maintains a number of parallel trees (256 in the current -Xen implementation) and a hash algorithm is applied to the <i>object_id</i> -to select the correct tree. -Each tree element points to an object. 
-Because an <i>object_id</i> usually represents an <i>inode</i> -(a unique file number identifier), and <i>inode</i> numbers -are fairly random, though often "clumpy", a <i>red-black tree</i> -is used. -<li> -When a guest first -<kbd>put</kbd></big>s a page to a pool with an as-yet-unused <i>object_id,</i> an -<i>obj_t</i> is created. Since a <i ->page_id</i> is usually an index into a file, -it is often a small number, but may sometimes be very large (up to -32-bits). A <i>radix tree</i> is a good data structure to contain items -with this kind of index distribution. -<li> -When a page is -<kbd>put</kbd></big>, a page descriptor, or <i>pgp_t</i>, is created, which -among other things will point to the storage location where the data is kept. -In the normal case the pointer is to a <i>pfp_t</i>, which is an -implementation-specific datatype representing a physical pageframe in memory -(which in Xen is a "struct page_info"). -When deduplication is enabled, it points to -yet another data structure, a <i>pcd_</i>t -(see below). When compression is enabled -(and deduplication is not), the pointer points directly to the compressed data. -For reasons we will see shortly, each <i>pgp_t</i> that represents -an <i>ephemeral</i> page (that is, a page placed -in an <i>ephemeral</i> pool) is also placed -into two doubly-linked linked lists, one containing all ephemeral pages -<kbd>put</kbd></big> by the same client and one -containing all ephemeral pages across all clients ("global"). -<li> -When deduplication is enabled, multiple <i>pgp_</i>t's may need to point to -the same data, so another data structure (and level of indirection) is used -called a page content descriptor, or <i>pcd_t</i>. -Multiple page descriptors (<i>pgp_t</i>'s) may point to the same <i>pcd_t</i>. -The <i>pcd_t</i>, in turn, points to either a <i>pfp_t</i> -(if a full page of data), directly to a -location in memory (if the page has been compressed or trailing zeroes have -been eliminated), or even a NULL pointer (if the page contained all zeroes and -trailing zero elimination is enabled). -</ul> -<P> -The most apparent usage of this multi-layer web of data structures -is "top-down" because, in normal operation, the vast majority of tmem -operations invoked by a client are -<kbd>put</kbd></big>s and <kbd>get</kbd></big>s, which require the various -data structures to be walked starting with the <i>client_t</i>, then -a <i>pool_t</i>, then an <i>obj_t</i>, then a <i>pgd_t</i>. -However, there is another highly frequent tmem operation that is not -visible from a client: memory reclamation. -Since tmem attempts to use all spare memory in the system, it must -frequently free up, or <i>evict</i>, -pages. The eviction algorithm will be -explained in more detail later but, in brief, to free memory, ephemeral pages -are removed from the tail of one of the doubly-linked lists, which means that -all of the data structures associated with that page-to-be-removed must be -updated or eliminated and freed. As a -result, each data structure also contains a <i>back-pointer</i> -to its parent, for example every <i>obj_t</i> -contains a pointer to its containing <i>pool_t</i>. -<P> -This complex web of interconnected data structures is updated constantly and -thus extremely sensitive to careless code changes which, for example, may -result in unexpected hypervisor crashes or non-obvious memory leaks. 
-On the other hand, the code is fairly well -modularized so, once understood, it is possible to relatively easily switch out -one kind of data structure for another. -To catch problems as quickly as possible when debug is enabled, most of -the data structures are equipped with <i>sentinels</i>and many inter-function -assumptions are documented and tested dynamically -with <i>assertions</i>. -While these clutter and lengthen the tmem -code substantially, their presence has proven invaluable on many occasions. -<P> -For completeness, we should also describe a key data structure in the Xen -implementation-dependent code: the <i>tmh_page_list</i>. For security and -performance reasons, pages that are freed due to tmem operations (such -as <kbd>get</kbd></big>) are not immediately put back into Xen's pool -of free memory (aka the Xen <i>heap</i>). -Tmem pages may contain guest-private data that must be <i>scrubbed</i> before -those memory pages are released for the use of other guests. -But if a page is immediately re-used inside of tmem itself, the entire -page is overwritten with new data, so need not be scrubbed. -Since tmem is usually the most frequent -customer of the Xen heap allocation code, it would be a waste of time to scrub -a page, release it to the Xen heap, and then immediately re-allocate it -again. So, instead, tmem maintains -currently-unused pages of memory on its own free list, <i>tmh_page_list</i>, -and returns the pages to Xen only when non-tmem Xen -heap allocation requests would otherwise fail. - -<h2>Scalablility/Concurrency</h2> - -<P>Tmem has been designed to be highly scalable. -Since tmem access is invoked similarly in -many ways to asynchronous disk access, a "big SMP" tmem-aware guest -OS can, and often will, invoke tmem hypercalls simultaneously on many different -physical CPUs. And, of course, multiple -tmem-aware guests may independently and simultaneously invoke tmem -hypercalls. While the normal frequency -of tmem invocations is rarely extremely high, some tmem operations such as data -compression or lookups in a very large tree may take tens of thousands of -cycles or more to complete. Measurements -have shown that normal workloads spend no more than about 0.2% (2% with -compression enabled) of CPU time executing tmem operations. -But those familiar with OS scalability issues -recognize that even this limited execution time can create concurrency problems -in large systems and result in poorly-scalable performance. -<P> -A good locking strategy is critical to concurrency, but also -must be designed carefully to avoid deadlock and <i>livelock</i> problems. For -debugging purposes, tmem supports a "big kernel lock" which disables -concurrency altogether (enabled in Xen with "tmem_lock", but note -that this functionality is rarely tested and likely has bit-rotted). Infrequent -but invasive tmem hypercalls, such as pool creation or the control operations, -are serialized on a single <i>read-write lock</i>, called tmem_rwlock, -which must be held for writing. All other tmem operations must hold this lock -for reading, so frequent operations such as -<kbd>put</kbd></big> and <kbd>get</kbd></big> <kbd>flush</kbd></big> can execute simultaneously -as long as no invasive operations are occurring. -<P> -Once a pool has been selected, there is a per-pool -read-write lock (<i>pool_rwlock</i>) which -must be held for writing if any transformative operations might occur within -that pool, such as when an<i> obj_t</i> is -created or destroyed. 
For the highly -frequent operation of finding an<i> obj_t</i> -within a pool, pool_rwlock must be held for reading. -<P> -Once an object has been selected, there is a per-object -spinlock (<i>obj_spinlock)</i>. -This is a spinlock rather than a read-write -lock because nearly all of the most frequent tmem operations (e.g. -<kbd>put</kbd></big> and <kbd>get</kbd></big> <kbd>flush</kbd></big>) -are transformative, in -that they add or remove a page within the object. -This lock is generally taken whenever an -object lookup occurs and released when the tmem operation is complete. -<P> -Next, the per-client and global ephemeral lists are -protected by a single global spinlock (<i>eph_lists_</i>spinlock) -and the per-client persistent lists are also protected by a single global -spinlock (<i>pers_list_spinlock</i>). -And to complete the description of -implementation-independent locks, if page deduplication is enabled, all pages -for which the first byte match are contained in one of 256 trees that are -protected by one of 256 corresponding read-write locks -(<i>pcd_tree_rwlocks</i>). -<P> -In the Xen-specific code (tmem_xen.c), page frames (e.g. struct page_info) -that have been released are kept in a list (<i>tmh_page_list</i>) that -is protected by a spinlock (<i>tmh_page_list_lock</i>). -There is also an "implied" lock -associated with compression, which is likely the most time-consuming operation -in all of tmem (of course, only when compression is enabled): A compression -buffer is allocated one-per-physical-cpu early in Xen boot and a pointer to -this buffer is returned to implementation-independent code and used without a -lock. -<P> -The proper method to avoid deadlocks is to take and release -locks in a very specific predetermined order. -Unfortunately, since tmem data structures must simultaneously be -accessed "top-down" ( -<kbd>put</kbd></big> and <kbd>get</kbd></big>) -and "bottoms-up" -(memory reclamation), more complex methods must be employed: -A <i>trylock</i>mechanism is used (c.f. <i>tmem_try_to_evict_pgp()</i>), -which takes the lock if it is available but returns immediately (rather than -spinning and waiting) if the lock is not available. -When walking the ephemeral list to identify -pages to free, any page that belongs to an object that is locked is simply -skipped. Further, if the page is the -last page belonging to an object, and the pool read-write lock for the pool the -object belongs to is not available (for writing), that object is skipped. -These constraints modify the LRU algorithm -somewhat, but avoid the potential for deadlock. -<P> -Unfortunately, a livelock was still discovered in this approach: -When memory is scarce and each client is -<kbd>put</kbd></big>ting a large number of pages -for exactly one object (and thus holding the object spinlock for that object), -memory reclamation takes a very long time to determine that it is unable to -free any pages, and so the time to do a -<kbd>put</kbd></big> (which eventually fails) becomes linear to the -number of pages in the object! To avoid -this situation, a workaround was added to always ensure a minimum amount of -memory (1MB) is available before any object lock is taken for the client -invoking tmem (see <i>tmem_ensure_avail_pages()</i>). -Other such livelocks (and perhaps deadlocks) -may be lurking. -<P> -A last issue related to concurrency is atomicity of counters. -Tmem gathers a large number of -statistics. 
Some of these counters are -informational only, while some are critical to tmem operation and must be -incremented and decremented atomically to ensure, for example, that the number -of pages in a tree never goes negative if two concurrent tmem operations access -the counter exactly simultaneously. Some -of the atomic counters are used for debugging (in assertions) and perhaps need -not be atomic; fixing these may increase performance slightly by reducing -cache-coherency traffic. Similarly, some -of the non-atomic counters may yield strange results to management tools, such -as showing the total number of successful -<kbd>put</kbd></big>s as being higher than the number of -<kbd>put</kbd></big>s attempted. -These are left as exercises for future tmem implementors. - -<h2>Control and Manageability</h2> - -<P> -Tmem has a control interface to, for example, set various -parameters and obtain statistics. All -tmem control operations funnel through <i>do_tmem_control()</i> -and other functions supporting tmem control operations are prefixed -with <i>tmemc_</i>. - -<P> -During normal operation, even if only one tmem-aware guest -is running, tmem may absorb nearly all free memory in the system for its own -use. Then if a management tool wishes to -create a new guest (or migrate a guest from another system to this one), it may -notice that there is insufficient "free" memory and fail the creation -(or migration). For this reason, tmem -introduces a new tool-visible class of memory -- <i>freeable</i> memory -- -and provides a control interface to access -it. All ephemeral memory and all pages on the <i>tmh_page_list</i> -are freeable. To properly access freeable -memory, a management tool must follow a sequence of steps: -<ul> -<li> -<i>freeze</i> -tmem:When tmem is frozen, all -<kbd>put</kbd></big>s fail, which ensures that no -additional memory may be absorbed by tmem. -(See <i>tmemc_freeze_pools()</i>, and -note that individual clients may be frozen, though this functionality may be -used only rarely.) -<li> -<i>query freeable MB: </i>If all freeable memory were released to the Xen -heap, this is the amount of memory (in MB) that would be freed. -See <i>tmh_freeable_pages()</i>. -<li> -<i>flush</i>: -Tmem may be requested to flush, or relinquish, a certain amount of memory, e.g. -back to the Xen heap. This amount is -specified in KB. See <i ->tmemc_flush_mem()</i> and <i ->tmem_relinquish_npages()</i>. -<li> -At this point the management tool may allocate -the memory, e.g. using Xen's published interfaces. -<li> -<i>thaw</i> -tmem: This terminates the freeze, allowing tmem to accept -<kbd>put</kbd></big>s again. -</ul> -<P> -Extensive tmem statistics are available through tmem's -control interface (see <i>tmemc_list </i>and -the separate source for the "xm tmem-list" command and the -xen-tmem-list-parse tool). To maximize -forward/backward compatibility with future tmem and tools versions, statistical -information is passed via an ASCII interface where each individual counter is -identified by an easily parseable two-letter ASCII sequence. - -<h2>Save/Restore/Migrate</h2> - -<P> -Another piece of functionality that has a major impact on -the tmem code is support for save/restore of a tmem client and, highly related, -live migration of a tmem client. -Ephemeral pages, by definition, do not need to be saved or -live-migrated, but persistent pages are part of the state of a running VM and -so must be properly preserved. 
-<P> -When a save (or live-migrate) of a tmem-enabled VM is initiated, the first step -is for the tmem client to be frozen (see the manageability section). -Next, tmem API version information is -recorded (to avoid possible incompatibility issues as the tmem spec evolves in -the future). Then, certain high-level -tmem structural information specific to the client is recorded, including -information about the existing pools. -Finally, the contents of all persistent pages are recorded. -<P> -For live-migration, the process is somewhat more complicated. -Ignoring tmem for a moment, recall that in -live migration, the vast majority of the VM's memory is transferred while the -VM is still fully operational. During -each phase, memory pages belonging to the VM that are changed are marked and -then retransmitted during a later phase. -Eventually only a small amount of memory remains, the VM is paused, the -remaining memory is transmitted, and the VM is unpaused on the target machine. -<P> -The number of persistent tmem pages may be quite large, -possibly even larger than all the other memory used by the VM; so it is -unacceptable to transmit persistent tmem pages during the "paused" -phase of live migration. But if the VM -is still operational, it may be making calls to tmem: -A frozen tmem client will reject any -<big><kbd>put</kbd></big> operations, but tmem must -still correctly process <big><kbd>flush</kbd></big>es -(page and object), including implicit flushes due to duplicate -<big><kbd>put</kbd></big>s. -Fortunately, these operations can only -invalidate tmem pages, not overwrite tmem pages or create new pages. -So, when a live-migrate has been initiated, -the client is frozen. Then during the -"live" phase, tmem transmits all persistent pages, but also records -the handle of all persistent pages that are invalidated. -Then, during the "paused" phase, -only the handles of invalidated persistent pages are transmitted, resulting in -the invalidation on the target machine of any matching pages that were -previously transmitted during the "live" phase. -<P> -For restore (and on the target machine of a live migration), -tmem must be capable of reconstructing the internal state of the client from -the saved/migrated data. However, it is -not the client itself that is <big><kbd>put</kbd></big>'ing -the pages but the management tools conducting the restore/migration. -This slightly complicates tmem by requiring -new API calls and new functions in the implementation, but the code is -structured so that duplication is minimized. -Once all tmem data structures for the client are reconstructed, all -persistent pages are recreated and, in the case of live-migration, all -invalidations have been processed and the client has been thawed, the restored -client can be resumed. -<P> -Finally, tmem's data structures must be cluttered a bit to -support save/restore/migration. Notably, -a per-pool list of persistent pages must be maintained and, during live -migration, a per-client list of invalidated pages must be logged. -A reader of the code will note that these -lists are overlaid into space-sensitive data structures as a union, which may -be more error-prone but eliminates significant space waste. - -<h2>Miscellaneous Tmem Topics</h2> - -<P> -<i><b>Duplicate <big><kbd>puts</kbd></big></b></i>. 
-One interesting corner case that -significantly complicates the tmem source code is the possibility -of a <i>duplicate</i> -<big><kbd>put</kbd></big>, -which occurs when two -<big><kbd>put</kbd></big>s -are requested with the same handle but with possibly different data. -The tmem API addresses -<i> -<big><kbd>put</kbd></big>-<big><kbd>put</kbd></big>-<big><kbd>get</kbd></big> -coherence</i> explicitly: When a duplicate -<big><kbd>put</kbd></big> occurs, tmem may react one of two ways: (1) The -<big><kbd>put</kbd></big> may succeed with the old -data overwritten by the new data, or (2) the -<big><kbd>put</kbd></big> may be failed with the original data flushed and -neither the old nor the new data accessible. -Tmem may <i>not</i> fail the -<big><kbd>put</kbd></big> and leave the old data accessible. -<P> -When tmem has been actively working for an extended period, -system memory may be in short supply and it is possible for a memory allocation -for a page (or even a data structure such as a <i>pgd_t</i>) to fail. Thus, -for a duplicate -<big><kbd>put</kbd></big>, it may be impossible for tmem to temporarily -simultaneously maintain data structures and data for both the original -<big><kbd>put</kbd></big> and the duplicate -<big><kbd>put</kbd></big>. -When the space required for the data is -identical, tmem may be able to overwrite <i>in place </i>the old data with -the new data (option 1). But in some circumstances, such as when data -is being compressed, overwriting is not always possible and option 2 must be -performed. -<P> -<i><b>Page deduplication and trailing-zero elimination.</b></i> -When page deduplication is enabled -("tmem_dedup" option to Xen), ephemeral pages for which the contents -are identical -- whether the pages belong -to the same client or different clients -- utilize the same pageframe of -memory. In Xen environments where -multiple domains have a highly similar workload, this can save a substantial -amount of memory, allowing a much larger number of ephemeral pages to be -used. Tmem page deduplication uses -methods similar to the KSM implementation in Linux [ref], but differences between -the two are sufficiently great that tmem does not directly leverage the -code. In particular, ephemeral pages in -tmem are never dirtied, so need never be <i>copied-on-write</i>. -Like KSM, however, tmem avoids hashing, -instead employing <i>red-black trees</i> -that use the entire page contents as the <i>lookup -key</i>. There may be better ways to implement this. -<P> -Dedup'ed pages may optionally be compressed -("tmem_compress" and "tmem_dedup" Xen options specified), -to save even more space, at the cost of more time. -Additionally, <i>trailing zero elimination (tze)</i> may be applied to dedup'ed -pages. With tze, pages that contain a -significant number of zeroes at the end of the page are saved without the trailing -zeroes; an all-zero page requires no data to be saved at all. -In certain workloads that utilize a large number -of small files (and for which the last partial page of a file is padded with -zeroes), a significant space savings can be realized without the high cost of -compression/decompression. -<P> -Both compression and tze significantly complicate memory -allocation. This will be discussed more below. -<P> -<b><i>Memory accounting</i>.</b> -Accounting is boring, but poor accounting may -result in some interesting problems. In -the implementation-independent code of tmem, most data structures, page frames, -and partial pages (e.g. 
for compresssion) are <i>billed</i> to a pool, -and thus to a client. Some <i>infrastructure</i> data structures, such as -pools and clients, are allocated with <i>tmh_alloc_infra()</i>, which does not -require a pool to be specified. Two other -exceptions are page content descriptors (<i>pcd_t</i>) -and sharelists (<i>sharelist_t</i>) which -are explicitly not associated with a pool/client by specifying NULL instead of -a <i>pool_t</i>. -(Note to self: -These should probably just use the <i>tmh_alloc_infra()</i> interface too.) -As we shall see, persistent pool pages and -data structures may need to be handled a bit differently, so the -implementation-independent layer calls a different allocation/free routine for -persistent pages (e.g. <i>tmh_alloc_page_thispool()</i>) -than for ephemeral pages (e.g. <i>tmh_alloc_page()</i>). -<P> -In the Xen-specific layer, we -disregard the <i>pool_t</i> for ephemeral -pages, as we use the generic Xen heap for all ephemeral pages and data -structures.(Denial-of-service attacks -can be handled in the implementation-independent layer because ephemeral pages -are kept in per-client queues each with a counted length. -See the discussion on weights and caps below.) -However we explicitly bill persistent pages -and data structures against the client/domain that is using them. -(See the calls to the Xen routine <i>alloc_domheap_pages() </i>in tmem_xen.h; of -the first argument is a domain, the pages allocated are billed by Xen to that -domain.)This means that a Xen domain -cannot allocate even a single tmem persistent page when it is currently utilizing -its maximum assigned memory allocation! -This is reasonable for persistent pages because, even though the data is -not directly accessible by the domain, the data is permanently saved until -either the domain flushes it or the domain dies. -<P> -Note that proper accounting requires (even for ephemeral pools) that the same -pool is referenced when memory is freed as when it was allocated, even if the -ownership of a pool has been moved from one client to another (c.f. <i ->shared_pool_reassign()</i>). -The underlying Xen-specific information may -not always enforce this for ephemeral pools, but incorrect alloc/free matching -can cause some difficult-to-find memory leaks and bent pointers. -<P> -Page deduplication is not possible for persistent pools for -accounting reasons: Imagine a page that is created by persistent pool A, which -belongs to a domain that is currently well under its maximum allocation. -Then the <i>pcd_t</i>is matched by persistent pool B, which is -currently at its maximum. -Then the domain owning pool A is destroyed. -Is B beyond its maximum? -(There may be a clever way around this -problem. Exercise for the reader!) -<P> -<b><i>Memory allocation.</i></b> The implementation-independent layer assumes -there is a good fast general-purpose dynamic memory allocator with bounded -response time and efficient use of memory for a very large number of sub-page -allocations. The old xmalloc memory -allocator in Xen was not a good match for this purpose, so was replaced by the -TLSF allocator. Note that the TLSF -allocator is used only for allocations smaller than a page (and, more -precisely, no larger than <i>tmem_subpage_maxsize()</i>); -full pages are allocated by Xen's normal heap allocator. -<P> -After the TLSF allocator was integrated into Xen, more work -was required so that each client could allocate memory from a separate -independent pool. 
(See the call to <i>xmem_pool_create()</i>in -<i>tmh_client_init()</i>.) -This allows the data structures allocated for the -purpose of supporting persistent pages to be billed to the same client as the -pages themselves. It also allows partial -(e.g. compressed) pages to be properly billed. -Further, when partial page allocations cause internal fragmentation, -this fragmentation can be isolated per-client. -And, when a domain dies, full pages can be freed, rather than only -partial pages. One other change was -required in the TLSF allocator: In the original version, when a TLSF memory -pool was allocated, the first page of memory was also allocated. -Since, for a persistent pool, this page would -be billed to the client, the allocation of the first page failed if the domain -was started at its maximum memory, and this resulted in a failure to create the -memory pool. To avoid this, the code was -changed to delay the allocation of the first page until first use of the memory -pool. -<P> -<b><i>Memory allocation interdependency.</i></b> -As previously described, -pages of memory must be moveable back and forth between the Xen heap and the -tmem ephemeral lists (and page lists). -When tmem needs a page but doesn't have one, it requests one from the -Xen heap (either indirectly via xmalloc, or directly via Xen's <i ->alloc_domheap_pages()</i>). -And when Xen needs a page but doesn't have -one, it requests one from tmem (via a call to <i ->tmem_relinquish_pages()</i> in Xen's <i ->alloc_heap_pages() </i>in page_alloc.c). -This leads to a potential infinite loop! -To break this loop, a new memory flag (<i>MEMF_tmem</i>) was added to Xen -to flag and disallow the loop. -See <i>tmh_called_from_tmem()</i> -in <i>tmem_relinquish_pages()</i>. -Note that the <i ->tmem_relinquish_pages()</i> interface allows for memory requests of -order > 0 (multiple contiguous pages), but the tmem implementation disallows -any requests larger than a single page. -<P> -<b><i>LRU page reclamation</i></b>. -Ephemeral pages generally <i>age </i>in -a queue, and the space associated with the oldest -- or <i ->least-recently-used -- </i>page is reclaimed when tmem needs more -memory. But there are a few exceptions -to strict LRU queuing. First is when -removal from a queue is constrained by locks, as previously described above. -Second, when an ephemeral pool is <i>shared,</i> unlike a private ephemeral -pool, a -<big><kbd>get</kbd></big> -does not imply a -<big><kbd>flush</kbd></big> -Instead, in a shared pool, a -results in the page being promoted to the front of the queue. -Third, when a page that is deduplicated (i.e. -is referenced by more than one <i>pgp_</i>t) -reaches the end of the LRU queue, it is marked as <i ->eviction attempted</i> and promoted to the front of the queue; if it -reaches the end of the queue a second time, eviction occurs. -Note that only the <i ->pgp_</i>t is evicted; the actual data is only reclaimed if there is no -other <i>pgp_t </i>pointing to the data. -<P> -All of these modified- LRU algorithms deserve to be studied -carefully against a broad range of workloads. -<P> -<b><i>Internal fragmentation</i>.</b> -When -compression or tze is enabled, allocations between a half-page and a full-page -in size are very common and this places a great deal of pressure on even the -best memory allocator. Additionally, -problems may be caused for memory reclamation: When one tmem ephemeral page is -evicted, only a fragment of a physical page of memory might be reclaimed. 
-As a result, when compression or tze is -enabled, it may take a very large number of eviction attempts to free up a full -contiguous page of memory and so, to avoid near-infinite loops and livelocks, eviction -must be assumed to be able to fail. -While all memory allocation paths in tmem are resilient to failure, very -complex corner cases may eventually occur. -As a result, compression and tze are disabled by default and should be -used with caution until they have been tested with a much broader set of -workloads.(Note to self: The -code needs work.) -<P> -<b><i>Weights and caps</i>.</b> -Because -of the just-discussed LRU-based eviction algorithms, a client that uses tmem at -a very high frequency can quickly swamp tmem so that it provides little benefit -to a client that uses it less frequently. -To reduce the possibility of this denial-of-service, limits can be -specified via management tools that are enforced internally by tmem. -On Xen, the "xm tmem-set" command -can specify "weight=<weight>" or "cap=<cap>" -for any client. If weight is non-zero -for a client and the current percentage of ephemeral pages in use by the client -exceeds its share (as measured by the sum of weights of all clients), the next -page chosen for eviction is selected from the requesting client's ephemeral -queue, instead of the global ephemeral queue that contains pages from all -clients.(See <i>client_over_quota().</i>) -Setting a cap for a client is currently a no-op. -<P> -<b><i>Shared pools and authentication.</i></b> -When tmem was first proposed to the linux kernel mailing list -(LKML), there was concern expressed about security of shared ephemeral -pools. The initial tmem implementation only -required a client to provide a 128-bit UUID to identify a shared pool, and the -linux-side tmem implementation obtained this UUID from the superblock of the -shared filesystem (in ocfs2). It was -pointed out on LKML that the UUID was essentially a security key and any -malicious domain that guessed it would have access to any data from the shared -filesystem that found its way into tmem. -Ocfs2 has only very limited security; it is assumed that anyone who can -access the filesystem bits on the shared disk can mount the filesystem and use -it. But in a virtualized data center, -higher isolation requirements may apply. -As a result, management tools must explicitly authenticate (or may -explicitly deny) shared pool access to any client. -On Xen, this is done with the "xl -tmem-shared-auth" command. -<P> -<b><i>32-bit implementation</i>.</b> -There was some effort put into getting tmem working on a 32-bit Xen. -However, the Xen heap is limited in size on -32-bit Xen so tmem did not work very well. -There are still 32-bit ifdefs in some places in the code, but things may -have bit-rotted so using tmem on a 32-bit Xen is not recommended. - -<h2>Known Issues</h2> - -<p><b><i>Fragmentation.</i></b>When tmem -is active, all physically memory becomes <i>fragmented</i> -into individual pages. However, the Xen -memory allocator allows memory to be requested in multi-page contiguous -quantities, called order>0 allocations. -(e.g. 2<sup>order</sup> so -order==4 is sixteen contiguous pages.) -In some cases, a request for a larger order will fail gracefully if no -matching contiguous allocation is available from Xen. -As of Xen 4.0, however, there are several -critical order>0 allocation requests that do not fail gracefully. 
-Notably, when a domain is created, and -order==4 structure is required or the domain creation will fail. -And shadow paging requires many order==2 -allocations; if these fail, a PV live-migration may fail. -There are likely other such issues. -<P> -But, fragmentation can occur even without tmem if any domU does -any extensive ballooning; tmem just accelerates the fragmentation. -So the fragmentation problem must be solved -anyway. The best solution is to disallow -order>0 allocations altogether in Xen -- or at least ensure that any attempt -to allocate order>0 can fail gracefully, e.g. by falling back to a sequence -of single page allocations. However this restriction may require a major rewrite -in some of Xen's most sensitive code. -(Note that order>0 allocations during Xen boot and early in domain0 -launch are safe and, if dom0 does not enable tmem, any order>0 allocation by -dom0 is safe, until the first domU is created.) -<P> -Until Xen can be rewritten to be <i>fragmentation-safe</i>, a small hack -was added in the Xen page -allocator.(See the comment " -memory is scarce" in <i>alloc_heap_pages()</i>.) -Briefly, a portion of memory is pre-reserved -for allocations where order>0 and order<9. -(Domain creation uses 2MB pages, but fails -gracefully, and there are no other known order==9 allocations or order>9 -allocations currently in Xen.) -<P> -<b><i>NUMA</i></b>. Tmem assumes that -all memory pages are equal and any RAM page can store a page of data for any -client. This has potential performance -consequences in any NUMA machine where access to <i ->far memory</i> is significantly slower than access to <i ->near memory</i>. -On nearly all of today's servers, however, -access times to <i>far memory</i> is still -much faster than access to disk or network-based storage, and tmem's primary performance -advantage comes from the fact that paging and swapping are reduced. -So, the current tmem implementation ignores -NUMA-ness; future tmem design for NUMA machines is an exercise left for the -reader. 
- -<h2>Bibliography</h2> - -<P> -(needs work)<b style='mso-bidi-font-weight:> -<P><a href="http://oss.oracle.com/projects/tmem">http://oss.oracle.com/projects/tmem</a> diff --git a/tools/flask/policy/modules/dom0.te b/tools/flask/policy/modules/dom0.te index a347d664f8..9970f9dc08 100644 --- a/tools/flask/policy/modules/dom0.te +++ b/tools/flask/policy/modules/dom0.te @@ -10,8 +10,8 @@ allow dom0_t xen_t:xen { settime tbufcontrol readconsole clearconsole perfcontrol mtrr_add mtrr_del mtrr_read microcode physinfo quirk writeconsole readapic writeapic privprofile nonprivprofile kexec firmware sleep frequency - getidle debug getcpuinfo heap pm_op mca_op lockprof cpupool_op tmem_op - tmem_control getscheduler setscheduler + getidle debug getcpuinfo heap pm_op mca_op lockprof cpupool_op + getscheduler setscheduler }; allow dom0_t xen_t:xen2 { resource_op psr_cmt_op psr_alloc pmu_ctrl get_symbol diff --git a/tools/flask/policy/modules/guest_features.te b/tools/flask/policy/modules/guest_features.te index 9ac9780ded..1b77832aea 100644 --- a/tools/flask/policy/modules/guest_features.te +++ b/tools/flask/policy/modules/guest_features.te @@ -1,6 +1,3 @@ -# Allow all domains to use (unprivileged parts of) the tmem hypercall -allow domain_type xen_t:xen tmem_op; - # Allow all domains to use PMU (but not to change its settings --- that's what # pmu_ctrl is for) allow domain_type xen_t:xen2 pmu_use; diff --git a/xen/arch/arm/configs/tiny64.conf b/xen/arch/arm/configs/tiny64.conf index aecc55c95f..cc6d93f2f8 100644 --- a/xen/arch/arm/configs/tiny64.conf +++ b/xen/arch/arm/configs/tiny64.conf @@ -11,7 +11,6 @@ CONFIG_ARM=y # # Common Features # -# CONFIG_TMEM is not set CONFIG_SCHED_CREDIT=y # CONFIG_SCHED_CREDIT2 is not set # CONFIG_SCHED_RTDS is not set diff --git a/xen/arch/x86/configs/pvshim_defconfig b/xen/arch/x86/configs/pvshim_defconfig index a12e3d0465..9710aa6238 100644 --- a/xen/arch/x86/configs/pvshim_defconfig +++ b/xen/arch/x86/configs/pvshim_defconfig @@ -11,7 +11,6 @@ CONFIG_NR_CPUS=32 # CONFIG_HVM_FEP is not set # CONFIG_TBOOT is not set # CONFIG_KEXEC is not set -# CONFIG_TMEM is not set # CONFIG_XENOPROF is not set # CONFIG_XSM is not set # CONFIG_SCHED_CREDIT2 is not set diff --git a/xen/arch/x86/hvm/hypercall.c b/xen/arch/x86/hvm/hypercall.c index 19d126377a..b52f7b2f09 100644 --- a/xen/arch/x86/hvm/hypercall.c +++ b/xen/arch/x86/hvm/hypercall.c @@ -131,9 +131,6 @@ static const hypercall_table_t hvm_hypercall_table[] = { HYPERCALL(hvm_op), HYPERCALL(sysctl), HYPERCALL(domctl), -#ifdef CONFIG_TMEM - HYPERCALL(tmem_op), -#endif COMPAT_CALL(platform_op), #ifdef CONFIG_PV COMPAT_CALL(mmuext_op), diff --git a/xen/arch/x86/pv/hypercall.c b/xen/arch/x86/pv/hypercall.c index 5d11911735..3a67b7e64f 100644 --- a/xen/arch/x86/pv/hypercall.c +++ b/xen/arch/x86/pv/hypercall.c @@ -74,9 +74,6 @@ const hypercall_table_t pv_hypercall_table[] = { #ifdef CONFIG_KEXEC COMPAT_CALL(kexec_op), #endif -#ifdef CONFIG_TMEM - HYPERCALL(tmem_op), -#endif HYPERCALL(xenpmu_op), #ifdef CONFIG_HVM HYPERCALL(hvm_op), diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c index 9cbff22fb3..3621f986f9 100644 --- a/xen/arch/x86/setup.c +++ b/xen/arch/x86/setup.c @@ -25,7 +25,6 @@ #include <xen/dmi.h> #include <xen/pfn.h> #include <xen/nodemask.h> -#include <xen/tmem_xen.h> #include <xen/virtual_region.h> #include <xen/watchdog.h> #include <public/version.h> @@ -1478,13 +1477,6 @@ void __init noreturn __start_xen(unsigned long mbi_p) s = pfn_to_paddr(limit + 1); init_domheap_pages(s, e); } - - if ( tmem_enabled() ) - { 
- printk(XENLOG_WARNING - "TMEM physical RAM limit exceeded, disabling TMEM\n"); - tmem_disable(); - } } else end_boot_allocator(); diff --git a/xen/common/Kconfig b/xen/common/Kconfig index 68132a3a10..fb719ac237 100644 --- a/xen/common/Kconfig +++ b/xen/common/Kconfig @@ -77,19 +77,6 @@ config KEXEC If unsure, say Y. -config TMEM - def_bool y - prompt "Transcendent Memory Support" if EXPERT = "y" - ---help--- - Transcendent memory allows PV-aware guests to collaborate on memory - usage. Guests can 'swap' their memory to the hypervisor or have an - collective pool of memory shared across guests. The end result is - less memory usage by guests allowing higher guest density. - - You also have to enable it on the Xen commandline by using tmem=1 - - If unsure, say Y. - config XENOPROF def_bool y prompt "Xen Oprofile Support" if EXPERT = "y" diff --git a/xen/common/Makefile b/xen/common/Makefile index ffdfb7448d..02763290a9 100644 --- a/xen/common/Makefile +++ b/xen/common/Makefile @@ -71,10 +71,6 @@ obj-bin-$(CONFIG_X86) += $(foreach n,decompress bunzip2 unxz unlzma unlzo unlz4 obj-$(CONFIG_COMPAT) += $(addprefix compat/,domain.o kernel.o memory.o multicall.o xlat.o) -tmem-y := tmem.o tmem_xen.o tmem_control.o -tmem-$(CONFIG_COMPAT) += compat/tmem_xen.o -obj-$(CONFIG_TMEM) += $(tmem-y) - extra-y := symbols-dummy.o subdir-$(CONFIG_COVERAGE) += coverage diff --git a/xen/common/compat/tmem_xen.c b/xen/common/compat/tmem_xen.c deleted file mode 100644 index 5111fd8df6..0000000000 --- a/xen/common/compat/tmem_xen.c +++ /dev/null @@ -1,23 +0,0 @@ -/****************************************************************************** - * tmem_xen.c - * - */ - -#include <xen/lib.h> -#include <xen/sched.h> -#include <xen/domain.h> -#include <xen/guest_access.h> -#include <xen/hypercall.h> -#include <compat/tmem.h> - -CHECK_tmem_oid; - -/* - * Local variables: - * mode: C - * c-file-style: "BSD" - * c-basic-offset: 4 - * tab-width: 4 - * indent-tabs-mode: nil - * End: - */ diff --git a/xen/common/domain.c b/xen/common/domain.c index 78cc5249e8..3362ad3ad3 100644 --- a/xen/common/domain.c +++ b/xen/common/domain.c @@ -40,7 +40,6 @@ #include <public/vcpu.h> #include <xsm/xsm.h> #include <xen/trace.h> -#include <xen/tmem.h> #include <asm/setup.h> #ifdef CONFIG_X86 @@ -719,10 +718,8 @@ int domain_kill(struct domain *d) d->is_dying = DOMDYING_dying; evtchn_destroy(d); gnttab_release_mappings(d); - tmem_destroy(d->tmem_client); vnuma_destroy(d->vnuma); domain_set_outstanding_pages(d, 0); - d->tmem_client = NULL; /* fallthrough */ case DOMDYING_dying: rc = domain_relinquish_resources(d); diff --git a/xen/common/memory.c b/xen/common/memory.c index 175bd62c11..21b1e65bb9 100644 --- a/xen/common/memory.c +++ b/xen/common/memory.c @@ -18,8 +18,6 @@ #include <xen/guest_access.h> #include <xen/hypercall.h> #include <xen/errno.h> -#include <xen/tmem.h> -#include <xen/tmem_xen.h> #include <xen/numa.h> #include <xen/mem_access.h> #include <xen/trace.h> @@ -250,7 +248,7 @@ static void populate_physmap(struct memop_args *a) if ( unlikely(!page) ) { - if ( !tmem_enabled() || a->extent_order ) + if ( a->extent_order ) gdprintk(XENLOG_INFO, "Could not allocate order=%u extent: id=%d memflags=%#x (%u of %u)\n", a->extent_order, d->domain_id, a->memflags, diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c index fd3b0aaa83..bb19b026a8 100644 --- a/xen/common/page_alloc.c +++ b/xen/common/page_alloc.c @@ -135,8 +135,6 @@ #include <xen/numa.h> #include <xen/nodemask.h> #include <xen/event.h> -#include <xen/tmem.h> 
-#include <xen/tmem_xen.h> #include <public/sysctl.h> #include <public/sched.h> #include <asm/page.h> @@ -529,16 +527,6 @@ int domain_set_outstanding_pages(struct domain *d, unsigned long pages) /* how much memory is available? */ avail_pages = total_avail_pages; - /* Note: The usage of claim means that allocation from a guest *might* - * have to come from freeable memory. Using free memory is always better, if - * it is available, than using freeable memory. - * - * But that is OK as once the claim has been made, it still can take minutes - * before the claim is fully satisfied. Tmem can make use of the unclaimed - * pages during this time (to store ephemeral/freeable pages only, - * not persistent pages). - */ - avail_pages += tmem_freeable_pages(); avail_pages -= outstanding_claims; /* @@ -710,8 +698,7 @@ static void __init setup_low_mem_virq(void) static void check_low_mem_virq(void) { - unsigned long avail_pages = total_avail_pages + - tmem_freeable_pages() - outstanding_claims; + unsigned long avail_pages = total_avail_pages - outstanding_claims; if ( unlikely(avail_pages <= low_mem_virq_th) ) { @@ -940,8 +927,7 @@ static struct page_info *alloc_heap_pages( * Claimed memory is considered unavailable unless the request * is made by a domain with sufficient unclaimed pages. */ - if ( (outstanding_claims + request > - total_avail_pages + tmem_freeable_pages()) && + if ( (outstanding_claims + request > total_avail_pages) && ((memflags & MEMF_no_refcount) || !d || d->outstanding_pages < request) ) { @@ -949,22 +935,6 @@ static struct page_info *alloc_heap_pages( return NULL; } - /* - * TMEM: When available memory is scarce due to tmem absorbing it, allow - * only mid-size allocations to avoid worst of fragmentation issues. - * Others try tmem pools then fail. This is a workaround until all - * post-dom0-creation-multi-page allocations can be eliminated. - */ - if ( ((order == 0) || (order >= 9)) && - (total_avail_pages <= midsize_alloc_zone_pages) && - tmem_freeable_pages() ) - { - /* Try to free memory from tmem. */ - pg = tmem_relinquish_pages(order, memflags); - spin_unlock(&heap_lock); - return pg; - } - pg = get_free_buddy(zone_lo, zone_hi, order, memflags, d); /* Try getting a dirty buddy if we couldn't get a clean one. */ if ( !pg && !(memflags & MEMF_no_scrub) ) @@ -1444,10 +1414,6 @@ static void free_heap_pages( else pg->u.free.first_dirty = INVALID_DIRTY_IDX; - if ( tmem_enabled() ) - midsize_alloc_zone_pages = max( - midsize_alloc_zone_pages, total_avail_pages / MIDSIZE_ALLOC_FRAC); - /* Merge chunks as far as possible. 
*/ while ( order < MAX_ORDER ) { @@ -2265,7 +2231,7 @@ int assign_pages( { if ( unlikely((d->tot_pages + (1 << order)) > d->max_pages) ) { - if ( !tmem_enabled() || order != 0 || d->tot_pages != d->max_pages ) + if ( order != 0 || d->tot_pages != d->max_pages ) gprintk(XENLOG_INFO, "Over-allocation for domain %u: " "%u > %u\n", d->domain_id, d->tot_pages + (1 << order), d->max_pages); diff --git a/xen/common/sysctl.c b/xen/common/sysctl.c index c0aa6bde4e..765effde8d 100644 --- a/xen/common/sysctl.c +++ b/xen/common/sysctl.c @@ -13,7 +13,6 @@ #include <xen/domain.h> #include <xen/event.h> #include <xen/domain_page.h> -#include <xen/tmem.h> #include <xen/trace.h> #include <xen/console.h> #include <xen/iocap.h> @@ -456,10 +455,6 @@ long do_sysctl(XEN_GUEST_HANDLE_PARAM(xen_sysctl_t) u_sysctl) } #endif - case XEN_SYSCTL_tmem_op: - ret = tmem_control(&op->u.tmem_op); - break; - case XEN_SYSCTL_livepatch_op: ret = livepatch_op(&op->u.livepatch); if ( ret != -ENOSYS && ret != -EOPNOTSUPP ) diff --git a/xen/common/tmem.c b/xen/common/tmem.c deleted file mode 100644 index c077f87e77..0000000000 --- a/xen/common/tmem.c +++ /dev/null @@ -1,2095 +0,0 @@ -/****************************************************************************** - * tmem.c - * - * Transcendent memory - * - * Copyright (c) 2009, Dan Magenheimer, Oracle Corp. - */ - -/* TODO list: 090129 (updated 100318) - - any better reclamation policy? - - use different tlsf pools for each client (maybe each pool) - - test shared access more completely (ocfs2) - - add feedback-driven compression (not for persistent pools though!) - - add data-structure total bytes overhead stats - */ - -#ifdef __XEN__ -#include <xen/tmem_xen.h> /* host-specific (eg Xen) code goes here. */ -#endif - -#include <public/sysctl.h> -#include <xen/tmem.h> -#include <xen/rbtree.h> -#include <xen/radix-tree.h> -#include <xen/list.h> -#include <xen/init.h> - -#define TMEM_SPEC_VERSION 1 - -struct tmem_statistics tmem_stats = { - .global_obj_count = ATOMIC_INIT(0), - .global_pgp_count = ATOMIC_INIT(0), - .global_pcd_count = ATOMIC_INIT(0), - .global_page_count = ATOMIC_INIT(0), - .global_rtree_node_count = ATOMIC_INIT(0), -}; - -/************ CORE DATA STRUCTURES ************************************/ - -struct tmem_object_root { - struct xen_tmem_oid oid; - struct rb_node rb_tree_node; /* Protected by pool->pool_rwlock. */ - unsigned long objnode_count; /* Atomicity depends on obj_spinlock. */ - long pgp_count; /* Atomicity depends on obj_spinlock. */ - struct radix_tree_root tree_root; /* Tree of pages within object. */ - struct tmem_pool *pool; - domid_t last_client; - spinlock_t obj_spinlock; -}; - -struct tmem_object_node { - struct tmem_object_root *obj; - struct radix_tree_node rtn; -}; - -struct tmem_page_descriptor { - union { - struct list_head global_eph_pages; - struct list_head client_inv_pages; - }; - union { - struct { - union { - struct list_head client_eph_pages; - struct list_head pool_pers_pages; - }; - struct tmem_object_root *obj; - } us; - struct xen_tmem_oid inv_oid; /* Used for invalid list only. */ - }; - pagesize_t size; /* 0 == PAGE_SIZE (pfp), -1 == data invalid, - else compressed data (cdata). */ - uint32_t index; - bool eviction_attempted; /* CHANGE TO lifetimes? (settable). */ - union { - struct page_info *pfp; /* Page frame pointer. */ - char *cdata; /* Compressed data. */ - struct tmem_page_content_descriptor *pcd; /* Page dedup. */ - }; - union { - uint64_t timestamp; - uint32_t pool_id; /* Used for invalid list only. 
*/ - }; -}; - -#define PCD_TZE_MAX_SIZE (PAGE_SIZE - (PAGE_SIZE/64)) - -struct tmem_page_content_descriptor { - union { - struct page_info *pfp; /* Page frame pointer. */ - char *cdata; /* If compression_enabled. */ - }; - pagesize_t size; /* If compression_enabled -> 0<size<PAGE_SIZE (*cdata) - * else if tze, 0<=size<PAGE_SIZE, rounded up to mult of 8 - * else PAGE_SIZE -> *pfp. */ -}; - -static int tmem_initialized = 0; - -struct xmem_pool *tmem_mempool = 0; -unsigned int tmem_mempool_maxalloc = 0; - -DEFINE_SPINLOCK(tmem_page_list_lock); -PAGE_LIST_HEAD(tmem_page_list); -unsigned long tmem_page_list_pages = 0; - -DEFINE_RWLOCK(tmem_rwlock); -static DEFINE_SPINLOCK(eph_lists_spinlock); /* Protects global AND clients. */ -static DEFINE_SPINLOCK(pers_lists_spinlock); - -#define ASSERT_SPINLOCK(_l) ASSERT(spin_is_locked(_l)) -#define ASSERT_WRITELOCK(_l) ASSERT(rw_is_write_locked(_l)) - - atomic_t client_weight_total; - -struct tmem_global tmem_global = { - .ephemeral_page_list = LIST_HEAD_INIT(tmem_global.ephemeral_page_list), - .client_list = LIST_HEAD_INIT(tmem_global.client_list), - .client_weight_total = ATOMIC_INIT(0), -}; - -/* - * There two types of memory allocation interfaces in tmem. - * One is based on xmem_pool and the other is used for allocate a whole page. - * Both of them are based on the lowlevel function __tmem_alloc_page/_thispool(). - * The call trace of alloc path is like below. - * Persistant pool: - * 1.tmem_malloc() - * > xmem_pool_alloc() - * > tmem_persistent_pool_page_get() - * > __tmem_alloc_page_thispool() - * 2.tmem_alloc_page() - * > __tmem_alloc_page_thispool() - * - * Ephemeral pool: - * 1.tmem_malloc() - * > xmem_pool_alloc() - * > tmem_mempool_page_get() - * > __tmem_alloc_page() - * 2.tmem_alloc_page() - * > __tmem_alloc_page() - * - * The free path is done in the same manner. 
- */ -static void *tmem_malloc(size_t size, struct tmem_pool *pool) -{ - void *v = NULL; - - if ( (pool != NULL) && is_persistent(pool) ) { - if ( pool->client->persistent_pool ) - v = xmem_pool_alloc(size, pool->client->persistent_pool); - } - else - { - ASSERT( size < tmem_mempool_maxalloc ); - ASSERT( tmem_mempool != NULL ); - v = xmem_pool_alloc(size, tmem_mempool); - } - if ( v == NULL ) - tmem_stats.alloc_failed++; - return v; -} - -static void tmem_free(void *p, struct tmem_pool *pool) -{ - if ( pool == NULL || !is_persistent(pool) ) - { - ASSERT( tmem_mempool != NULL ); - xmem_pool_free(p, tmem_mempool); - } - else - { - ASSERT( pool->client->persistent_pool != NULL ); - xmem_pool_free(p, pool->client->persistent_pool); - } -} - -static struct page_info *tmem_alloc_page(struct tmem_pool *pool) -{ - struct page_info *pfp = NULL; - - if ( pool != NULL && is_persistent(pool) ) - pfp = __tmem_alloc_page_thispool(pool->client->domain); - else - pfp = __tmem_alloc_page(); - if ( pfp == NULL ) - tmem_stats.alloc_page_failed++; - else - atomic_inc_and_max(global_page_count); - return pfp; -} - -static void tmem_free_page(struct tmem_pool *pool, struct page_info *pfp) -{ - ASSERT(pfp); - if ( pool == NULL || !is_persistent(pool) ) - __tmem_free_page(pfp); - else - __tmem_free_page_thispool(pfp); - atomic_dec_and_assert(global_page_count); -} - -static void *tmem_mempool_page_get(unsigned long size) -{ - struct page_info *pi; - - ASSERT(size == PAGE_SIZE); - if ( (pi = __tmem_alloc_page()) == NULL ) - return NULL; - return page_to_virt(pi); -} - -static void tmem_mempool_page_put(void *page_va) -{ - ASSERT(IS_PAGE_ALIGNED(page_va)); - __tmem_free_page(virt_to_page(page_va)); -} - -static int __init tmem_mempool_init(void) -{ - tmem_mempool = xmem_pool_create("tmem", tmem_mempool_page_get, - tmem_mempool_page_put, PAGE_SIZE, 0, PAGE_SIZE); - if ( tmem_mempool ) - tmem_mempool_maxalloc = xmem_pool_maxalloc(tmem_mempool); - return tmem_mempool != NULL; -} - -/* Persistent pools are per-domain. */ -static void *tmem_persistent_pool_page_get(unsigned long size) -{ - struct page_info *pi; - struct domain *d = current->domain; - - ASSERT(size == PAGE_SIZE); - if ( (pi = __tmem_alloc_page_thispool(d)) == NULL ) - return NULL; - ASSERT(IS_VALID_PAGE(pi)); - return page_to_virt(pi); -} - -static void tmem_persistent_pool_page_put(void *page_va) -{ - struct page_info *pi; - - ASSERT(IS_PAGE_ALIGNED(page_va)); - pi = mfn_to_page(_mfn(virt_to_mfn(page_va))); - ASSERT(IS_VALID_PAGE(pi)); - __tmem_free_page_thispool(pi); -} - -/* - * Page content descriptor manipulation routines. - */ -#define NOT_SHAREABLE ((uint16_t)-1UL) - -/************ PAGE DESCRIPTOR MANIPULATION ROUTINES *******************/ - -/* Allocate a struct tmem_page_descriptor and associate it with an object. 
*/ -static struct tmem_page_descriptor *pgp_alloc(struct tmem_object_root *obj) -{ - struct tmem_page_descriptor *pgp; - struct tmem_pool *pool; - - ASSERT(obj != NULL); - ASSERT(obj->pool != NULL); - pool = obj->pool; - if ( (pgp = tmem_malloc(sizeof(struct tmem_page_descriptor), pool)) == NULL ) - return NULL; - pgp->us.obj = obj; - INIT_LIST_HEAD(&pgp->global_eph_pages); - INIT_LIST_HEAD(&pgp->us.client_eph_pages); - pgp->pfp = NULL; - pgp->size = -1; - pgp->index = -1; - pgp->timestamp = get_cycles(); - atomic_inc_and_max(global_pgp_count); - atomic_inc(&pool->pgp_count); - if ( _atomic_read(pool->pgp_count) > pool->pgp_count_max ) - pool->pgp_count_max = _atomic_read(pool->pgp_count); - return pgp; -} - -static struct tmem_page_descriptor *pgp_lookup_in_obj(struct tmem_object_root *obj, uint32_t index) -{ - ASSERT(obj != NULL); - ASSERT_SPINLOCK(&obj->obj_spinlock); - ASSERT(obj->pool != NULL); - return radix_tree_lookup(&obj->tree_root, index); -} - -static void pgp_free_data(struct tmem_page_descriptor *pgp, struct tmem_pool *pool) -{ - pagesize_t pgp_size = pgp->size; - - if ( pgp->pfp == NULL ) - return; - if ( pgp_size ) - tmem_free(pgp->cdata, pool); - else - tmem_free_page(pgp->us.obj->pool,pgp->pfp); - if ( pool != NULL && pgp_size ) - { - pool->client->compressed_pages--; - pool->client->compressed_sum_size -= pgp_size; - } - pgp->pfp = NULL; - pgp->size = -1; -} - -static void __pgp_free(struct tmem_page_descriptor *pgp, struct tmem_pool *pool) -{ - pgp->us.obj = NULL; - pgp->index = -1; - tmem_free(pgp, pool); -} - -static void pgp_free(struct tmem_page_descriptor *pgp) -{ - struct tmem_pool *pool = NULL; - - ASSERT(pgp->us.obj != NULL); - ASSERT(pgp->us.obj->pool != NULL); - ASSERT(pgp->us.obj->pool->client != NULL); - - pool = pgp->us.obj->pool; - if ( !is_persistent(pool) ) - { - ASSERT(list_empty(&pgp->global_eph_pages)); - ASSERT(list_empty(&pgp->us.client_eph_pages)); - } - pgp_free_data(pgp, pool); - atomic_dec_and_assert(global_pgp_count); - atomic_dec(&pool->pgp_count); - ASSERT(_atomic_read(pool->pgp_count) >= 0); - pgp->size = -1; - if ( is_persistent(pool) && pool->client->info.flags.u.migrating ) - { - pgp->inv_oid = pgp->us.obj->oid; - pgp->pool_id = pool->pool_id; - return; - } - __pgp_free(pgp, pool); -} - -/* Remove pgp from global/pool/client lists and free it. */ -static void pgp_delist_free(struct tmem_page_descriptor *pgp) -{ - struct client *client; - uint64_t life; - - ASSERT(pgp != NULL); - ASSERT(pgp->us.obj != NULL); - ASSERT(pgp->us.obj->pool != NULL); - client = pgp->us.obj->pool->client; - ASSERT(client != NULL); - - /* Delist pgp. 
*/ - if ( !is_persistent(pgp->us.obj->pool) ) - { - spin_lock(&eph_lists_spinlock); - if ( !list_empty(&pgp->us.client_eph_pages) ) - client->eph_count--; - ASSERT(client->eph_count >= 0); - list_del_init(&pgp->us.client_eph_pages); - if ( !list_empty(&pgp->global_eph_pages) ) - tmem_global.eph_count--; - ASSERT(tmem_global.eph_count >= 0); - list_del_init(&pgp->global_eph_pages); - spin_unlock(&eph_lists_spinlock); - } - else - { - if ( client->info.flags.u.migrating ) - { - spin_lock(&pers_lists_spinlock); - list_add_tail(&pgp->client_inv_pages, - &client->persistent_invalidated_list); - if ( pgp != pgp->us.obj->pool->cur_pgp ) - list_del_init(&pgp->us.pool_pers_pages); - spin_unlock(&pers_lists_spinlock); - } - else - { - spin_lock(&pers_lists_spinlock); - list_del_init(&pgp->us.pool_pers_pages); - spin_unlock(&pers_lists_spinlock); - } - } - life = get_cycles() - pgp->timestamp; - pgp->us.obj->pool->sum_life_cycles += life; - - /* Free pgp. */ - pgp_free(pgp); -} - -/* Called only indirectly by radix_tree_destroy. */ -static void pgp_destroy(void *v) -{ - struct tmem_page_descriptor *pgp = (struct tmem_page_descriptor *)v; - - pgp->us.obj->pgp_count--; - pgp_delist_free(pgp); -} - -static int pgp_add_to_obj(struct tmem_object_root *obj, uint32_t index, struct tmem_page_descriptor *pgp) -{ - int ret; - - ASSERT_SPINLOCK(&obj->obj_spinlock); - ret = radix_tree_insert(&obj->tree_root, index, pgp); - if ( !ret ) - obj->pgp_count++; - return ret; -} - -static struct tmem_page_descriptor *pgp_delete_from_obj(struct tmem_object_root *obj, uint32_t index) -{ - struct tmem_page_descriptor *pgp; - - ASSERT(obj != NULL); - ASSERT_SPINLOCK(&obj->obj_spinlock); - ASSERT(obj->pool != NULL); - pgp = radix_tree_delete(&obj->tree_root, index); - if ( pgp != NULL ) - obj->pgp_count--; - ASSERT(obj->pgp_count >= 0); - - return pgp; -} - -/************ RADIX TREE NODE MANIPULATION ROUTINES *******************/ - -/* Called only indirectly from radix_tree_insert. */ -static struct radix_tree_node *rtn_alloc(void *arg) -{ - struct tmem_object_node *objnode; - struct tmem_object_root *obj = (struct tmem_object_root *)arg; - - ASSERT(obj->pool != NULL); - objnode = tmem_malloc(sizeof(struct tmem_object_node),obj->pool); - if (objnode == NULL) - return NULL; - objnode->obj = obj; - memset(&objnode->rtn, 0, sizeof(struct radix_tree_node)); - if (++obj->pool->objnode_count > obj->pool->objnode_count_max) - obj->pool->objnode_count_max = obj->pool->objnode_count; - atomic_inc_and_max(global_rtree_node_count); - obj->objnode_count++; - return &objnode->rtn; -} - -/* Called only indirectly from radix_tree_delete/destroy. 
*/ -static void rtn_free(struct radix_tree_node *rtn, void *arg) -{ - struct tmem_pool *pool; - struct tmem_object_node *objnode; - - ASSERT(rtn != NULL); - objnode = container_of(rtn,struct tmem_object_node,rtn); - ASSERT(objnode->obj != NULL); - ASSERT_SPINLOCK(&objnode->obj->obj_spinlock); - pool = objnode->obj->pool; - ASSERT(pool != NULL); - pool->objnode_count--; - objnode->obj->objnode_count--; - objnode->obj = NULL; - tmem_free(objnode, pool); - atomic_dec_and_assert(global_rtree_node_count); -} - -/************ POOL OBJECT COLLECTION MANIPULATION ROUTINES *******************/ - -static int oid_compare(struct xen_tmem_oid *left, - struct xen_tmem_oid *right) -{ - if ( left->oid[2] == right->oid[2] ) - { - if ( left->oid[1] == right->oid[1] ) - { - if ( left->oid[0] == right->oid[0] ) - return 0; - else if ( left->oid[0] < right->oid[0] ) - return -1; - else - return 1; - } - else if ( left->oid[1] < right->oid[1] ) - return -1; - else - return 1; - } - else if ( left->oid[2] < right->oid[2] ) - return -1; - else - return 1; -} - -static void oid_set_invalid(struct xen_tmem_oid *oidp) -{ - oidp->oid[0] = oidp->oid[1] = oidp->oid[2] = -1UL; -} - -static unsigned oid_hash(struct xen_tmem_oid *oidp) -{ - return (tmem_hash(oidp->oid[0] ^ oidp->oid[1] ^ oidp->oid[2], - BITS_PER_LONG) & OBJ_HASH_BUCKETS_MASK); -} - -/* Searches for object==oid in pool, returns locked object if found. */ -static struct tmem_object_root * obj_find(struct tmem_pool *pool, - struct xen_tmem_oid *oidp) -{ - struct rb_node *node; - struct tmem_object_root *obj; - -restart_find: - read_lock(&pool->pool_rwlock); - node = pool->obj_rb_root[oid_hash(oidp)].rb_node; - while ( node ) - { - obj = container_of(node, struct tmem_object_root, rb_tree_node); - switch ( oid_compare(&obj->oid, oidp) ) - { - case 0: /* Equal. */ - if ( !spin_trylock(&obj->obj_spinlock) ) - { - read_unlock(&pool->pool_rwlock); - goto restart_find; - } - read_unlock(&pool->pool_rwlock); - return obj; - case -1: - node = node->rb_left; - break; - case 1: - node = node->rb_right; - } - } - read_unlock(&pool->pool_rwlock); - return NULL; -} - -/* Free an object that has no more pgps in it. */ -static void obj_free(struct tmem_object_root *obj) -{ - struct tmem_pool *pool; - struct xen_tmem_oid old_oid; - - ASSERT_SPINLOCK(&obj->obj_spinlock); - ASSERT(obj != NULL); - ASSERT(obj->pgp_count == 0); - pool = obj->pool; - ASSERT(pool != NULL); - ASSERT(pool->client != NULL); - ASSERT_WRITELOCK(&pool->pool_rwlock); - if ( obj->tree_root.rnode != NULL ) /* May be a "stump" with no leaves. 
*/ - radix_tree_destroy(&obj->tree_root, pgp_destroy); - ASSERT((long)obj->objnode_count == 0); - ASSERT(obj->tree_root.rnode == NULL); - pool->obj_count--; - ASSERT(pool->obj_count >= 0); - obj->pool = NULL; - old_oid = obj->oid; - oid_set_invalid(&obj->oid); - obj->last_client = TMEM_CLI_ID_NULL; - atomic_dec_and_assert(global_obj_count); - rb_erase(&obj->rb_tree_node, &pool->obj_rb_root[oid_hash(&old_oid)]); - spin_unlock(&obj->obj_spinlock); - tmem_free(obj, pool); -} - -static int obj_rb_insert(struct rb_root *root, struct tmem_object_root *obj) -{ - struct rb_node **new, *parent = NULL; - struct tmem_object_root *this; - - ASSERT(obj->pool); - ASSERT_WRITELOCK(&obj->pool->pool_rwlock); - - new = &(root->rb_node); - while ( *new ) - { - this = container_of(*new, struct tmem_object_root, rb_tree_node); - parent = *new; - switch ( oid_compare(&this->oid, &obj->oid) ) - { - case 0: - return 0; - case -1: - new = &((*new)->rb_left); - break; - case 1: - new = &((*new)->rb_right); - break; - } - } - rb_link_node(&obj->rb_tree_node, parent, new); - rb_insert_color(&obj->rb_tree_node, root); - return 1; -} - -/* - * Allocate, initialize, and insert an tmem_object_root - * (should be called only if find failed). - */ -static struct tmem_object_root * obj_alloc(struct tmem_pool *pool, - struct xen_tmem_oid *oidp) -{ - struct tmem_object_root *obj; - - ASSERT(pool != NULL); - if ( (obj = tmem_malloc(sizeof(struct tmem_object_root), pool)) == NULL ) - return NULL; - pool->obj_count++; - if (pool->obj_count > pool->obj_count_max) - pool->obj_count_max = pool->obj_count; - atomic_inc_and_max(global_obj_count); - radix_tree_init(&obj->tree_root); - radix_tree_set_alloc_callbacks(&obj->tree_root, rtn_alloc, rtn_free, obj); - spin_lock_init(&obj->obj_spinlock); - obj->pool = pool; - obj->oid = *oidp; - obj->objnode_count = 0; - obj->pgp_count = 0; - obj->last_client = TMEM_CLI_ID_NULL; - return obj; -} - -/* Free an object after destroying any pgps in it. */ -static void obj_destroy(struct tmem_object_root *obj) -{ - ASSERT_WRITELOCK(&obj->pool->pool_rwlock); - radix_tree_destroy(&obj->tree_root, pgp_destroy); - obj_free(obj); -} - -/* Destroys all objs in a pool, or only if obj->last_client matches cli_id. */ -static void pool_destroy_objs(struct tmem_pool *pool, domid_t cli_id) -{ - struct rb_node *node; - struct tmem_object_root *obj; - int i; - - write_lock(&pool->pool_rwlock); - pool->is_dying = 1; - for (i = 0; i < OBJ_HASH_BUCKETS; i++) - { - node = rb_first(&pool->obj_rb_root[i]); - while ( node != NULL ) - { - obj = container_of(node, struct tmem_object_root, rb_tree_node); - spin_lock(&obj->obj_spinlock); - node = rb_next(node); - if ( obj->last_client == cli_id ) - obj_destroy(obj); - else - spin_unlock(&obj->obj_spinlock); - } - } - write_unlock(&pool->pool_rwlock); -} - - -/************ POOL MANIPULATION ROUTINES ******************************/ - -static struct tmem_pool * pool_alloc(void) -{ - struct tmem_pool *pool; - int i; - - if ( (pool = xzalloc(struct tmem_pool)) == NULL ) - return NULL; - for (i = 0; i < OBJ_HASH_BUCKETS; i++) - pool->obj_rb_root[i] = RB_ROOT; - INIT_LIST_HEAD(&pool->persistent_page_list); - rwlock_init(&pool->pool_rwlock); - return pool; -} - -static void pool_free(struct tmem_pool *pool) -{ - pool->client = NULL; - xfree(pool); -} - -/* - * Register new_client as a user of this shared pool and return 0 on succ. 
- */ -static int shared_pool_join(struct tmem_pool *pool, struct client *new_client) -{ - struct share_list *sl; - ASSERT(is_shared(pool)); - - if ( (sl = tmem_malloc(sizeof(struct share_list), NULL)) == NULL ) - return -1; - sl->client = new_client; - list_add_tail(&sl->share_list, &pool->share_list); - if ( new_client->cli_id != pool->client->cli_id ) - tmem_client_info("adding new %s %d to shared pool owned by %s %d\n", - tmem_client_str, new_client->cli_id, tmem_client_str, - pool->client->cli_id); - else if ( pool->shared_count ) - tmem_client_info("inter-guest sharing of shared pool %s by client %d\n", - tmem_client_str, pool->client->cli_id); - ++pool->shared_count; - return 0; -} - -/* Reassign "ownership" of the pool to another client that shares this pool. */ -static void shared_pool_reassign(struct tmem_pool *pool) -{ - struct share_list *sl; - int poolid; - struct client *old_client = pool->client, *new_client; - - ASSERT(is_shared(pool)); - if ( list_empty(&pool->share_list) ) - { - ASSERT(pool->shared_count == 0); - return; - } - old_client->pools[pool->pool_id] = NULL; - sl = list_entry(pool->share_list.next, struct share_list, share_list); - /* - * The sl->client can be old_client if there are multiple shared pools - * within an guest. - */ - pool->client = new_client = sl->client; - for (poolid = 0; poolid < MAX_POOLS_PER_DOMAIN; poolid++) - if (new_client->pools[poolid] == pool) - break; - ASSERT(poolid != MAX_POOLS_PER_DOMAIN); - new_client->eph_count += _atomic_read(pool->pgp_count); - old_client->eph_count -= _atomic_read(pool->pgp_count); - list_splice_init(&old_client->ephemeral_page_list, - &new_client->ephemeral_page_list); - tmem_client_info("reassigned shared pool from %s=%d to %s=%d pool_id=%d\n", - tmem_cli_id_str, old_client->cli_id, tmem_cli_id_str, new_client->cli_id, poolid); - pool->pool_id = poolid; -} - -/* - * Destroy all objects with last_client same as passed cli_id, - * remove pool's cli_id from list of sharers of this pool. - */ -static int shared_pool_quit(struct tmem_pool *pool, domid_t cli_id) -{ - struct share_list *sl; - int s_poolid; - - ASSERT(is_shared(pool)); - ASSERT(pool->client != NULL); - - ASSERT_WRITELOCK(&tmem_rwlock); - pool_destroy_objs(pool, cli_id); - list_for_each_entry(sl,&pool->share_list, share_list) - { - if (sl->client->cli_id != cli_id) - continue; - list_del(&sl->share_list); - tmem_free(sl, pool); - --pool->shared_count; - if (pool->client->cli_id == cli_id) - shared_pool_reassign(pool); - if (pool->shared_count) - return pool->shared_count; - for (s_poolid = 0; s_poolid < MAX_GLOBAL_SHARED_POOLS; s_poolid++) - if ( (tmem_global.shared_pools[s_poolid]) == pool ) - { - tmem_global.shared_pools[s_poolid] = NULL; - break; - } - return 0; - } - tmem_client_warn("tmem: no match unsharing pool, %s=%d\n", - tmem_cli_id_str,pool->client->cli_id); - return -1; -} - -/* Flush all data (owned by cli_id) from a pool and, optionally, free it. */ -static void pool_flush(struct tmem_pool *pool, domid_t cli_id) -{ - ASSERT(pool != NULL); - if ( (is_shared(pool)) && (shared_pool_quit(pool,cli_id) > 0) ) - { - tmem_client_warn("tmem: %s=%d no longer using shared pool %d owned by %s=%d\n", - tmem_cli_id_str, cli_id, pool->pool_id, tmem_cli_id_str,pool->client->cli_id); - return; - } - tmem_client_info("Destroying %s-%s tmem pool %s=%d pool_id=%d\n", - is_persistent(pool) ? "persistent" : "ephemeral" , - is_shared(pool) ? 
"shared" : "private", - tmem_cli_id_str, pool->client->cli_id, pool->pool_id); - if ( pool->client->info.flags.u.migrating ) - { - tmem_client_warn("can't destroy pool while %s is live-migrating\n", - tmem_client_str); - return; - } - pool_destroy_objs(pool, TMEM_CLI_ID_NULL); - pool->client->pools[pool->pool_id] = NULL; - pool_free(pool); -} - -/************ CLIENT MANIPULATION OPERATIONS **************************/ - -struct client *client_create(domid_t cli_id) -{ - struct client *client = xzalloc(struct client); - int i, shift; - char name[5]; - struct domain *d; - - tmem_client_info("tmem: initializing tmem capability for %s=%d...", - tmem_cli_id_str, cli_id); - if ( client == NULL ) - { - tmem_client_err("failed... out of memory\n"); - goto fail; - } - - for (i = 0, shift = 12; i < 4; shift -=4, i++) - name[i] = (((unsigned short)cli_id >> shift) & 0xf) + '0'; - name[4] = '\0'; - client->persistent_pool = xmem_pool_create(name, tmem_persistent_pool_page_get, - tmem_persistent_pool_page_put, PAGE_SIZE, 0, PAGE_SIZE); - if ( client->persistent_pool == NULL ) - { - tmem_client_err("failed... can't alloc persistent pool\n"); - goto fail; - } - - d = rcu_lock_domain_by_id(cli_id); - if ( d == NULL ) { - tmem_client_err("failed... can't set client\n"); - xmem_pool_destroy(client->persistent_pool); - goto fail; - } - if ( !d->is_dying ) { - d->tmem_client = client; - client->domain = d; - } - rcu_unlock_domain(d); - - client->cli_id = cli_id; - client->info.version = TMEM_SPEC_VERSION; - client->info.maxpools = MAX_POOLS_PER_DOMAIN; - client->info.flags.u.compress = tmem_compression_enabled(); - for ( i = 0; i < MAX_GLOBAL_SHARED_POOLS; i++) - client->shared_auth_uuid[i][0] = - client->shared_auth_uuid[i][1] = -1L; - list_add_tail(&client->client_list, &tmem_global.client_list); - INIT_LIST_HEAD(&client->ephemeral_page_list); - INIT_LIST_HEAD(&client->persistent_invalidated_list); - tmem_client_info("ok\n"); - return client; - - fail: - xfree(client); - return NULL; -} - -static void client_free(struct client *client) -{ - list_del(&client->client_list); - xmem_pool_destroy(client->persistent_pool); - xfree(client); -} - -/* Flush all data from a client and, optionally, free it. 
*/ -static void client_flush(struct client *client) -{ - int i; - struct tmem_pool *pool; - - for (i = 0; i < MAX_POOLS_PER_DOMAIN; i++) - { - if ( (pool = client->pools[i]) == NULL ) - continue; - pool_flush(pool, client->cli_id); - client->pools[i] = NULL; - client->info.nr_pools--; - } - client_free(client); -} - -static bool client_over_quota(const struct client *client) -{ - int total = _atomic_read(tmem_global.client_weight_total); - - ASSERT(client != NULL); - if ( (total == 0) || (client->info.weight == 0) || - (client->eph_count == 0) ) - return false; - - return (((tmem_global.eph_count * 100L) / client->eph_count) > - ((total * 100L) / client->info.weight)); -} - -/************ MEMORY REVOCATION ROUTINES *******************************/ - -static bool tmem_try_to_evict_pgp(struct tmem_page_descriptor *pgp, - bool *hold_pool_rwlock) -{ - struct tmem_object_root *obj = pgp->us.obj; - struct tmem_pool *pool = obj->pool; - - if ( pool->is_dying ) - return false; - if ( spin_trylock(&obj->obj_spinlock) ) - { - if ( obj->pgp_count > 1 ) - return true; - if ( write_trylock(&pool->pool_rwlock) ) - { - *hold_pool_rwlock = 1; - return true; - } - spin_unlock(&obj->obj_spinlock); - } - return false; -} - -int tmem_evict(void) -{ - struct client *client = current->domain->tmem_client; - struct tmem_page_descriptor *pgp = NULL, *pgp_del; - struct tmem_object_root *obj; - struct tmem_pool *pool; - int ret = 0; - bool hold_pool_rwlock = false; - - tmem_stats.evict_attempts++; - spin_lock(&eph_lists_spinlock); - if ( (client != NULL) && client_over_quota(client) && - !list_empty(&client->ephemeral_page_list) ) - { - list_for_each_entry(pgp, &client->ephemeral_page_list, us.client_eph_pages) - if ( tmem_try_to_evict_pgp(pgp, &hold_pool_rwlock) ) - goto found; - } - else if ( !list_empty(&tmem_global.ephemeral_page_list) ) - { - list_for_each_entry(pgp, &tmem_global.ephemeral_page_list, global_eph_pages) - if ( tmem_try_to_evict_pgp(pgp, &hold_pool_rwlock) ) - { - client = pgp->us.obj->pool->client; - goto found; - } - } - /* Global_ephemeral_page_list is empty, so we bail out. */ - spin_unlock(&eph_lists_spinlock); - goto out; - -found: - /* Delist. */ - list_del_init(&pgp->us.client_eph_pages); - client->eph_count--; - list_del_init(&pgp->global_eph_pages); - tmem_global.eph_count--; - ASSERT(tmem_global.eph_count >= 0); - ASSERT(client->eph_count >= 0); - spin_unlock(&eph_lists_spinlock); - - ASSERT(pgp != NULL); - obj = pgp->us.obj; - ASSERT(obj != NULL); - ASSERT(obj->pool != NULL); - pool = obj->pool; - - ASSERT_SPINLOCK(&obj->obj_spinlock); - pgp_del = pgp_delete_from_obj(obj, pgp->index); - ASSERT(pgp_del == pgp); - - /* pgp already delist, so call pgp_free directly. */ - pgp_free(pgp); - if ( obj->pgp_count == 0 ) - { - ASSERT_WRITELOCK(&pool->pool_rwlock); - obj_free(obj); - } - else - spin_unlock(&obj->obj_spinlock); - if ( hold_pool_rwlock ) - write_unlock(&pool->pool_rwlock); - tmem_stats.evicted_pgs++; - ret = 1; -out: - return ret; -} - - -/* - * Under certain conditions (e.g. if each client is putting pages for exactly - * one object), once locks are held, freeing up memory may - * result in livelocks and very long "put" times, so we try to ensure there - * is a minimum amount of memory (1MB) available BEFORE any data structure - * locks are held. 
- */ -static inline bool tmem_ensure_avail_pages(void) -{ - int failed_evict = 10; - unsigned long free_mem; - - do { - free_mem = (tmem_page_list_pages + total_free_pages()) - >> (20 - PAGE_SHIFT); - if ( free_mem ) - return true; - if ( !tmem_evict() ) - failed_evict--; - } while ( failed_evict > 0 ); - - return false; -} - -/************ TMEM CORE OPERATIONS ************************************/ - -static int do_tmem_put_compress(struct tmem_page_descriptor *pgp, xen_pfn_t cmfn, - tmem_cli_va_param_t clibuf) -{ - void *dst, *p; - size_t size; - int ret = 0; - - ASSERT(pgp != NULL); - ASSERT(pgp->us.obj != NULL); - ASSERT_SPINLOCK(&pgp->us.obj->obj_spinlock); - ASSERT(pgp->us.obj->pool != NULL); - ASSERT(pgp->us.obj->pool->client != NULL); - - if ( pgp->pfp != NULL ) - pgp_free_data(pgp, pgp->us.obj->pool); - ret = tmem_compress_from_client(cmfn, &dst, &size, clibuf); - if ( ret <= 0 ) - goto out; - else if ( (size == 0) || (size >= tmem_mempool_maxalloc) ) { - ret = 0; - goto out; - } else if ( (p = tmem_malloc(size,pgp->us.obj->pool)) == NULL ) { - ret = -ENOMEM; - goto out; - } else { - memcpy(p,dst,size); - pgp->cdata = p; - } - pgp->size = size; - pgp->us.obj->pool->client->compressed_pages++; - pgp->us.obj->pool->client->compressed_sum_size += size; - ret = 1; - -out: - return ret; -} - -static int do_tmem_dup_put(struct tmem_page_descriptor *pgp, xen_pfn_t cmfn, - tmem_cli_va_param_t clibuf) -{ - struct tmem_pool *pool; - struct tmem_object_root *obj; - struct client *client; - struct tmem_page_descriptor *pgpfound = NULL; - int ret; - - ASSERT(pgp != NULL); - ASSERT(pgp->pfp != NULL); - ASSERT(pgp->size != -1); - obj = pgp->us.obj; - ASSERT_SPINLOCK(&obj->obj_spinlock); - ASSERT(obj != NULL); - pool = obj->pool; - ASSERT(pool != NULL); - client = pool->client; - if ( client->info.flags.u.migrating ) - goto failed_dup; /* No dups allowed when migrating. */ - /* Can we successfully manipulate pgp to change out the data? */ - if ( client->info.flags.u.compress && pgp->size != 0 ) - { - ret = do_tmem_put_compress(pgp, cmfn, clibuf); - if ( ret == 1 ) - goto done; - else if ( ret == 0 ) - goto copy_uncompressed; - else if ( ret == -ENOMEM ) - goto failed_dup; - else if ( ret == -EFAULT ) - goto bad_copy; - } - -copy_uncompressed: - if ( pgp->pfp ) - pgp_free_data(pgp, pool); - if ( ( pgp->pfp = tmem_alloc_page(pool) ) == NULL ) - goto failed_dup; - pgp->size = 0; - ret = tmem_copy_from_client(pgp->pfp, cmfn, tmem_cli_buf_null); - if ( ret < 0 ) - goto bad_copy; - -done: - /* Successfully replaced data, clean up and return success. */ - if ( is_shared(pool) ) - obj->last_client = client->cli_id; - spin_unlock(&obj->obj_spinlock); - pool->dup_puts_replaced++; - pool->good_puts++; - if ( is_persistent(pool) ) - client->succ_pers_puts++; - return 1; - -bad_copy: - tmem_stats.failed_copies++; - goto cleanup; - -failed_dup: - /* - * Couldn't change out the data, flush the old data and return - * -ENOSPC instead of -ENOMEM to differentiate failed _dup_ put. 
- */ - ret = -ENOSPC; -cleanup: - pgpfound = pgp_delete_from_obj(obj, pgp->index); - ASSERT(pgpfound == pgp); - pgp_delist_free(pgpfound); - if ( obj->pgp_count == 0 ) - { - write_lock(&pool->pool_rwlock); - obj_free(obj); - write_unlock(&pool->pool_rwlock); - } else { - spin_unlock(&obj->obj_spinlock); - } - pool->dup_puts_flushed++; - return ret; -} - -static int do_tmem_put(struct tmem_pool *pool, - struct xen_tmem_oid *oidp, uint32_t index, - xen_pfn_t cmfn, tmem_cli_va_param_t clibuf) -{ - struct tmem_object_root *obj = NULL; - struct tmem_page_descriptor *pgp = NULL; - struct client *client; - int ret, newobj = 0; - - ASSERT(pool != NULL); - client = pool->client; - ASSERT(client != NULL); - ret = client->info.flags.u.frozen ? -EFROZEN : -ENOMEM; - pool->puts++; - -refind: - /* Does page already exist (dup)? if so, handle specially. */ - if ( (obj = obj_find(pool, oidp)) != NULL ) - { - if ((pgp = pgp_lookup_in_obj(obj, index)) != NULL) - { - return do_tmem_dup_put(pgp, cmfn, clibuf); - } - else - { - /* No puts allowed into a frozen pool (except dup puts). */ - if ( client->info.flags.u.frozen ) - goto unlock_obj; - } - } - else - { - /* No puts allowed into a frozen pool (except dup puts). */ - if ( client->info.flags.u.frozen ) - return ret; - if ( (obj = obj_alloc(pool, oidp)) == NULL ) - return -ENOMEM; - - write_lock(&pool->pool_rwlock); - /* - * Parallel callers may already allocated obj and inserted to obj_rb_root - * before us. - */ - if ( !obj_rb_insert(&pool->obj_rb_root[oid_hash(oidp)], obj) ) - { - tmem_free(obj, pool); - write_unlock(&pool->pool_rwlock); - goto refind; - } - - spin_lock(&obj->obj_spinlock); - newobj = 1; - write_unlock(&pool->pool_rwlock); - } - - /* When arrive here, we have a spinlocked obj for use. */ - ASSERT_SPINLOCK(&obj->obj_spinlock); - if ( (pgp = pgp_alloc(obj)) == NULL ) - goto unlock_obj; - - ret = pgp_add_to_obj(obj, index, pgp); - if ( ret == -ENOMEM ) - /* Warning: may result in partially built radix tree ("stump"). */ - goto free_pgp; - - pgp->index = index; - pgp->size = 0; - - if ( client->info.flags.u.compress ) - { - ASSERT(pgp->pfp == NULL); - ret = do_tmem_put_compress(pgp, cmfn, clibuf); - if ( ret == 1 ) - goto insert_page; - if ( ret == -ENOMEM ) - { - client->compress_nomem++; - goto del_pgp_from_obj; - } - if ( ret == 0 ) - { - client->compress_poor++; - goto copy_uncompressed; - } - if ( ret == -EFAULT ) - goto bad_copy; - } - -copy_uncompressed: - if ( ( pgp->pfp = tmem_alloc_page(pool) ) == NULL ) - { - ret = -ENOMEM; - goto del_pgp_from_obj; - } - ret = tmem_copy_from_client(pgp->pfp, cmfn, clibuf); - if ( ret < 0 ) - goto bad_copy; - -insert_page: - if ( !is_persistent(pool) ) - { - spin_lock(&eph_lists_spinlock); - list_add_tail(&pgp->global_eph_pages, &tmem_global.ephemeral_page_list); - if (++tmem_global.eph_count > tmem_stats.global_eph_count_max) - tmem_stats.global_eph_count_max = tmem_global.eph_count; - list_add_tail(&pgp->us.client_eph_pages, - &client->ephemeral_page_list); - if (++client->eph_count > client->eph_count_max) - client->eph_count_max = client->eph_count; - spin_unlock(&eph_lists_spinlock); - } - else - { /* is_persistent. */ - spin_lock(&pers_lists_spinlock); - list_add_tail(&pgp->us.pool_pers_pages, - &pool->persistent_page_list); - spin_unlock(&pers_lists_spinlock); - } - - if ( is_shared(pool) ) - obj->last_client = client->cli_id; - - /* Free the obj spinlock. 
*/ - spin_unlock(&obj->obj_spinlock); - pool->good_puts++; - - if ( is_persistent(pool) ) - client->succ_pers_puts++; - else - tmem_stats.tot_good_eph_puts++; - return 1; - -bad_copy: - tmem_stats.failed_copies++; - -del_pgp_from_obj: - ASSERT((obj != NULL) && (pgp != NULL) && (pgp->index != -1)); - pgp_delete_from_obj(obj, pgp->index); - -free_pgp: - pgp_free(pgp); -unlock_obj: - if ( newobj ) - { - write_lock(&pool->pool_rwlock); - obj_free(obj); - write_unlock(&pool->pool_rwlock); - } - else - { - spin_unlock(&obj->obj_spinlock); - } - pool->no_mem_puts++; - return ret; -} - -static int do_tmem_get(struct tmem_pool *pool, - struct xen_tmem_oid *oidp, uint32_t index, - xen_pfn_t cmfn, tmem_cli_va_param_t clibuf) -{ - struct tmem_object_root *obj; - struct tmem_page_descriptor *pgp; - struct client *client = pool->client; - int rc; - - if ( !_atomic_read(pool->pgp_count) ) - return -EEMPTY; - - pool->gets++; - obj = obj_find(pool,oidp); - if ( obj == NULL ) - return 0; - - ASSERT_SPINLOCK(&obj->obj_spinlock); - if (is_shared(pool) || is_persistent(pool) ) - pgp = pgp_lookup_in_obj(obj, index); - else - pgp = pgp_delete_from_obj(obj, index); - if ( pgp == NULL ) - { - spin_unlock(&obj->obj_spinlock); - return 0; - } - ASSERT(pgp->size != -1); - if ( pgp->size != 0 ) - { - rc = tmem_decompress_to_client(cmfn, pgp->cdata, pgp->size, clibuf); - } - else - rc = tmem_copy_to_client(cmfn, pgp->pfp, clibuf); - if ( rc <= 0 ) - goto bad_copy; - - if ( !is_persistent(pool) ) - { - if ( !is_shared(pool) ) - { - pgp_delist_free(pgp); - if ( obj->pgp_count == 0 ) - { - write_lock(&pool->pool_rwlock); - obj_free(obj); - obj = NULL; - write_unlock(&pool->pool_rwlock); - } - } else { - spin_lock(&eph_lists_spinlock); - list_del(&pgp->global_eph_pages); - list_add_tail(&pgp->global_eph_pages,&tmem_global.ephemeral_page_list); - list_del(&pgp->us.client_eph_pages); - list_add_tail(&pgp->us.client_eph_pages,&client->ephemeral_page_list); - spin_unlock(&eph_lists_spinlock); - obj->last_client = current->domain->domain_id; - } - } - if ( obj != NULL ) - { - spin_unlock(&obj->obj_spinlock); - } - pool->found_gets++; - if ( is_persistent(pool) ) - client->succ_pers_gets++; - else - client->succ_eph_gets++; - return 1; - -bad_copy: - spin_unlock(&obj->obj_spinlock); - tmem_stats.failed_copies++; - return rc; -} - -static int do_tmem_flush_page(struct tmem_pool *pool, - struct xen_tmem_oid *oidp, uint32_t index) -{ - struct tmem_object_root *obj; - struct tmem_page_descriptor *pgp; - - pool->flushs++; - obj = obj_find(pool,oidp); - if ( obj == NULL ) - goto out; - pgp = pgp_delete_from_obj(obj, index); - if ( pgp == NULL ) - { - spin_unlock(&obj->obj_spinlock); - goto out; - } - pgp_delist_free(pgp); - if ( obj->pgp_count == 0 ) - { - write_lock(&pool->pool_rwlock); - obj_free(obj); - write_unlock(&pool->pool_rwlock); - } else { - spin_unlock(&obj->obj_spinlock); - } - pool->flushs_found++; - -out: - if ( pool->client->info.flags.u.frozen ) - return -EFROZEN; - else - return 1; -} - -static int do_tmem_flush_object(struct tmem_pool *pool, - struct xen_tmem_oid *oidp) -{ - struct tmem_object_root *obj; - - pool->flush_objs++; - obj = obj_find(pool,oidp); - if ( obj == NULL ) - goto out; - write_lock(&pool->pool_rwlock); - obj_destroy(obj); - pool->flush_objs_found++; - write_unlock(&pool->pool_rwlock); - -out: - if ( pool->client->info.flags.u.frozen ) - return -EFROZEN; - else - return 1; -} - -static int do_tmem_destroy_pool(uint32_t pool_id) -{ - struct client *client = current->domain->tmem_client; - struct 
tmem_pool *pool; - - if ( pool_id >= MAX_POOLS_PER_DOMAIN ) - return 0; - if ( (pool = client->pools[pool_id]) == NULL ) - return 0; - client->pools[pool_id] = NULL; - pool_flush(pool, client->cli_id); - client->info.nr_pools--; - return 1; -} - -int do_tmem_new_pool(domid_t this_cli_id, - uint32_t d_poolid, uint32_t flags, - uint64_t uuid_lo, uint64_t uuid_hi) -{ - struct client *client; - domid_t cli_id; - int persistent = flags & TMEM_POOL_PERSIST; - int shared = flags & TMEM_POOL_SHARED; - int pagebits = (flags >> TMEM_POOL_PAGESIZE_SHIFT) - & TMEM_POOL_PAGESIZE_MASK; - int specversion = (flags >> TMEM_POOL_VERSION_SHIFT) - & TMEM_POOL_VERSION_MASK; - struct tmem_pool *pool, *shpool; - int i, first_unused_s_poolid; - - if ( this_cli_id == TMEM_CLI_ID_NULL ) - cli_id = current->domain->domain_id; - else - cli_id = this_cli_id; - tmem_client_info("tmem: allocating %s-%s tmem pool for %s=%d...", - persistent ? "persistent" : "ephemeral" , - shared ? "shared" : "private", tmem_cli_id_str, cli_id); - if ( specversion != TMEM_SPEC_VERSION ) - { - tmem_client_err("failed... unsupported spec version\n"); - return -EPERM; - } - if ( shared && persistent ) - { - tmem_client_err("failed... unable to create a shared-persistant pool\n"); - return -EPERM; - } - if ( pagebits != (PAGE_SHIFT - 12) ) - { - tmem_client_err("failed... unsupported pagesize %d\n", - 1 << (pagebits + 12)); - return -EPERM; - } - if ( flags & TMEM_POOL_PRECOMPRESSED ) - { - tmem_client_err("failed... precompression flag set but unsupported\n"); - return -EPERM; - } - if ( flags & TMEM_POOL_RESERVED_BITS ) - { - tmem_client_err("failed... reserved bits must be zero\n"); - return -EPERM; - } - if ( this_cli_id != TMEM_CLI_ID_NULL ) - { - if ( (client = tmem_client_from_cli_id(this_cli_id)) == NULL - || d_poolid >= MAX_POOLS_PER_DOMAIN - || client->pools[d_poolid] != NULL ) - return -EPERM; - } - else - { - client = current->domain->tmem_client; - ASSERT(client != NULL); - for ( d_poolid = 0; d_poolid < MAX_POOLS_PER_DOMAIN; d_poolid++ ) - if ( client->pools[d_poolid] == NULL ) - break; - if ( d_poolid >= MAX_POOLS_PER_DOMAIN ) - { - tmem_client_err("failed... no more pool slots available for this %s\n", - tmem_client_str); - return -EPERM; - } - } - - if ( (pool = pool_alloc()) == NULL ) - { - tmem_client_err("failed... out of memory\n"); - return -ENOMEM; - } - client->pools[d_poolid] = pool; - pool->client = client; - pool->pool_id = d_poolid; - pool->shared = shared; - pool->persistent = persistent; - pool->uuid[0] = uuid_lo; - pool->uuid[1] = uuid_hi; - - /* - * Already created a pool when arrived here, but need some special process - * for shared pool. - */ - if ( shared ) - { - if ( uuid_lo == -1L && uuid_hi == -1L ) - { - tmem_client_info("Invalid uuid, create non shared pool instead!\n"); - pool->shared = 0; - goto out; - } - if ( !tmem_global.shared_auth ) - { - for ( i = 0; i < MAX_GLOBAL_SHARED_POOLS; i++) - if ( (client->shared_auth_uuid[i][0] == uuid_lo) && - (client->shared_auth_uuid[i][1] == uuid_hi) ) - break; - if ( i == MAX_GLOBAL_SHARED_POOLS ) - { - tmem_client_info("Shared auth failed, create non shared pool instead!\n"); - pool->shared = 0; - goto out; - } - } - - /* - * Authorize okay, match a global shared pool or use the newly allocated - * one. 
- */ - first_unused_s_poolid = MAX_GLOBAL_SHARED_POOLS; - for ( i = 0; i < MAX_GLOBAL_SHARED_POOLS; i++ ) - { - if ( (shpool = tmem_global.shared_pools[i]) != NULL ) - { - if ( shpool->uuid[0] == uuid_lo && shpool->uuid[1] == uuid_hi ) - { - /* Succ to match a global shared pool. */ - tmem_client_info("(matches shared pool uuid=%"PRIx64".%"PRIx64") pool_id=%d\n", - uuid_hi, uuid_lo, d_poolid); - client->pools[d_poolid] = shpool; - if ( !shared_pool_join(shpool, client) ) - { - pool_free(pool); - goto out; - } - else - goto fail; - } - } - else - { - if ( first_unused_s_poolid == MAX_GLOBAL_SHARED_POOLS ) - first_unused_s_poolid = i; - } - } - - /* Failed to find a global shared pool slot. */ - if ( first_unused_s_poolid == MAX_GLOBAL_SHARED_POOLS ) - { - tmem_client_warn("tmem: failed... no global shared pool slots available\n"); - goto fail; - } - /* Add pool to global shared pool. */ - else - { - INIT_LIST_HEAD(&pool->share_list); - pool->shared_count = 0; - if ( shared_pool_join(pool, client) ) - goto fail; - tmem_global.shared_pools[first_unused_s_poolid] = pool; - } - } - -out: - tmem_client_info("pool_id=%d\n", d_poolid); - client->info.nr_pools++; - return d_poolid; - -fail: - pool_free(pool); - return -EPERM; -} - -/************ TMEM CONTROL OPERATIONS ************************************/ - -int tmemc_shared_pool_auth(domid_t cli_id, uint64_t uuid_lo, - uint64_t uuid_hi, bool auth) -{ - struct client *client; - int i, free = -1; - - if ( cli_id == TMEM_CLI_ID_NULL ) - { - tmem_global.shared_auth = auth; - return 1; - } - client = tmem_client_from_cli_id(cli_id); - if ( client == NULL ) - return -EINVAL; - - for ( i = 0; i < MAX_GLOBAL_SHARED_POOLS; i++) - { - if ( auth == 0 ) - { - if ( (client->shared_auth_uuid[i][0] == uuid_lo) && - (client->shared_auth_uuid[i][1] == uuid_hi) ) - { - client->shared_auth_uuid[i][0] = -1L; - client->shared_auth_uuid[i][1] = -1L; - return 1; - } - } - else - { - if ( (client->shared_auth_uuid[i][0] == -1L) && - (client->shared_auth_uuid[i][1] == -1L) ) - { - free = i; - break; - } - } - } - if ( auth == 0 ) - return 0; - else if ( free == -1) - return -ENOMEM; - else - { - client->shared_auth_uuid[free][0] = uuid_lo; - client->shared_auth_uuid[free][1] = uuid_hi; - return 1; - } -} - -static int tmemc_save_subop(int cli_id, uint32_t pool_id, - uint32_t subop, tmem_cli_va_param_t buf, uint32_t arg) -{ - struct client *client = tmem_client_from_cli_id(cli_id); - uint32_t p; - struct tmem_page_descriptor *pgp, *pgp2; - int rc = -ENOENT; - - switch(subop) - { - case XEN_SYSCTL_TMEM_OP_SAVE_BEGIN: - if ( client == NULL ) - break; - for (p = 0; p < MAX_POOLS_PER_DOMAIN; p++) - if ( client->pools[p] != NULL ) - break; - - if ( p == MAX_POOLS_PER_DOMAIN ) - break; - - client->was_frozen = client->info.flags.u.frozen; - client->info.flags.u.frozen = 1; - if ( arg != 0 ) - client->info.flags.u.migrating = 1; - rc = 0; - break; - case XEN_SYSCTL_TMEM_OP_RESTORE_BEGIN: - if ( client == NULL ) - rc = client_create(cli_id) ? 
0 : -ENOMEM; - else - rc = -EEXIST; - break; - case XEN_SYSCTL_TMEM_OP_SAVE_END: - if ( client == NULL ) - break; - client->info.flags.u.migrating = 0; - if ( !list_empty(&client->persistent_invalidated_list) ) - list_for_each_entry_safe(pgp,pgp2, - &client->persistent_invalidated_list, client_inv_pages) - __pgp_free(pgp, client->pools[pgp->pool_id]); - client->info.flags.u.frozen = client->was_frozen; - rc = 0; - break; - } - return rc; -} - -static int tmemc_save_get_next_page(int cli_id, uint32_t pool_id, - tmem_cli_va_param_t buf, uint32_t bufsize) -{ - struct client *client = tmem_client_from_cli_id(cli_id); - struct tmem_pool *pool = (client == NULL || pool_id >= MAX_POOLS_PER_DOMAIN) - ? NULL : client->pools[pool_id]; - struct tmem_page_descriptor *pgp; - struct xen_tmem_oid *oid; - int ret = 0; - struct tmem_handle h; - - if ( pool == NULL || !is_persistent(pool) ) - return -1; - - if ( bufsize < PAGE_SIZE + sizeof(struct tmem_handle) ) - return -ENOMEM; - - spin_lock(&pers_lists_spinlock); - if ( list_empty(&pool->persistent_page_list) ) - { - ret = -1; - goto out; - } - /* Note: pool->cur_pgp is the pgp last returned by get_next_page. */ - if ( pool->cur_pgp == NULL ) - { - /* Process the first one. */ - pool->cur_pgp = pgp = list_entry((&pool->persistent_page_list)->next, - struct tmem_page_descriptor,us.pool_pers_pages); - } else if ( list_is_last(&pool->cur_pgp->us.pool_pers_pages, - &pool->persistent_page_list) ) - { - /* Already processed the last one in the list. */ - ret = -1; - goto out; - } - pgp = list_entry((&pool->cur_pgp->us.pool_pers_pages)->next, - struct tmem_page_descriptor,us.pool_pers_pages); - pool->cur_pgp = pgp; - oid = &pgp->us.obj->oid; - h.pool_id = pool_id; - BUILD_BUG_ON(sizeof(h.oid) != sizeof(*oid)); - memcpy(&(h.oid), oid, sizeof(h.oid)); - h.index = pgp->index; - if ( copy_to_guest(guest_handle_cast(buf, void), &h, 1) ) - { - ret = -EFAULT; - goto out; - } - guest_handle_add_offset(buf, sizeof(h)); - ret = do_tmem_get(pool, oid, pgp->index, 0, buf); - -out: - spin_unlock(&pers_lists_spinlock); - return ret; -} - -static int tmemc_save_get_next_inv(int cli_id, tmem_cli_va_param_t buf, - uint32_t bufsize) -{ - struct client *client = tmem_client_from_cli_id(cli_id); - struct tmem_page_descriptor *pgp; - struct tmem_handle h; - int ret = 0; - - if ( client == NULL ) - return 0; - if ( bufsize < sizeof(struct tmem_handle) ) - return 0; - spin_lock(&pers_lists_spinlock); - if ( list_empty(&client->persistent_invalidated_list) ) - goto out; - if ( client->cur_pgp == NULL ) - { - pgp = list_entry((&client->persistent_invalidated_list)->next, - struct tmem_page_descriptor,client_inv_pages); - client->cur_pgp = pgp; - } else if ( list_is_last(&client->cur_pgp->client_inv_pages, - &client->persistent_invalidated_list) ) - { - client->cur_pgp = NULL; - ret = 0; - goto out; - } else { - pgp = list_entry((&client->cur_pgp->client_inv_pages)->next, - struct tmem_page_descriptor,client_inv_pages); - client->cur_pgp = pgp; - } - h.pool_id = pgp->pool_id; - BUILD_BUG_ON(sizeof(h.oid) != sizeof(pgp->inv_oid)); - memcpy(&(h.oid), &(pgp->inv_oid), sizeof(h.oid)); - h.index = pgp->index; - ret = 1; - if ( copy_to_guest(guest_handle_cast(buf, void), &h, 1) ) - ret = -EFAULT; -out: - spin_unlock(&pers_lists_spinlock); - return ret; -} - -static int tmemc_restore_put_page(int cli_id, uint32_t pool_id, - struct xen_tmem_oid *oidp, - uint32_t index, tmem_cli_va_param_t buf, - uint32_t bufsize) -{ - struct client *client = tmem_client_from_cli_id(cli_id); - struct tmem_pool 
*pool = (client == NULL || pool_id >= MAX_POOLS_PER_DOMAIN) - ? NULL : client->pools[pool_id]; - - if ( pool == NULL ) - return -1; - if (bufsize != PAGE_SIZE) { - tmem_client_err("tmem: %s: invalid parameter bufsize(%d) != (%ld)\n", - __func__, bufsize, PAGE_SIZE); - return -EINVAL; - } - return do_tmem_put(pool, oidp, index, 0, buf); -} - -static int tmemc_restore_flush_page(int cli_id, uint32_t pool_id, - struct xen_tmem_oid *oidp, - uint32_t index) -{ - struct client *client = tmem_client_from_cli_id(cli_id); - struct tmem_pool *pool = (client == NULL || pool_id >= MAX_POOLS_PER_DOMAIN) - ? NULL : client->pools[pool_id]; - - if ( pool == NULL ) - return -1; - return do_tmem_flush_page(pool,oidp,index); -} - -int do_tmem_control(struct xen_sysctl_tmem_op *op) -{ - int ret; - uint32_t pool_id = op->pool_id; - uint32_t cmd = op->cmd; - struct xen_tmem_oid *oidp = &op->oid; - - ASSERT(rw_is_write_locked(&tmem_rwlock)); - - switch (cmd) - { - case XEN_SYSCTL_TMEM_OP_SAVE_BEGIN: - case XEN_SYSCTL_TMEM_OP_RESTORE_BEGIN: - case XEN_SYSCTL_TMEM_OP_SAVE_END: - ret = tmemc_save_subop(op->cli_id, pool_id, cmd, - guest_handle_cast(op->u.buf, char), op->arg); - break; - case XEN_SYSCTL_TMEM_OP_SAVE_GET_NEXT_PAGE: - ret = tmemc_save_get_next_page(op->cli_id, pool_id, - guest_handle_cast(op->u.buf, char), op->len); - break; - case XEN_SYSCTL_TMEM_OP_SAVE_GET_NEXT_INV: - ret = tmemc_save_get_next_inv(op->cli_id, - guest_handle_cast(op->u.buf, char), op->len); - break; - case XEN_SYSCTL_TMEM_OP_RESTORE_PUT_PAGE: - ret = tmemc_restore_put_page(op->cli_id, pool_id, oidp, op->arg, - guest_handle_cast(op->u.buf, char), op->len); - break; - case XEN_SYSCTL_TMEM_OP_RESTORE_FLUSH_PAGE: - ret = tmemc_restore_flush_page(op->cli_id, pool_id, oidp, op->arg); - break; - default: - ret = -1; - } - - return ret; -} - -/************ EXPORTed FUNCTIONS **************************************/ - -long do_tmem_op(tmem_cli_op_t uops) -{ - struct tmem_op op; - struct client *client = current->domain->tmem_client; - struct tmem_pool *pool = NULL; - struct xen_tmem_oid *oidp; - int rc = 0; - - if ( !tmem_initialized ) - return -ENODEV; - - if ( xsm_tmem_op(XSM_HOOK) ) - return -EPERM; - - tmem_stats.total_tmem_ops++; - - if ( client != NULL && client->domain->is_dying ) - { - tmem_stats.errored_tmem_ops++; - return -ENODEV; - } - - if ( unlikely(tmem_get_tmemop_from_client(&op, uops) != 0) ) - { - tmem_client_err("tmem: can't get tmem struct from %s\n", tmem_client_str); - tmem_stats.errored_tmem_ops++; - return -EFAULT; - } - - /* Acquire write lock for all commands at first. */ - write_lock(&tmem_rwlock); - - switch ( op.cmd ) - { - case TMEM_CONTROL: - case TMEM_RESTORE_NEW: - case TMEM_AUTH: - rc = -EOPNOTSUPP; - break; - - default: - /* - * For other commands, create per-client tmem structure dynamically on - * first use by client. 
- */ - if ( client == NULL ) - { - if ( (client = client_create(current->domain->domain_id)) == NULL ) - { - tmem_client_err("tmem: can't create tmem structure for %s\n", - tmem_client_str); - rc = -ENOMEM; - goto out; - } - } - - if ( op.cmd == TMEM_NEW_POOL || op.cmd == TMEM_DESTROY_POOL ) - { - if ( op.cmd == TMEM_NEW_POOL ) - rc = do_tmem_new_pool(TMEM_CLI_ID_NULL, 0, op.u.creat.flags, - op.u.creat.uuid[0], op.u.creat.uuid[1]); - else - rc = do_tmem_destroy_pool(op.pool_id); - } - else - { - if ( ((uint32_t)op.pool_id >= MAX_POOLS_PER_DOMAIN) || - ((pool = client->pools[op.pool_id]) == NULL) ) - { - tmem_client_err("tmem: operation requested on uncreated pool\n"); - rc = -ENODEV; - goto out; - } - /* Commands that only need read lock. */ - write_unlock(&tmem_rwlock); - read_lock(&tmem_rwlock); - - oidp = &op.u.gen.oid; - switch ( op.cmd ) - { - case TMEM_NEW_POOL: - case TMEM_DESTROY_POOL: - BUG(); /* Done earlier. */ - break; - case TMEM_PUT_PAGE: - if (tmem_ensure_avail_pages()) - rc = do_tmem_put(pool, oidp, op.u.gen.index, op.u.gen.cmfn, - tmem_cli_buf_null); - else - rc = -ENOMEM; - break; - case TMEM_GET_PAGE: - rc = do_tmem_get(pool, oidp, op.u.gen.index, op.u.gen.cmfn, - tmem_cli_buf_null); - break; - case TMEM_FLUSH_PAGE: - rc = do_tmem_flush_page(pool, oidp, op.u.gen.index); - break; - case TMEM_FLUSH_OBJECT: - rc = do_tmem_flush_object(pool, oidp); - break; - default: - tmem_client_warn("tmem: op %d not implemented\n", op.cmd); - rc = -ENOSYS; - break; - } - read_unlock(&tmem_rwlock); - if ( rc < 0 ) - tmem_stats.errored_tmem_ops++; - return rc; - } - break; - - } -out: - write_unlock(&tmem_rwlock); - if ( rc < 0 ) - tmem_stats.errored_tmem_ops++; - return rc; -} - -/* This should be called when the host is destroying a client (domain). */ -void tmem_destroy(void *v) -{ - struct client *client = (struct client *)v; - - if ( client == NULL ) - return; - - if ( !client->domain->is_dying ) - { - printk("tmem: tmem_destroy can only destroy dying client\n"); - return; - } - - write_lock(&tmem_rwlock); - - printk("tmem: flushing tmem pools for %s=%d\n", - tmem_cli_id_str, client->cli_id); - client_flush(client); - - write_unlock(&tmem_rwlock); -} - -#define MAX_EVICTS 10 /* Should be variable or set via XEN_SYSCTL_TMEM_OP_ ?? */ -void *tmem_relinquish_pages(unsigned int order, unsigned int memflags) -{ - struct page_info *pfp; - unsigned long evicts_per_relinq = 0; - int max_evictions = 10; - - if (!tmem_enabled() || !tmem_freeable_pages()) - return NULL; - - tmem_stats.relinq_attempts++; - if ( order > 0 ) - { -#ifndef NDEBUG - printk("tmem_relinquish_page: failing order=%d\n", order); -#endif - return NULL; - } - - while ( (pfp = tmem_page_list_get()) == NULL ) - { - if ( (max_evictions-- <= 0) || !tmem_evict()) - break; - evicts_per_relinq++; - } - if ( evicts_per_relinq > tmem_stats.max_evicts_per_relinq ) - tmem_stats.max_evicts_per_relinq = evicts_per_relinq; - if ( pfp != NULL ) - { - if ( !(memflags & MEMF_tmem) ) - scrub_one_page(pfp); - tmem_stats.relinq_pgs++; - } - - return pfp; -} - -unsigned long tmem_freeable_pages(void) -{ - if ( !tmem_enabled() ) - return 0; - - return tmem_page_list_pages + _atomic_read(freeable_page_count); -} - -/* Called at hypervisor startup. 
*/ -static int __init init_tmem(void) -{ - if ( !tmem_enabled() ) - return 0; - - if ( !tmem_mempool_init() ) - return 0; - - if ( tmem_init() ) - { - printk("tmem: initialized comp=%d\n", tmem_compression_enabled()); - tmem_initialized = 1; - } - else - printk("tmem: initialization FAILED\n"); - - return 0; -} -__initcall(init_tmem); - -/* - * Local variables: - * mode: C - * c-file-style: "BSD" - * c-basic-offset: 4 - * tab-width: 4 - * indent-tabs-mode: nil - * End: - */ diff --git a/xen/common/tmem_control.c b/xen/common/tmem_control.c deleted file mode 100644 index 30bf6fb362..0000000000 --- a/xen/common/tmem_control.c +++ /dev/null @@ -1,560 +0,0 @@ -/* - * Copyright (c) 2016 Oracle and/or its affiliates. All rights reserved. - * - */ - -#include <xen/init.h> -#include <xen/list.h> -#include <xen/radix-tree.h> -#include <xen/rbtree.h> -#include <xen/rwlock.h> -#include <xen/tmem_control.h> -#include <xen/tmem.h> -#include <xen/tmem_xen.h> -#include <public/sysctl.h> - -/************ TMEM CONTROL OPERATIONS ************************************/ - -/* Freeze/thaw all pools belonging to client cli_id (all domains if -1). */ -static int tmemc_freeze_pools(domid_t cli_id, int arg) -{ - struct client *client; - bool freeze = arg == XEN_SYSCTL_TMEM_OP_FREEZE; - bool destroy = arg == XEN_SYSCTL_TMEM_OP_DESTROY; - char *s; - - s = destroy ? "destroyed" : ( freeze ? "frozen" : "thawed" ); - if ( cli_id == TMEM_CLI_ID_NULL ) - { - list_for_each_entry(client,&tmem_global.client_list,client_list) - client->info.flags.u.frozen = freeze; - tmem_client_info("tmem: all pools %s for all %ss\n", s, tmem_client_str); - } - else - { - if ( (client = tmem_client_from_cli_id(cli_id)) == NULL) - return -1; - client->info.flags.u.frozen = freeze; - tmem_client_info("tmem: all pools %s for %s=%d\n", - s, tmem_cli_id_str, cli_id); - } - return 0; -} - -static unsigned long tmem_flush_npages(unsigned long n) -{ - unsigned long avail_pages = 0; - - while ( (avail_pages = tmem_page_list_pages) < n ) - { - if ( !tmem_evict() ) - break; - } - if ( avail_pages ) - { - spin_lock(&tmem_page_list_lock); - while ( !page_list_empty(&tmem_page_list) ) - { - struct page_info *pg = page_list_remove_head(&tmem_page_list); - scrub_one_page(pg); - tmem_page_list_pages--; - free_domheap_page(pg); - } - ASSERT(tmem_page_list_pages == 0); - INIT_PAGE_LIST_HEAD(&tmem_page_list); - spin_unlock(&tmem_page_list_lock); - } - return avail_pages; -} - -static int tmemc_flush_mem(domid_t cli_id, uint32_t kb) -{ - uint32_t npages, flushed_pages, flushed_kb; - - if ( cli_id != TMEM_CLI_ID_NULL ) - { - tmem_client_warn("tmem: %s-specific flush not supported yet, use --all\n", - tmem_client_str); - return -1; - } - /* Convert kb to pages, rounding up if necessary. */ - npages = (kb + ((1 << (PAGE_SHIFT-10))-1)) >> (PAGE_SHIFT-10); - flushed_pages = tmem_flush_npages(npages); - flushed_kb = flushed_pages << (PAGE_SHIFT-10); - return flushed_kb; -} - -/* - * These tmemc_list* routines output lots of stats in a format that is - * intended to be program-parseable, not human-readable. Further, by - * tying each group of stats to a line format indicator (e.g. G= for - * global stats) and each individual stat to a two-letter specifier - * (e.g. Ec:nnnnn in the G= line says there are nnnnn pages in the - * global ephemeral pool), it should allow the stats reported to be - * forward and backwards compatible as tmem evolves. 
- */ -#define BSIZE 1024 - -static int tmemc_list_client(struct client *c, tmem_cli_va_param_t buf, - int off, uint32_t len, bool use_long) -{ - char info[BSIZE]; - int i, n = 0, sum = 0; - struct tmem_pool *p; - bool s; - - n = scnprintf(info,BSIZE,"C=CI:%d,ww:%d,co:%d,fr:%d," - "Tc:%"PRIu64",Ge:%ld,Pp:%ld,Gp:%ld%c", - c->cli_id, c->info.weight, c->info.flags.u.compress, c->info.flags.u.frozen, - c->total_cycles, c->succ_eph_gets, c->succ_pers_puts, c->succ_pers_gets, - use_long ? ',' : '\n'); - if (use_long) - n += scnprintf(info+n,BSIZE-n, - "Ec:%ld,Em:%ld,cp:%ld,cb:%"PRId64",cn:%ld,cm:%ld\n", - c->eph_count, c->eph_count_max, - c->compressed_pages, c->compressed_sum_size, - c->compress_poor, c->compress_nomem); - if ( !copy_to_guest_offset(buf, off + sum, info, n + 1) ) - sum += n; - for ( i = 0; i < MAX_POOLS_PER_DOMAIN; i++ ) - { - if ( (p = c->pools[i]) == NULL ) - continue; - s = is_shared(p); - n = scnprintf(info,BSIZE,"P=CI:%d,PI:%d," - "PT:%c%c,U0:%"PRIx64",U1:%"PRIx64"%c", - c->cli_id, p->pool_id, - is_persistent(p) ? 'P' : 'E', s ? 'S' : 'P', - (uint64_t)(s ? p->uuid[0] : 0), - (uint64_t)(s ? p->uuid[1] : 0LL), - use_long ? ',' : '\n'); - if (use_long) - n += scnprintf(info+n,BSIZE-n, - "Pc:%d,Pm:%d,Oc:%ld,Om:%ld,Nc:%lu,Nm:%lu," - "ps:%lu,pt:%lu,pd:%lu,pr:%lu,px:%lu,gs:%lu,gt:%lu," - "fs:%lu,ft:%lu,os:%lu,ot:%lu\n", - _atomic_read(p->pgp_count), p->pgp_count_max, - p->obj_count, p->obj_count_max, - p->objnode_count, p->objnode_count_max, - p->good_puts, p->puts,p->dup_puts_flushed, p->dup_puts_replaced, - p->no_mem_puts, - p->found_gets, p->gets, - p->flushs_found, p->flushs, p->flush_objs_found, p->flush_objs); - if ( sum + n >= len ) - return sum; - if ( !copy_to_guest_offset(buf, off + sum, info, n + 1) ) - sum += n; - } - return sum; -} - -static int tmemc_list_shared(tmem_cli_va_param_t buf, int off, uint32_t len, - bool use_long) -{ - char info[BSIZE]; - int i, n = 0, sum = 0; - struct tmem_pool *p; - struct share_list *sl; - - for ( i = 0; i < MAX_GLOBAL_SHARED_POOLS; i++ ) - { - if ( (p = tmem_global.shared_pools[i]) == NULL ) - continue; - n = scnprintf(info+n,BSIZE-n,"S=SI:%d,PT:%c%c,U0:%"PRIx64",U1:%"PRIx64, - i, is_persistent(p) ? 'P' : 'E', - is_shared(p) ? 'S' : 'P', - p->uuid[0], p->uuid[1]); - list_for_each_entry(sl,&p->share_list, share_list) - n += scnprintf(info+n,BSIZE-n,",SC:%d",sl->client->cli_id); - n += scnprintf(info+n,BSIZE-n,"%c", use_long ? ',' : '\n'); - if (use_long) - n += scnprintf(info+n,BSIZE-n, - "Pc:%d,Pm:%d,Oc:%ld,Om:%ld,Nc:%lu,Nm:%lu," - "ps:%lu,pt:%lu,pd:%lu,pr:%lu,px:%lu,gs:%lu,gt:%lu," - "fs:%lu,ft:%lu,os:%lu,ot:%lu\n", - _atomic_read(p->pgp_count), p->pgp_count_max, - p->obj_count, p->obj_count_max, - p->objnode_count, p->objnode_count_max, - p->good_puts, p->puts,p->dup_puts_flushed, p->dup_puts_replaced, - p->no_mem_puts, - p->found_gets, p->gets, - p->flushs_found, p->flushs, p->flush_objs_found, p->flush_objs); - if ( sum + n >= len ) - return sum; - if ( !copy_to_guest_offset(buf, off + sum, info, n + 1) ) - sum += n; - } - return sum; -} - -static int tmemc_list_global_perf(tmem_cli_va_param_t buf, int off, - uint32_t len, bool use_long) -{ - char info[BSIZE]; - int n = 0, sum = 0; - - n = scnprintf(info+n,BSIZE-n,"T="); - n--; /* Overwrite trailing comma. 
*/ - n += scnprintf(info+n,BSIZE-n,"\n"); - if ( sum + n >= len ) - return sum; - if ( !copy_to_guest_offset(buf, off + sum, info, n + 1) ) - sum += n; - return sum; -} - -static int tmemc_list_global(tmem_cli_va_param_t buf, int off, uint32_t len, - bool use_long) -{ - char info[BSIZE]; - int n = 0, sum = off; - - n += scnprintf(info,BSIZE,"G=" - "Tt:%lu,Te:%lu,Cf:%lu,Af:%lu,Pf:%lu,Ta:%lu," - "Lm:%lu,Et:%lu,Ea:%lu,Rt:%lu,Ra:%lu,Rx:%lu,Fp:%lu%c", - tmem_stats.total_tmem_ops, tmem_stats.errored_tmem_ops, tmem_stats.failed_copies, - tmem_stats.alloc_failed, tmem_stats.alloc_page_failed, tmem_page_list_pages, - tmem_stats.low_on_memory, tmem_stats.evicted_pgs, - tmem_stats.evict_attempts, tmem_stats.relinq_pgs, tmem_stats.relinq_attempts, - tmem_stats.max_evicts_per_relinq, - tmem_stats.total_flush_pool, use_long ? ',' : '\n'); - if (use_long) - n += scnprintf(info+n,BSIZE-n, - "Ec:%ld,Em:%ld,Oc:%d,Om:%d,Nc:%d,Nm:%d,Pc:%d,Pm:%d," - "Fc:%d,Fm:%d,Sc:%d,Sm:%d,Ep:%lu,Gd:%lu,Zt:%lu,Gz:%lu\n", - tmem_global.eph_count, tmem_stats.global_eph_count_max, - _atomic_read(tmem_stats.global_obj_count), tmem_stats.global_obj_count_max, - _atomic_read(tmem_stats.global_rtree_node_count), tmem_stats.global_rtree_node_count_max, - _atomic_read(tmem_stats.global_pgp_count), tmem_stats.global_pgp_count_max, - _atomic_read(tmem_stats.global_page_count), tmem_stats.global_page_count_max, - _atomic_read(tmem_stats.global_pcd_count), tmem_stats.global_pcd_count_max, - tmem_stats.tot_good_eph_puts,tmem_stats.deduped_puts,tmem_stats.pcd_tot_tze_size, - tmem_stats.pcd_tot_csize); - if ( sum + n >= len ) - return sum; - if ( !copy_to_guest_offset(buf, off + sum, info, n + 1) ) - sum += n; - return sum; -} - -static int tmemc_list(domid_t cli_id, tmem_cli_va_param_t buf, uint32_t len, - bool use_long) -{ - struct client *client; - int off = 0; - - if ( cli_id == TMEM_CLI_ID_NULL ) { - off = tmemc_list_global(buf,0,len,use_long); - off += tmemc_list_shared(buf,off,len-off,use_long); - list_for_each_entry(client,&tmem_global.client_list,client_list) - off += tmemc_list_client(client, buf, off, len-off, use_long); - off += tmemc_list_global_perf(buf,off,len-off,use_long); - } - else if ( (client = tmem_client_from_cli_id(cli_id)) == NULL) - return -1; - else - off = tmemc_list_client(client, buf, 0, len, use_long); - - return 0; -} - -static int __tmemc_set_client_info(struct client *client, - XEN_GUEST_HANDLE(xen_tmem_client_t) buf) -{ - domid_t cli_id; - uint32_t old_weight; - xen_tmem_client_t info = { }; - - ASSERT(client); - - if ( copy_from_guest(&info, buf, 1) ) - return -EFAULT; - - if ( info.version != TMEM_SPEC_VERSION ) - return -EOPNOTSUPP; - - if ( info.maxpools > MAX_POOLS_PER_DOMAIN ) - return -ERANGE; - - /* Ignore info.nr_pools. */ - cli_id = client->cli_id; - - if ( info.weight != client->info.weight ) - { - old_weight = client->info.weight; - client->info.weight = info.weight; - tmem_client_info("tmem: weight set to %d for %s=%d\n", - info.weight, tmem_cli_id_str, cli_id); - atomic_sub(old_weight,&tmem_global.client_weight_total); - atomic_add(client->info.weight,&tmem_global.client_weight_total); - } - - - if ( info.flags.u.compress != client->info.flags.u.compress ) - { - client->info.flags.u.compress = info.flags.u.compress; - tmem_client_info("tmem: compression %s for %s=%d\n", - info.flags.u.compress ? 
"enabled" : "disabled", - tmem_cli_id_str,cli_id); - } - return 0; -} - -static int tmemc_set_client_info(domid_t cli_id, - XEN_GUEST_HANDLE(xen_tmem_client_t) info) -{ - struct client *client; - int ret = -ENOENT; - - if ( cli_id == TMEM_CLI_ID_NULL ) - { - list_for_each_entry(client,&tmem_global.client_list,client_list) - { - ret = __tmemc_set_client_info(client, info); - if (ret) - break; - } - } - else - { - client = tmem_client_from_cli_id(cli_id); - if ( client ) - ret = __tmemc_set_client_info(client, info); - } - return ret; -} - -static int tmemc_get_client_info(int cli_id, - XEN_GUEST_HANDLE(xen_tmem_client_t) info) -{ - struct client *client = tmem_client_from_cli_id(cli_id); - - if ( client ) - { - if ( copy_to_guest(info, &client->info, 1) ) - return -EFAULT; - } - else - { - static const xen_tmem_client_t generic = { - .version = TMEM_SPEC_VERSION, - .maxpools = MAX_POOLS_PER_DOMAIN - }; - - if ( copy_to_guest(info, &generic, 1) ) - return -EFAULT; - } - - return 0; -} - -static int tmemc_get_pool(int cli_id, - XEN_GUEST_HANDLE(xen_tmem_pool_info_t) pools, - uint32_t len) -{ - struct client *client = tmem_client_from_cli_id(cli_id); - unsigned int i, idx; - int rc = 0; - unsigned int nr = len / sizeof(xen_tmem_pool_info_t); - - if ( len % sizeof(xen_tmem_pool_info_t) ) - return -EINVAL; - - if ( nr > MAX_POOLS_PER_DOMAIN ) - return -E2BIG; - - if ( !guest_handle_okay(pools, nr) ) - return -EINVAL; - - if ( !client ) - return -EINVAL; - - for ( idx = 0, i = 0; i < MAX_POOLS_PER_DOMAIN; i++ ) - { - struct tmem_pool *pool = client->pools[i]; - xen_tmem_pool_info_t out; - - if ( pool == NULL ) - continue; - - out.flags.raw = (pool->persistent ? TMEM_POOL_PERSIST : 0) | - (pool->shared ? TMEM_POOL_SHARED : 0) | - (POOL_PAGESHIFT << TMEM_POOL_PAGESIZE_SHIFT) | - (TMEM_SPEC_VERSION << TMEM_POOL_VERSION_SHIFT); - out.n_pages = _atomic_read(pool->pgp_count); - out.uuid[0] = pool->uuid[0]; - out.uuid[1] = pool->uuid[1]; - out.id = i; - - /* N.B. 'idx' != 'i'. */ - if ( __copy_to_guest_offset(pools, idx, &out, 1) ) - { - rc = -EFAULT; - break; - } - idx++; - /* Don't try to put more than what was requested. */ - if ( idx >= nr ) - break; - } - - /* And how many we have processed. */ - return rc ? : idx; -} - -static int tmemc_set_pools(int cli_id, - XEN_GUEST_HANDLE(xen_tmem_pool_info_t) pools, - uint32_t len) -{ - unsigned int i; - int rc = 0; - unsigned int nr = len / sizeof(xen_tmem_pool_info_t); - struct client *client = tmem_client_from_cli_id(cli_id); - - if ( len % sizeof(xen_tmem_pool_info_t) ) - return -EINVAL; - - if ( nr > MAX_POOLS_PER_DOMAIN ) - return -E2BIG; - - if ( !guest_handle_okay(pools, nr) ) - return -EINVAL; - - if ( !client ) - { - client = client_create(cli_id); - if ( !client ) - return -ENOMEM; - } - for ( i = 0; i < nr; i++ ) - { - xen_tmem_pool_info_t pool; - - if ( __copy_from_guest_offset(&pool, pools, i, 1 ) ) - return -EFAULT; - - if ( pool.n_pages ) - return -EINVAL; - - rc = do_tmem_new_pool(cli_id, pool.id, pool.flags.raw, - pool.uuid[0], pool.uuid[1]); - if ( rc < 0 ) - break; - - pool.id = rc; - if ( __copy_to_guest_offset(pools, i, &pool, 1) ) - return -EFAULT; - } - - /* And how many we have processed. */ - return rc ? 
: i; -} - -static int tmemc_auth_pools(int cli_id, - XEN_GUEST_HANDLE(xen_tmem_pool_info_t) pools, - uint32_t len) -{ - unsigned int i; - int rc = 0; - unsigned int nr = len / sizeof(xen_tmem_pool_info_t); - struct client *client = tmem_client_from_cli_id(cli_id); - - if ( len % sizeof(xen_tmem_pool_info_t) ) - return -EINVAL; - - if ( nr > MAX_POOLS_PER_DOMAIN ) - return -E2BIG; - - if ( !guest_handle_okay(pools, nr) ) - return -EINVAL; - - if ( !client ) - { - client = client_create(cli_id); - if ( !client ) - return -ENOMEM; - } - - for ( i = 0; i < nr; i++ ) - { - xen_tmem_pool_info_t pool; - - if ( __copy_from_guest_offset(&pool, pools, i, 1 ) ) - return -EFAULT; - - if ( pool.n_pages ) - return -EINVAL; - - rc = tmemc_shared_pool_auth(cli_id, pool.uuid[0], pool.uuid[1], - pool.flags.u.auth); - - if ( rc < 0 ) - break; - - } - - /* And how many we have processed. */ - return rc ? : i; -} - -int tmem_control(struct xen_sysctl_tmem_op *op) -{ - int ret; - uint32_t cmd = op->cmd; - - if ( op->pad != 0 ) - return -EINVAL; - - write_lock(&tmem_rwlock); - - switch (cmd) - { - case XEN_SYSCTL_TMEM_OP_THAW: - case XEN_SYSCTL_TMEM_OP_FREEZE: - case XEN_SYSCTL_TMEM_OP_DESTROY: - ret = tmemc_freeze_pools(op->cli_id, cmd); - break; - case XEN_SYSCTL_TMEM_OP_FLUSH: - ret = tmemc_flush_mem(op->cli_id, op->arg); - break; - case XEN_SYSCTL_TMEM_OP_LIST: - ret = tmemc_list(op->cli_id, - guest_handle_cast(op->u.buf, char), op->len, op->arg); - break; - case XEN_SYSCTL_TMEM_OP_SET_CLIENT_INFO: - ret = tmemc_set_client_info(op->cli_id, op->u.client); - break; - case XEN_SYSCTL_TMEM_OP_QUERY_FREEABLE_MB: - ret = tmem_freeable_pages() >> (20 - PAGE_SHIFT); - break; - case XEN_SYSCTL_TMEM_OP_GET_CLIENT_INFO: - ret = tmemc_get_client_info(op->cli_id, op->u.client); - break; - case XEN_SYSCTL_TMEM_OP_GET_POOLS: - ret = tmemc_get_pool(op->cli_id, op->u.pool, op->len); - break; - case XEN_SYSCTL_TMEM_OP_SET_POOLS: /* TMEM_RESTORE_NEW */ - ret = tmemc_set_pools(op->cli_id, op->u.pool, op->len); - break; - case XEN_SYSCTL_TMEM_OP_SET_AUTH: /* TMEM_AUTH */ - ret = tmemc_auth_pools(op->cli_id, op->u.pool, op->len); - break; - default: - ret = do_tmem_control(op); - break; - } - - write_unlock(&tmem_rwlock); - - return ret; -} - -/* - * Local variables: - * mode: C - * c-file-style: "BSD" - * c-basic-offset: 4 - * tab-width: 4 - * indent-tabs-mode: nil - * End: - */ diff --git a/xen/common/tmem_xen.c b/xen/common/tmem_xen.c deleted file mode 100644 index bf7b14f79a..0000000000 --- a/xen/common/tmem_xen.c +++ /dev/null @@ -1,277 +0,0 @@ -/****************************************************************************** - * tmem-xen.c - * - * Xen-specific Transcendent memory - * - * Copyright (c) 2009, Dan Magenheimer, Oracle Corp. 
- */ - -#include <xen/tmem.h> -#include <xen/tmem_xen.h> -#include <xen/lzo.h> /* compression code */ -#include <xen/paging.h> -#include <xen/domain_page.h> -#include <xen/cpu.h> -#include <xen/init.h> - -bool __read_mostly opt_tmem; -boolean_param("tmem", opt_tmem); - -bool __read_mostly opt_tmem_compress; -boolean_param("tmem_compress", opt_tmem_compress); - -atomic_t freeable_page_count = ATOMIC_INIT(0); - -/* these are a concurrency bottleneck, could be percpu and dynamically - * allocated iff opt_tmem_compress */ -#define LZO_WORKMEM_BYTES LZO1X_1_MEM_COMPRESS -#define LZO_DSTMEM_PAGES 2 -static DEFINE_PER_CPU_READ_MOSTLY(unsigned char *, workmem); -static DEFINE_PER_CPU_READ_MOSTLY(unsigned char *, dstmem); -static DEFINE_PER_CPU_READ_MOSTLY(void *, scratch_page); - -#if defined(CONFIG_ARM) -static inline void *cli_get_page(xen_pfn_t cmfn, mfn_t *pcli_mfn, - struct page_info **pcli_pfp, bool cli_write) -{ - ASSERT_UNREACHABLE(); - return NULL; -} - -static inline void cli_put_page(void *cli_va, struct page_info *cli_pfp, - mfn_t cli_mfn, bool mark_dirty) -{ - ASSERT_UNREACHABLE(); -} -#else -#include <asm/p2m.h> - -static inline void *cli_get_page(xen_pfn_t cmfn, mfn_t *pcli_mfn, - struct page_info **pcli_pfp, bool cli_write) -{ - p2m_type_t t; - struct page_info *page; - - page = get_page_from_gfn(current->domain, cmfn, &t, P2M_ALLOC); - if ( !page || t != p2m_ram_rw ) - { - if ( page ) - put_page(page); - return NULL; - } - - if ( cli_write && !get_page_type(page, PGT_writable_page) ) - { - put_page(page); - return NULL; - } - - *pcli_mfn = page_to_mfn(page); - *pcli_pfp = page; - - return map_domain_page(*pcli_mfn); -} - -static inline void cli_put_page(void *cli_va, struct page_info *cli_pfp, - mfn_t cli_mfn, bool mark_dirty) -{ - if ( mark_dirty ) - { - put_page_and_type(cli_pfp); - paging_mark_dirty(current->domain, cli_mfn); - } - else - put_page(cli_pfp); - unmap_domain_page(cli_va); -} -#endif - -int tmem_copy_from_client(struct page_info *pfp, - xen_pfn_t cmfn, tmem_cli_va_param_t clibuf) -{ - mfn_t tmem_mfn, cli_mfn = INVALID_MFN; - char *tmem_va, *cli_va = NULL; - struct page_info *cli_pfp = NULL; - int rc = 1; - - ASSERT(pfp != NULL); - tmem_mfn = page_to_mfn(pfp); - tmem_va = map_domain_page(tmem_mfn); - if ( guest_handle_is_null(clibuf) ) - { - cli_va = cli_get_page(cmfn, &cli_mfn, &cli_pfp, 0); - if ( cli_va == NULL ) - { - unmap_domain_page(tmem_va); - return -EFAULT; - } - } - smp_mb(); - if ( cli_va ) - { - memcpy(tmem_va, cli_va, PAGE_SIZE); - cli_put_page(cli_va, cli_pfp, cli_mfn, 0); - } - else - rc = -EINVAL; - unmap_domain_page(tmem_va); - return rc; -} - -int tmem_compress_from_client(xen_pfn_t cmfn, - void **out_va, size_t *out_len, tmem_cli_va_param_t clibuf) -{ - int ret = 0; - unsigned char *dmem = this_cpu(dstmem); - unsigned char *wmem = this_cpu(workmem); - char *scratch = this_cpu(scratch_page); - struct page_info *cli_pfp = NULL; - mfn_t cli_mfn = INVALID_MFN; - void *cli_va = NULL; - - if ( dmem == NULL || wmem == NULL ) - return 0; /* no buffer, so can't compress */ - if ( guest_handle_is_null(clibuf) ) - { - cli_va = cli_get_page(cmfn, &cli_mfn, &cli_pfp, 0); - if ( cli_va == NULL ) - return -EFAULT; - } - else if ( !scratch ) - return 0; - else if ( copy_from_guest(scratch, clibuf, PAGE_SIZE) ) - return -EFAULT; - smp_mb(); - ret = lzo1x_1_compress(cli_va ?: scratch, PAGE_SIZE, dmem, out_len, wmem); - ASSERT(ret == LZO_E_OK); - *out_va = dmem; - if ( cli_va ) - cli_put_page(cli_va, cli_pfp, cli_mfn, 0); - return 1; -} - -int 
tmem_copy_to_client(xen_pfn_t cmfn, struct page_info *pfp, - tmem_cli_va_param_t clibuf) -{ - mfn_t tmem_mfn, cli_mfn = INVALID_MFN; - char *tmem_va, *cli_va = NULL; - struct page_info *cli_pfp = NULL; - int rc = 1; - - ASSERT(pfp != NULL); - if ( guest_handle_is_null(clibuf) ) - { - cli_va = cli_get_page(cmfn, &cli_mfn, &cli_pfp, 1); - if ( cli_va == NULL ) - return -EFAULT; - } - tmem_mfn = page_to_mfn(pfp); - tmem_va = map_domain_page(tmem_mfn); - - if ( cli_va ) - { - memcpy(cli_va, tmem_va, PAGE_SIZE); - cli_put_page(cli_va, cli_pfp, cli_mfn, 1); - } - else - rc = -EINVAL; - unmap_domain_page(tmem_va); - smp_mb(); - return rc; -} - -int tmem_decompress_to_client(xen_pfn_t cmfn, void *tmem_va, - size_t size, tmem_cli_va_param_t clibuf) -{ - mfn_t cli_mfn = INVALID_MFN; - struct page_info *cli_pfp = NULL; - void *cli_va = NULL; - char *scratch = this_cpu(scratch_page); - size_t out_len = PAGE_SIZE; - int ret; - - if ( guest_handle_is_null(clibuf) ) - { - cli_va = cli_get_page(cmfn, &cli_mfn, &cli_pfp, 1); - if ( cli_va == NULL ) - return -EFAULT; - } - else if ( !scratch ) - return 0; - ret = lzo1x_decompress_safe(tmem_va, size, cli_va ?: scratch, &out_len); - ASSERT(ret == LZO_E_OK); - ASSERT(out_len == PAGE_SIZE); - if ( cli_va ) - cli_put_page(cli_va, cli_pfp, cli_mfn, 1); - else if ( copy_to_guest(clibuf, scratch, PAGE_SIZE) ) - return -EFAULT; - smp_mb(); - return 1; -} - -/****************** XEN-SPECIFIC HOST INITIALIZATION ********************/ -static int dstmem_order, workmem_order; - -static int cpu_callback( - struct notifier_block *nfb, unsigned long action, void *hcpu) -{ - unsigned int cpu = (unsigned long)hcpu; - - switch ( action ) - { - case CPU_UP_PREPARE: { - if ( per_cpu(dstmem, cpu) == NULL ) - per_cpu(dstmem, cpu) = alloc_xenheap_pages(dstmem_order, 0); - if ( per_cpu(workmem, cpu) == NULL ) - per_cpu(workmem, cpu) = alloc_xenheap_pages(workmem_order, 0); - if ( per_cpu(scratch_page, cpu) == NULL ) - per_cpu(scratch_page, cpu) = alloc_xenheap_page(); - break; - } - case CPU_DEAD: - case CPU_UP_CANCELED: { - if ( per_cpu(dstmem, cpu) != NULL ) - { - free_xenheap_pages(per_cpu(dstmem, cpu), dstmem_order); - per_cpu(dstmem, cpu) = NULL; - } - if ( per_cpu(workmem, cpu) != NULL ) - { - free_xenheap_pages(per_cpu(workmem, cpu), workmem_order); - per_cpu(workmem, cpu) = NULL; - } - if ( per_cpu(scratch_page, cpu) != NULL ) - { - free_xenheap_page(per_cpu(scratch_page, cpu)); - per_cpu(scratch_page, cpu) = NULL; - } - break; - } - default: - break; - } - - return NOTIFY_DONE; -} - -static struct notifier_block cpu_nfb = { - .notifier_call = cpu_callback -}; - -int __init tmem_init(void) -{ - unsigned int cpu; - - dstmem_order = get_order_from_pages(LZO_DSTMEM_PAGES); - workmem_order = get_order_from_bytes(LZO1X_1_MEM_COMPRESS); - - for_each_online_cpu ( cpu ) - { - void *hcpu = (void *)(long)cpu; - cpu_callback(&cpu_nfb, CPU_UP_PREPARE, hcpu); - } - - register_cpu_notifier(&cpu_nfb); - - return 1; -} diff --git a/xen/include/Makefile b/xen/include/Makefile index f7895e4d4e..325a0b88d9 100644 --- a/xen/include/Makefile +++ b/xen/include/Makefile @@ -16,7 +16,6 @@ headers-y := \ compat/physdev.h \ compat/platform.h \ compat/sched.h \ - compat/tmem.h \ compat/trace.h \ compat/vcpu.h \ compat/version.h \ diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h index 1ccf20787a..1b83407fcd 100644 --- a/xen/include/public/sysctl.h +++ b/xen/include/public/sysctl.h @@ -34,7 +34,6 @@ #include "xen.h" #include "domctl.h" #include "physdev.h" -#include "tmem.h" 
#define XEN_SYSCTL_INTERFACE_VERSION 0x00000012 @@ -732,110 +731,6 @@ struct xen_sysctl_psr_alloc { } u; }; -#define XEN_SYSCTL_TMEM_OP_ALL_CLIENTS 0xFFFFU - -#define XEN_SYSCTL_TMEM_OP_THAW 0 -#define XEN_SYSCTL_TMEM_OP_FREEZE 1 -#define XEN_SYSCTL_TMEM_OP_FLUSH 2 -#define XEN_SYSCTL_TMEM_OP_DESTROY 3 -#define XEN_SYSCTL_TMEM_OP_LIST 4 -#define XEN_SYSCTL_TMEM_OP_GET_CLIENT_INFO 5 -#define XEN_SYSCTL_TMEM_OP_SET_CLIENT_INFO 6 -#define XEN_SYSCTL_TMEM_OP_GET_POOLS 7 -#define XEN_SYSCTL_TMEM_OP_QUERY_FREEABLE_MB 8 -#define XEN_SYSCTL_TMEM_OP_SET_POOLS 9 -#define XEN_SYSCTL_TMEM_OP_SAVE_BEGIN 10 -#define XEN_SYSCTL_TMEM_OP_SET_AUTH 11 -#define XEN_SYSCTL_TMEM_OP_SAVE_GET_NEXT_PAGE 19 -#define XEN_SYSCTL_TMEM_OP_SAVE_GET_NEXT_INV 20 -#define XEN_SYSCTL_TMEM_OP_SAVE_END 21 -#define XEN_SYSCTL_TMEM_OP_RESTORE_BEGIN 30 -#define XEN_SYSCTL_TMEM_OP_RESTORE_PUT_PAGE 32 -#define XEN_SYSCTL_TMEM_OP_RESTORE_FLUSH_PAGE 33 - -/* - * XEN_SYSCTL_TMEM_OP_SAVE_GET_NEXT_[PAGE|INV] override the 'buf' in - * xen_sysctl_tmem_op with this structure - sometimes with an extra - * page tackled on. - */ -struct tmem_handle { - uint32_t pool_id; - uint32_t index; - xen_tmem_oid_t oid; -}; - -/* - * XEN_SYSCTL_TMEM_OP_[GET,SAVE]_CLIENT uses the 'client' in - * xen_tmem_op with this structure, which is mostly used during migration. - */ -struct xen_tmem_client { - uint32_t version; /* If mismatched we will get XEN_EOPNOTSUPP. */ - uint32_t maxpools; /* If greater than what hypervisor supports, will get - XEN_ERANGE. */ - uint32_t nr_pools; /* Current amount of pools. Ignored on SET*/ - union { /* See TMEM_CLIENT_[COMPRESS,FROZEN] */ - uint32_t raw; - struct { - uint8_t frozen:1, - compress:1, - migrating:1; - } u; - } flags; - uint32_t weight; -}; -typedef struct xen_tmem_client xen_tmem_client_t; -DEFINE_XEN_GUEST_HANDLE(xen_tmem_client_t); - -/* - * XEN_SYSCTL_TMEM_OP_[GET|SET]_POOLS or XEN_SYSCTL_TMEM_OP_SET_AUTH - * uses the 'pool' array in * xen_sysctl_tmem_op with this structure. - * The XEN_SYSCTL_TMEM_OP_GET_POOLS hypercall will - * return the number of entries in 'pool' or a negative value - * if an error was encountered. - * The XEN_SYSCTL_TMEM_OP_SET_[AUTH|POOLS] will return the number of - * entries in 'pool' processed or a negative value if an error - * was encountered. - */ -struct xen_tmem_pool_info { - union { - uint32_t raw; - struct { - uint32_t persist:1, /* See TMEM_POOL_PERSIST. */ - shared:1, /* See TMEM_POOL_SHARED. */ - auth:1, /* See TMEM_POOL_AUTH. */ - rsv1:1, - pagebits:8, /* TMEM_POOL_PAGESIZE_[SHIFT,MASK]. */ - rsv2:12, - version:8; /* TMEM_POOL_VERSION_[SHIFT,MASK]. */ - } u; - } flags; - uint32_t id; /* Less than tmem_client.maxpools. */ - uint64_t n_pages; /* Zero on XEN_SYSCTL_TMEM_OP_SET_[AUTH|POOLS]. */ - uint64_aligned_t uuid[2]; -}; -typedef struct xen_tmem_pool_info xen_tmem_pool_info_t; -DEFINE_XEN_GUEST_HANDLE(xen_tmem_pool_info_t); - -struct xen_sysctl_tmem_op { - uint32_t cmd; /* IN: XEN_SYSCTL_TMEM_OP_* . */ - int32_t pool_id; /* IN: 0 by default unless _SAVE_*, RESTORE_* .*/ - uint32_t cli_id; /* IN: client id, 0 for XEN_SYSCTL_TMEM_QUERY_FREEABLE_MB - for all others can be the domain id or - XEN_SYSCTL_TMEM_OP_ALL_CLIENTS for all. */ - uint32_t len; /* IN: length of 'buf'. If not applicable to use 0. */ - uint32_t arg; /* IN: If not applicable to command use 0. */ - uint32_t pad; /* Padding so structure is the same under 32 and 64. */ - xen_tmem_oid_t oid; /* IN: If not applicable to command use 0s. 
*/ - union { - XEN_GUEST_HANDLE_64(char) buf; /* IN/OUT: Buffer to save/restore */ - XEN_GUEST_HANDLE_64(xen_tmem_client_t) client; /* IN/OUT for */ - /* XEN_SYSCTL_TMEM_OP_[GET,SAVE]_CLIENT. */ - XEN_GUEST_HANDLE_64(xen_tmem_pool_info_t) pool; /* OUT for */ - /* XEN_SYSCTL_TMEM_OP_GET_POOLS. Must have 'len' */ - /* of them. */ - } u; -}; - /* * XEN_SYSCTL_get_cpu_levelling_caps (x86 specific) * @@ -1124,7 +1019,7 @@ struct xen_sysctl { #define XEN_SYSCTL_psr_cmt_op 21 #define XEN_SYSCTL_pcitopoinfo 22 #define XEN_SYSCTL_psr_alloc 23 -#define XEN_SYSCTL_tmem_op 24 +/* #define XEN_SYSCTL_tmem_op 24 */ #define XEN_SYSCTL_get_cpu_levelling_caps 25 #define XEN_SYSCTL_get_cpu_featureset 26 #define XEN_SYSCTL_livepatch_op 27 @@ -1154,7 +1049,6 @@ struct xen_sysctl { struct xen_sysctl_coverage_op coverage_op; struct xen_sysctl_psr_cmt_op psr_cmt_op; struct xen_sysctl_psr_alloc psr_alloc; - struct xen_sysctl_tmem_op tmem_op; struct xen_sysctl_cpu_levelling_caps cpu_levelling_caps; struct xen_sysctl_cpu_featureset cpu_featureset; struct xen_sysctl_livepatch_op livepatch; diff --git a/xen/include/public/tmem.h b/xen/include/public/tmem.h deleted file mode 100644 index aa0aafaa9d..0000000000 --- a/xen/include/public/tmem.h +++ /dev/null @@ -1,124 +0,0 @@ -/****************************************************************************** - * tmem.h - * - * Guest OS interface to Xen Transcendent Memory. - * - * Permission is hereby granted, free of charge, to any person obtaining a copy - * of this software and associated documentation files (the "Software"), to - * deal in the Software without restriction, including without limitation the - * rights to use, copy, modify, merge, publish, distribute, sublicense, and/or - * sell copies of the Software, and to permit persons to whom the Software is - * furnished to do so, subject to the following conditions: - * - * The above copyright notice and this permission notice shall be included in - * all copies or substantial portions of the Software. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR - * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, - * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE - * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER - * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING - * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER - * DEALINGS IN THE SOFTWARE. - * - * Copyright (c) 2004, K A Fraser - */ - -#ifndef __XEN_PUBLIC_TMEM_H__ -#define __XEN_PUBLIC_TMEM_H__ - -#include "xen.h" - -/* version of ABI */ -#define TMEM_SPEC_VERSION 1 - -/* Commands to HYPERVISOR_tmem_op() */ -#ifdef __XEN__ -#define TMEM_CONTROL 0 /* Now called XEN_SYSCTL_tmem_op */ -#else -#undef TMEM_CONTROL -#endif -#define TMEM_NEW_POOL 1 -#define TMEM_DESTROY_POOL 2 -#define TMEM_PUT_PAGE 4 -#define TMEM_GET_PAGE 5 -#define TMEM_FLUSH_PAGE 6 -#define TMEM_FLUSH_OBJECT 7 -#if __XEN_INTERFACE_VERSION__ < 0x00040400 -#define TMEM_NEW_PAGE 3 -#define TMEM_READ 8 -#define TMEM_WRITE 9 -#define TMEM_XCHG 10 -#endif - -/* Privileged commands now called via XEN_SYSCTL_tmem_op. */ -#define TMEM_AUTH 101 /* as XEN_SYSCTL_TMEM_OP_SET_AUTH. */ -#define TMEM_RESTORE_NEW 102 /* as XEN_SYSCTL_TMEM_OP_SET_POOL. 
*/ - -/* Bits for HYPERVISOR_tmem_op(TMEM_NEW_POOL) */ -#define TMEM_POOL_PERSIST 1 -#define TMEM_POOL_SHARED 2 -#define TMEM_POOL_PRECOMPRESSED 4 -#define TMEM_POOL_PAGESIZE_SHIFT 4 -#define TMEM_POOL_PAGESIZE_MASK 0xf -#define TMEM_POOL_VERSION_SHIFT 24 -#define TMEM_POOL_VERSION_MASK 0xff -#define TMEM_POOL_RESERVED_BITS 0x00ffff00 - -/* Bits for client flags (save/restore) */ -#define TMEM_CLIENT_COMPRESS 1 -#define TMEM_CLIENT_FROZEN 2 - -/* Special errno values */ -#define EFROZEN 1000 -#define EEMPTY 1001 - -struct xen_tmem_oid { - uint64_t oid[3]; -}; -typedef struct xen_tmem_oid xen_tmem_oid_t; -DEFINE_XEN_GUEST_HANDLE(xen_tmem_oid_t); - -#ifndef __ASSEMBLY__ -#if __XEN_INTERFACE_VERSION__ < 0x00040400 -typedef xen_pfn_t tmem_cli_mfn_t; -#endif -typedef XEN_GUEST_HANDLE(char) tmem_cli_va_t; -struct tmem_op { - uint32_t cmd; - int32_t pool_id; - union { - struct { - uint64_t uuid[2]; - uint32_t flags; - uint32_t arg1; - } creat; /* for cmd == TMEM_NEW_POOL. */ - struct { -#if __XEN_INTERFACE_VERSION__ < 0x00040600 - uint64_t oid[3]; -#else - xen_tmem_oid_t oid; -#endif - uint32_t index; - uint32_t tmem_offset; - uint32_t pfn_offset; - uint32_t len; - xen_pfn_t cmfn; /* client machine page frame */ - } gen; /* for all other cmd ("generic") */ - } u; -}; -typedef struct tmem_op tmem_op_t; -DEFINE_XEN_GUEST_HANDLE(tmem_op_t); -#endif - -#endif /* __XEN_PUBLIC_TMEM_H__ */ - -/* - * Local variables: - * mode: C - * c-file-style: "BSD" - * c-basic-offset: 4 - * tab-width: 4 - * indent-tabs-mode: nil - * End: - */ diff --git a/xen/include/xen/hypercall.h b/xen/include/xen/hypercall.h index cc99aea57d..888775f9a7 100644 --- a/xen/include/xen/hypercall.h +++ b/xen/include/xen/hypercall.h @@ -12,7 +12,6 @@ #include <public/sysctl.h> #include <public/platform.h> #include <public/event_channel.h> -#include <public/tmem.h> #include <public/version.h> #include <public/pmu.h> #include <public/hvm/dm_op.h> @@ -130,12 +129,6 @@ extern long do_xsm_op( XEN_GUEST_HANDLE_PARAM(xsm_op_t) u_xsm_op); -#ifdef CONFIG_TMEM -extern long -do_tmem_op( - XEN_GUEST_HANDLE_PARAM(tmem_op_t) uops); -#endif - extern long do_xenoprof_op(int op, XEN_GUEST_HANDLE_PARAM(void) arg); diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h index 0309c1f2a0..c8ca3e6853 100644 --- a/xen/include/xen/sched.h +++ b/xen/include/xen/sched.h @@ -455,9 +455,6 @@ struct domain */ spinlock_t hypercall_deadlock_mutex; - /* transcendent memory, auto-allocated on first tmem op by each domain */ - struct client *tmem_client; - struct lock_profile_qhead profile_head; /* Various vm_events */ diff --git a/xen/include/xen/tmem.h b/xen/include/xen/tmem.h deleted file mode 100644 index 414a14d808..0000000000 --- a/xen/include/xen/tmem.h +++ /dev/null @@ -1,45 +0,0 @@ -/****************************************************************************** - * tmem.h - * - * Transcendent memory - * - * Copyright (c) 2008, Dan Magenheimer, Oracle Corp. 
- */ - -#ifndef __XEN_TMEM_H__ -#define __XEN_TMEM_H__ - -struct xen_sysctl_tmem_op; - -#ifdef CONFIG_TMEM -extern int tmem_control(struct xen_sysctl_tmem_op *op); -extern void tmem_destroy(void *); -extern void *tmem_relinquish_pages(unsigned int, unsigned int); -extern unsigned long tmem_freeable_pages(void); -#else -static inline int -tmem_control(struct xen_sysctl_tmem_op *op) -{ - return -ENOSYS; -} - -static inline void -tmem_destroy(void *p) -{ - return; -} - -static inline void * -tmem_relinquish_pages(unsigned int x, unsigned int y) -{ - return NULL; -} - -static inline unsigned long -tmem_freeable_pages(void) -{ - return 0; -} -#endif /* CONFIG_TMEM */ - -#endif /* __XEN_TMEM_H__ */ diff --git a/xen/include/xen/tmem_control.h b/xen/include/xen/tmem_control.h deleted file mode 100644 index ad04cf707b..0000000000 --- a/xen/include/xen/tmem_control.h +++ /dev/null @@ -1,39 +0,0 @@ -/* - * Copyright (c) 2016 Oracle and/or its affiliates. All rights reserved. - * - */ - -#ifndef __XEN_TMEM_CONTROL_H__ -#define __XEN_TMEM_CONTROL_H__ - -#ifdef CONFIG_TMEM -#include <public/sysctl.h> -/* Variables and functions that tmem_control.c needs from tmem.c */ - -extern struct tmem_statistics tmem_stats; -extern struct tmem_global tmem_global; - -extern rwlock_t tmem_rwlock; - -int tmem_evict(void); -int do_tmem_control(struct xen_sysctl_tmem_op *op); - -struct client *client_create(domid_t cli_id); -int do_tmem_new_pool(domid_t this_cli_id, uint32_t d_poolid, uint32_t flags, - uint64_t uuid_lo, uint64_t uuid_hi); - -int tmemc_shared_pool_auth(domid_t cli_id, uint64_t uuid_lo, - uint64_t uuid_hi, bool auth); -#endif /* CONFIG_TMEM */ - -#endif /* __XEN_TMEM_CONTROL_H__ */ - -/* - * Local variables: - * mode: C - * c-file-style: "BSD" - * c-basic-offset: 4 - * tab-width: 4 - * indent-tabs-mode: nil - * End: - */ diff --git a/xen/include/xen/tmem_xen.h b/xen/include/xen/tmem_xen.h deleted file mode 100644 index 8516a0b131..0000000000 --- a/xen/include/xen/tmem_xen.h +++ /dev/null @@ -1,343 +0,0 @@ -/****************************************************************************** - * tmem_xen.h - * - * Xen-specific Transcendent memory - * - * Copyright (c) 2009, Dan Magenheimer, Oracle Corp. 
- */ - -#ifndef __XEN_TMEM_XEN_H__ -#define __XEN_TMEM_XEN_H__ - -#include <xen/mm.h> /* heap alloc/free */ -#include <xen/pfn.h> -#include <xen/xmalloc.h> /* xmalloc/xfree */ -#include <xen/sched.h> /* struct domain */ -#include <xen/guest_access.h> /* copy_from_guest */ -#include <xen/hash.h> /* hash_long */ -#include <xen/domain_page.h> /* __map_domain_page */ -#include <xen/rbtree.h> /* struct rb_root */ -#include <xsm/xsm.h> /* xsm_tmem_control */ -#include <public/tmem.h> -#ifdef CONFIG_COMPAT -#include <compat/tmem.h> -#endif -typedef uint32_t pagesize_t; /* like size_t, must handle largest PAGE_SIZE */ - -#define IS_PAGE_ALIGNED(addr) IS_ALIGNED((unsigned long)(addr), PAGE_SIZE) -#define IS_VALID_PAGE(_pi) mfn_valid(page_to_mfn(_pi)) - -extern struct page_list_head tmem_page_list; -extern spinlock_t tmem_page_list_lock; -extern unsigned long tmem_page_list_pages; -extern atomic_t freeable_page_count; - -extern int tmem_init(void); -#define tmem_hash hash_long - -extern bool opt_tmem_compress; -static inline bool tmem_compression_enabled(void) -{ - return opt_tmem_compress; -} - -#ifdef CONFIG_TMEM -extern bool opt_tmem; -static inline bool tmem_enabled(void) -{ - return opt_tmem; -} - -static inline void tmem_disable(void) -{ - opt_tmem = false; -} -#else -static inline bool tmem_enabled(void) -{ - return false; -} - -static inline void tmem_disable(void) -{ -} -#endif /* CONFIG_TMEM */ - -/* - * Memory free page list management - */ - -static inline struct page_info *tmem_page_list_get(void) -{ - struct page_info *pi; - - spin_lock(&tmem_page_list_lock); - if ( (pi = page_list_remove_head(&tmem_page_list)) != NULL ) - tmem_page_list_pages--; - spin_unlock(&tmem_page_list_lock); - ASSERT((pi == NULL) || IS_VALID_PAGE(pi)); - return pi; -} - -static inline void tmem_page_list_put(struct page_info *pi) -{ - ASSERT(IS_VALID_PAGE(pi)); - spin_lock(&tmem_page_list_lock); - page_list_add(pi, &tmem_page_list); - tmem_page_list_pages++; - spin_unlock(&tmem_page_list_lock); -} - -/* - * Memory allocation for persistent data - */ -static inline struct page_info *__tmem_alloc_page_thispool(struct domain *d) -{ - struct page_info *pi; - - /* note that this tot_pages check is not protected by d->page_alloc_lock, - * so may race and periodically fail in donate_page or alloc_domheap_pages - * That's OK... 
neither is a problem, though chatty if log_lvl is set */ - if ( d->tot_pages >= d->max_pages ) - return NULL; - - if ( tmem_page_list_pages ) - { - if ( (pi = tmem_page_list_get()) != NULL ) - { - if ( donate_page(d,pi,0) == 0 ) - goto out; - else - tmem_page_list_put(pi); - } - } - - pi = alloc_domheap_pages(d,0,MEMF_tmem); - -out: - ASSERT((pi == NULL) || IS_VALID_PAGE(pi)); - return pi; -} - -static inline void __tmem_free_page_thispool(struct page_info *pi) -{ - struct domain *d = page_get_owner(pi); - - ASSERT(IS_VALID_PAGE(pi)); - if ( (d == NULL) || steal_page(d,pi,0) == 0 ) - tmem_page_list_put(pi); - else - { - scrub_one_page(pi); - ASSERT((pi->count_info & ~(PGC_allocated | 1)) == 0); - free_domheap_pages(pi,0); - } -} - -/* - * Memory allocation for ephemeral (non-persistent) data - */ -static inline struct page_info *__tmem_alloc_page(void) -{ - struct page_info *pi = tmem_page_list_get(); - - if ( pi == NULL) - pi = alloc_domheap_pages(0,0,MEMF_tmem); - - if ( pi ) - atomic_inc(&freeable_page_count); - ASSERT((pi == NULL) || IS_VALID_PAGE(pi)); - return pi; -} - -static inline void __tmem_free_page(struct page_info *pi) -{ - ASSERT(IS_VALID_PAGE(pi)); - tmem_page_list_put(pi); - atomic_dec(&freeable_page_count); -} - -/* "Client" (==domain) abstraction */ -static inline struct client *tmem_client_from_cli_id(domid_t cli_id) -{ - struct client *c; - struct domain *d = rcu_lock_domain_by_id(cli_id); - if (d == NULL) - return NULL; - c = d->tmem_client; - rcu_unlock_domain(d); - return c; -} - -/* these typedefs are in the public/tmem.h interface -typedef XEN_GUEST_HANDLE(void) cli_mfn_t; -typedef XEN_GUEST_HANDLE(char) cli_va_t; -*/ -typedef XEN_GUEST_HANDLE_PARAM(tmem_op_t) tmem_cli_op_t; -typedef XEN_GUEST_HANDLE_PARAM(char) tmem_cli_va_param_t; - -static inline int tmem_get_tmemop_from_client(tmem_op_t *op, tmem_cli_op_t uops) -{ -#ifdef CONFIG_COMPAT - if ( is_hvm_vcpu(current) ? hvm_guest_x86_mode(current) != 8 - : is_pv_32bit_vcpu(current) ) - { - int rc; - enum XLAT_tmem_op_u u; - tmem_op_compat_t cop; - - rc = copy_from_guest(&cop, guest_handle_cast(uops, void), 1); - if ( rc ) - return rc; - switch ( cop.cmd ) - { - case TMEM_NEW_POOL: u = XLAT_tmem_op_u_creat; break; - default: u = XLAT_tmem_op_u_gen ; break; - } - XLAT_tmem_op(op, &cop); - return 0; - } -#endif - return copy_from_guest(op, uops, 1); -} - -#define tmem_cli_buf_null guest_handle_from_ptr(NULL, char) -#define TMEM_CLI_ID_NULL ((domid_t)((domid_t)-1L)) -#define tmem_cli_id_str "domid" -#define tmem_client_str "domain" - -int tmem_decompress_to_client(xen_pfn_t, void *, size_t, - tmem_cli_va_param_t); -int tmem_compress_from_client(xen_pfn_t, void **, size_t *, - tmem_cli_va_param_t); - -int tmem_copy_from_client(struct page_info *, xen_pfn_t, tmem_cli_va_param_t); -int tmem_copy_to_client(xen_pfn_t, struct page_info *, tmem_cli_va_param_t); - -#define tmem_client_err(fmt, args...) printk(XENLOG_G_ERR fmt, ##args) -#define tmem_client_warn(fmt, args...) printk(XENLOG_G_WARNING fmt, ##args) -#define tmem_client_info(fmt, args...) printk(XENLOG_G_INFO fmt, ##args) - -/* Global statistics (none need to be locked). 
*/ -struct tmem_statistics { - unsigned long total_tmem_ops; - unsigned long errored_tmem_ops; - unsigned long total_flush_pool; - unsigned long alloc_failed; - unsigned long alloc_page_failed; - unsigned long evicted_pgs; - unsigned long evict_attempts; - unsigned long relinq_pgs; - unsigned long relinq_attempts; - unsigned long max_evicts_per_relinq; - unsigned long low_on_memory; - unsigned long deduped_puts; - unsigned long tot_good_eph_puts; - int global_obj_count_max; - int global_pgp_count_max; - int global_pcd_count_max; - int global_page_count_max; - int global_rtree_node_count_max; - long global_eph_count_max; - unsigned long failed_copies; - unsigned long pcd_tot_tze_size; - unsigned long pcd_tot_csize; - /* Global counters (should use long_atomic_t access). */ - atomic_t global_obj_count; - atomic_t global_pgp_count; - atomic_t global_pcd_count; - atomic_t global_page_count; - atomic_t global_rtree_node_count; -}; - -#define atomic_inc_and_max(_c) do { \ - atomic_inc(&tmem_stats._c); \ - if ( _atomic_read(tmem_stats._c) > tmem_stats._c##_max ) \ - tmem_stats._c##_max = _atomic_read(tmem_stats._c); \ -} while (0) - -#define atomic_dec_and_assert(_c) do { \ - atomic_dec(&tmem_stats._c); \ - ASSERT(_atomic_read(tmem_stats._c) >= 0); \ -} while (0) - -#define MAX_GLOBAL_SHARED_POOLS 16 -struct tmem_global { - struct list_head ephemeral_page_list; /* All pages in ephemeral pools. */ - struct list_head client_list; - struct tmem_pool *shared_pools[MAX_GLOBAL_SHARED_POOLS]; - bool shared_auth; - long eph_count; /* Atomicity depends on eph_lists_spinlock. */ - atomic_t client_weight_total; -}; - -#define MAX_POOLS_PER_DOMAIN 16 - -struct tmem_pool; -struct tmem_page_descriptor; -struct tmem_page_content_descriptor; -struct client { - struct list_head client_list; - struct tmem_pool *pools[MAX_POOLS_PER_DOMAIN]; - struct domain *domain; - struct xmem_pool *persistent_pool; - struct list_head ephemeral_page_list; - long eph_count, eph_count_max; - domid_t cli_id; - xen_tmem_client_t info; - /* For save/restore/migration. */ - bool was_frozen; - struct list_head persistent_invalidated_list; - struct tmem_page_descriptor *cur_pgp; - /* Statistics collection. */ - unsigned long compress_poor, compress_nomem; - unsigned long compressed_pages; - uint64_t compressed_sum_size; - uint64_t total_cycles; - unsigned long succ_pers_puts, succ_eph_gets, succ_pers_gets; - /* Shared pool authentication. */ - uint64_t shared_auth_uuid[MAX_GLOBAL_SHARED_POOLS][2]; -}; - -#define POOL_PAGESHIFT (PAGE_SHIFT - 12) -#define OBJ_HASH_BUCKETS 256 /* Must be power of two. */ -#define OBJ_HASH_BUCKETS_MASK (OBJ_HASH_BUCKETS-1) - -#define is_persistent(_p) (_p->persistent) -#define is_shared(_p) (_p->shared) - -struct tmem_pool { - bool shared; - bool persistent; - bool is_dying; - struct client *client; - uint64_t uuid[2]; /* 0 for private, non-zero for shared. */ - uint32_t pool_id; - rwlock_t pool_rwlock; - struct rb_root obj_rb_root[OBJ_HASH_BUCKETS]; /* Protected by pool_rwlock. */ - struct list_head share_list; /* Valid if shared. */ - int shared_count; /* Valid if shared. */ - /* For save/restore/migration. */ - struct list_head persistent_page_list; - struct tmem_page_descriptor *cur_pgp; - /* Statistics collection. */ - atomic_t pgp_count; - int pgp_count_max; - long obj_count; /* Atomicity depends on pool_rwlock held for write. 
*/ - long obj_count_max; - unsigned long objnode_count, objnode_count_max; - uint64_t sum_life_cycles; - uint64_t sum_evicted_cycles; - unsigned long puts, good_puts, no_mem_puts; - unsigned long dup_puts_flushed, dup_puts_replaced; - unsigned long gets, found_gets; - unsigned long flushs, flushs_found; - unsigned long flush_objs, flush_objs_found; -}; - -struct share_list { - struct list_head share_list; - struct client *client; -}; - -#endif /* __XEN_TMEM_XEN_H__ */ diff --git a/xen/include/xlat.lst b/xen/include/xlat.lst index 527332054a..2aa238f41f 100644 --- a/xen/include/xlat.lst +++ b/xen/include/xlat.lst @@ -126,8 +126,6 @@ ? sched_pin_override sched.h ? sched_remote_shutdown sched.h ? sched_shutdown sched.h -? tmem_oid tmem.h -! tmem_op tmem.h ? t_buf trace.h ? vcpu_get_physid vcpu.h ? vcpu_register_vcpu_info vcpu.h -- 2.11.0 _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxxxxxxxxx https://lists.xenproject.org/mailman/listinfo/xen-devel