[Xen-devel] [PATCH v2 3/3] docs: remove tmem related text
Signed-off-by: Wei Liu <wei.liu2@xxxxxxxxxx> --- docs/man/xl.conf.pod.5 | 9 +- docs/man/xl.pod.1.in | 68 ---- docs/misc/tmem-internals.html | 789 ------------------------------------ docs/misc/xen-command-line.markdown | 6 - docs/misc/xsm-flask.txt | 36 -- 5 files changed, 2 insertions(+), 906 deletions(-) delete mode 100644 docs/misc/tmem-internals.html diff --git a/docs/man/xl.conf.pod.5 b/docs/man/xl.conf.pod.5 index 37262a7ef8..b1bde7d657 100644 --- a/docs/man/xl.conf.pod.5 +++ b/docs/man/xl.conf.pod.5 @@ -148,10 +148,8 @@ The default choice is "xvda". =item B<claim_mode=BOOLEAN> If this option is enabled then when a guest is created there will be an -guarantee that there is memory available for the guest. This is an -particularly acute problem on hosts with memory over-provisioned guests -that use tmem and have self-balloon enabled (which is the default -option). The self-balloon mechanism can deflate/inflate the balloon +guarantee that there is memory available for the guest. +The self-balloon mechanism can deflate/inflate the balloon quickly and the amount of free memory (which C<xl info> can show) is stale the moment it is printed. When claim is enabled a reservation for the amount of memory (see 'memory' in xl.conf(5)) is set, which is then @@ -163,9 +161,6 @@ If the reservation cannot be meet the guest creation fails immediately instead of taking seconds/minutes (depending on the size of the guest) while the guest is populated. -Note that to enable tmem type guests, one needs to provide C<tmem> on the -Xen hypervisor argument and as well on the Linux kernel command line. - Default: C<1> =over 4 diff --git a/docs/man/xl.pod.1.in b/docs/man/xl.pod.1.in index 18006880d6..7c765dbc3c 100644 --- a/docs/man/xl.pod.1.in +++ b/docs/man/xl.pod.1.in @@ -1677,74 +1677,6 @@ Obtain information of USB devices connected as such via the device model =back -=head1 TRANSCENDENT MEMORY (TMEM) - -=over 4 - -=item B<tmem-list> I<[OPTIONS]> I<domain-id> - -List tmem pools. 
- -B<OPTIONS> - -=over 4 - -=item B<-l> - -If this parameter is specified, also list tmem stats. - -=back - -=item B<tmem-freeze> I<domain-id> - -Freeze tmem pools. - -=item B<tmem-thaw> I<domain-id> - -Thaw tmem pools. - -=item B<tmem-set> I<domain-id> [I<OPTIONS>] - -Change tmem settings. - -B<OPTIONS> - -=over 4 - -=item B<-w> I<WEIGHT> - -Weight (int) - -=item B<-p> I<COMPRESS> - -Compress (int) - -=back - -=item B<tmem-shared-auth> I<domain-id> [I<OPTIONS>] - -De/authenticate shared tmem pool. - -B<OPTIONS> - -=over 4 - -=item B<-u> I<UUID> - -Specify uuid (abcdef01-2345-6789-1234-567890abcdef) - -=item B<-a> I<AUTH> - -0=auth,1=deauth - -=back - -=item B<tmem-freeable> - -Get information about how much freeable memory (MB) is in-use by tmem. - -=back - =head1 FLASK B<FLASK> is a security framework that defines a mandatory access control policy diff --git a/docs/misc/tmem-internals.html b/docs/misc/tmem-internals.html deleted file mode 100644 index 9b7e70e650..0000000000 --- a/docs/misc/tmem-internals.html +++ /dev/null @@ -1,789 +0,0 @@ -<h1>Transcendent Memory Internals in Xen</h1> -<P> -by Dan Magenheimer, Oracle Corp.</p> -<P> -Draft 0.1 -- Updated: 20100324 -<h2>Overview</h2> -<P> -This document focuses on the internal implementation of -Transcendent Memory (tmem) on Xen. It assumes -that the reader has a basic knowledge of the terminology, objectives, and -functionality of tmem and also has access to the Xen source code. -It corresponds to the Xen 4.0 release, with -patch added to support page deduplication (V2). -<P> -The primary responsibilities of the tmem implementation are to: -<ul> -<li>manage a potentially huge and extremely dynamic -number of memory pages from a potentially large number of clients (domains) -with low memory overhead and proper isolation -<li>provide quick and efficient access to these -pages with as much concurrency as possible -<li>enable efficient reclamation and <i>eviction</i> of pages (e.g. 
when -memory is fully utilized) -<li>optionally, increase page density through compression and/or -deduplication -<li>where necessary, properly assign and account for -memory belonging to guests to avoid malicious and/or accidental unfairness -and/or denial-of-service -<li>record utilization statistics and make them available to management tools -</ul> -<h2>Source Code Organization</h2> - -<P> -The source code in Xen that provides the tmem functionality -is divided up into four files: tmem.c, tmem.h, tmem_xen.c, and tmem_xen.h. -The files tmem.c and tmem.h are intended to -be implementation- (and hypervisor-) independent and the other two files -provide the Xen-specific code. This -division is intended to make it easier to port tmem functionality to other -hypervisors, though at this time porting to other hypervisors has not been -attempted. Together, these four files -total less than 4000 lines of C code. -<P> -Even ignoring the implementation-specific functionality, the -implementation-independent part of tmem has several dependencies on -library functionality (Xen source filenames in parentheses): -<ul> -<li> -a good fast general-purpose dynamic memory -allocator with bounded response time and efficient use of memory for a very -large number of sub-page allocations. To -achieve this in Xen, the bad old memory allocator was replaced with a -slightly-modified version of TLSF (xmalloc_tlsf.c), first ported to Linux by -Nitin Gupta for compcache. -<li> -good tree data structure libraries, specifically -<i>red-black</i> trees (rbtree.c) and <i>radix</i> trees (radix-tree.c). -Code for these was borrowed for Linux and adapted for tmem and Xen. -<li> -good locking and list code. Both of these existed in Xen and required -little or no change. -<li> -optionally, a good fast lossless compression -library. The Xen implementation added to -support tmem uses LZO1X (lzo.c), also ported for Linux by Nitin Gupta. 
-</ul> -<P> -More information about the specific functionality of these -libraries can easily be found through a search engine, via wikipedia, or in the -Xen or Linux source logs so we will not elaborate further here. - -<h2>Prefixes/Abbreviations/Glossary</h2> - -<P> -The tmem code uses several prefixes and abbreviations. -Knowledge of these will improve code readability: -<ul> -<li> -<i>tmh</i> == -transcendent memory host. Functions or -data structures that are defined by the implementation-specific code, i.e. the -Xen host code -<li> -<i>tmemc</i> -== transcendent memory control. -Functions or data structures that provide management tool functionality, -rather than core tmem operations. -<li> -<i>cli </i>or -<i>client</i> == client. -The tmem generic term for a domain or a guest OS. -</ul> -<P> -When used in prose, common tmem operations are indicated -with a different font, such as <big><kbd>put</kbd></big> -and <big><kbd>get</kbd></big>. - -<h2>Key Data Structures</h2> - -<P> -To manage a huge number of pages, efficient data structures -must be carefully selected. -<P> -Recall that a tmem-enabled guest OS may create one or more -pools with different attributes. It then -<kbd>put</kbd></big>s and <kbd>get</kbd></big>s -pages to/from this pool, identifying the page -with a <i>handle</i> that consists of a <i>pool_id</i>, an <i> -object_id</i>, and a <i>page_id </i>(sometimes -called an <i>index</i>). -This suggests a few obvious core data -structures: -<ul> -<li> -When a guest OS first calls tmem, a <i>client_t</i> is created to contain -and track all uses of tmem by that guest OS. Among -other things, a <i>client_t</i> keeps pointers -to a fixed number of pools (16 in the current Xen implementation). -<li> -When a guest OS requests a new pool, a <i>pool_t</i> is created. -Some pools are shared and are kept in a -sharelist (<i>sharelist_t</i>) which points -to all the clients that are sharing the pool. 
-Since an <i>object_id</i> is 64-bits, -a <i>pool_t</i> must be able to keep track -of a potentially very large number of objects. -To do so, it maintains a number of parallel trees (256 in the current -Xen implementation) and a hash algorithm is applied to the <i>object_id</i> -to select the correct tree. -Each tree element points to an object. -Because an <i>object_id</i> usually represents an <i>inode</i> -(a unique file number identifier), and <i>inode</i> numbers -are fairly random, though often "clumpy", a <i>red-black tree</i> -is used. -<li> -When a guest first -<kbd>put</kbd></big>s a page to a pool with an as-yet-unused <i>object_id,</i> an -<i>obj_t</i> is created. Since a <i ->page_id</i> is usually an index into a file, -it is often a small number, but may sometimes be very large (up to -32-bits). A <i>radix tree</i> is a good data structure to contain items -with this kind of index distribution. -<li> -When a page is -<kbd>put</kbd></big>, a page descriptor, or <i>pgp_t</i>, is created, which -among other things will point to the storage location where the data is kept. -In the normal case the pointer is to a <i>pfp_t</i>, which is an -implementation-specific datatype representing a physical pageframe in memory -(which in Xen is a "struct page_info"). -When deduplication is enabled, it points to -yet another data structure, a <i>pcd_</i>t -(see below). When compression is enabled -(and deduplication is not), the pointer points directly to the compressed data. -For reasons we will see shortly, each <i>pgp_t</i> that represents -an <i>ephemeral</i> page (that is, a page placed -in an <i>ephemeral</i> pool) is also placed -into two doubly-linked linked lists, one containing all ephemeral pages -<kbd>put</kbd></big> by the same client and one -containing all ephemeral pages across all clients ("global"). 
-<li> -When deduplication is enabled, multiple <i>pgp_</i>t's may need to point to -the same data, so another data structure (and level of indirection) is used -called a page content descriptor, or <i>pcd_t</i>. -Multiple page descriptors (<i>pgp_t</i>'s) may point to the same <i>pcd_t</i>. -The <i>pcd_t</i>, in turn, points to either a <i>pfp_t</i> -(if a full page of data), directly to a -location in memory (if the page has been compressed or trailing zeroes have -been eliminated), or even a NULL pointer (if the page contained all zeroes and -trailing zero elimination is enabled). -</ul> -<P> -The most apparent usage of this multi-layer web of data structures -is "top-down" because, in normal operation, the vast majority of tmem -operations invoked by a client are -<kbd>put</kbd></big>s and <kbd>get</kbd></big>s, which require the various -data structures to be walked starting with the <i>client_t</i>, then -a <i>pool_t</i>, then an <i>obj_t</i>, then a <i>pgd_t</i>. -However, there is another highly frequent tmem operation that is not -visible from a client: memory reclamation. -Since tmem attempts to use all spare memory in the system, it must -frequently free up, or <i>evict</i>, -pages. The eviction algorithm will be -explained in more detail later but, in brief, to free memory, ephemeral pages -are removed from the tail of one of the doubly-linked lists, which means that -all of the data structures associated with that page-to-be-removed must be -updated or eliminated and freed. As a -result, each data structure also contains a <i>back-pointer</i> -to its parent, for example every <i>obj_t</i> -contains a pointer to its containing <i>pool_t</i>. -<P> -This complex web of interconnected data structures is updated constantly and -thus extremely sensitive to careless code changes which, for example, may -result in unexpected hypervisor crashes or non-obvious memory leaks. 
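The top-down walk of these data structures (a <i>client_t</i> to one of its <i>pool_t</i>'s via the <i>pool_id</i>, then to one of the parallel object trees via a hash of the <i>object_id</i>) can be sketched roughly as follows. All structure and function names here are illustrative simplifications, not the actual Xen definitions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_POOLS        16   /* per-client pool limit, as in the text */
#define OBJ_HASH_BUCKETS 256  /* parallel trees per pool, as in the text */

struct tmem_obj;   /* would hold a radix tree of page descriptors keyed by page_id */

struct tmem_pool {
    struct tmem_obj *obj_tree[OBJ_HASH_BUCKETS]; /* stand-in for 256 rb-tree roots */
};

struct tmem_client {
    struct tmem_pool *pools[MAX_POOLS];
};

/* Hash the 64-bit object_id down to one of the parallel trees. */
static unsigned obj_hash(uint64_t object_id)
{
    return (unsigned)(object_id ^ (object_id >> 32)) % OBJ_HASH_BUCKETS;
}

/* Walk client -> pool -> hashed tree root; a real implementation would
 * then search the red-black tree for the object and the object's radix
 * tree for the page descriptor. */
static struct tmem_obj **lookup_tree_root(struct tmem_client *cli,
                                          uint32_t pool_id, uint64_t object_id)
{
    if (pool_id >= MAX_POOLS || cli->pools[pool_id] == NULL)
        return NULL;
    return &cli->pools[pool_id]->obj_tree[obj_hash(object_id)];
}
```

A real lookup would continue from the returned tree root into the red-black tree and then into the object's radix tree keyed by <i>page_id</i>.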
On the other hand, the code is fairly well
-modularized so, once understood, it is possible to relatively easily switch out
-one kind of data structure for another.
-To catch problems as quickly as possible when debug is enabled, most of
-the data structures are equipped with <i>sentinels</i> and many inter-function
-assumptions are documented and tested dynamically
-with <i>assertions</i>.
-While these clutter and lengthen the tmem
-code substantially, their presence has proven invaluable on many occasions.
-<P>
-For completeness, we should also describe a key data structure in the Xen
-implementation-dependent code: the <i>tmh_page_list</i>. For security and
-performance reasons, pages that are freed due to tmem operations (such
-as <big><kbd>get</kbd></big>) are not immediately put back into Xen's pool
-of free memory (aka the Xen <i>heap</i>).
-Tmem pages may contain guest-private data that must be <i>scrubbed</i> before
-those memory pages are released for the use of other guests.
-But if a page is immediately re-used inside tmem itself, the entire
-page is overwritten with new data, so it need not be scrubbed.
-Since tmem is usually the most frequent
-customer of the Xen heap allocation code, it would be a waste of time to scrub
-a page, release it to the Xen heap, and then immediately re-allocate it
-again. So, instead, tmem maintains
-currently-unused pages of memory on its own free list, <i>tmh_page_list</i>,
-and returns the pages to Xen only when non-tmem Xen
-heap allocation requests would otherwise fail.
-
-<h2>Scalability/Concurrency</h2>
-
-<P>Tmem has been designed to be highly scalable.
-Since tmem access is invoked in ways similar in many respects
-to asynchronous disk access, a "big SMP" tmem-aware guest
-OS can, and often will, invoke tmem hypercalls simultaneously on many different
-physical CPUs. And, of course, multiple
-tmem-aware guests may independently and simultaneously invoke tmem
-hypercalls.
While the normal frequency -of tmem invocations is rarely extremely high, some tmem operations such as data -compression or lookups in a very large tree may take tens of thousands of -cycles or more to complete. Measurements -have shown that normal workloads spend no more than about 0.2% (2% with -compression enabled) of CPU time executing tmem operations. -But those familiar with OS scalability issues -recognize that even this limited execution time can create concurrency problems -in large systems and result in poorly-scalable performance. -<P> -A good locking strategy is critical to concurrency, but also -must be designed carefully to avoid deadlock and <i>livelock</i> problems. For -debugging purposes, tmem supports a "big kernel lock" which disables -concurrency altogether (enabled in Xen with "tmem_lock", but note -that this functionality is rarely tested and likely has bit-rotted). Infrequent -but invasive tmem hypercalls, such as pool creation or the control operations, -are serialized on a single <i>read-write lock</i>, called tmem_rwlock, -which must be held for writing. All other tmem operations must hold this lock -for reading, so frequent operations such as -<kbd>put</kbd></big> and <kbd>get</kbd></big> <kbd>flush</kbd></big> can execute simultaneously -as long as no invasive operations are occurring. -<P> -Once a pool has been selected, there is a per-pool -read-write lock (<i>pool_rwlock</i>) which -must be held for writing if any transformative operations might occur within -that pool, such as when an<i> obj_t</i> is -created or destroyed. For the highly -frequent operation of finding an<i> obj_t</i> -within a pool, pool_rwlock must be held for reading. -<P> -Once an object has been selected, there is a per-object -spinlock (<i>obj_spinlock)</i>. -This is a spinlock rather than a read-write -lock because nearly all of the most frequent tmem operations (e.g. 
-<big><kbd>put</kbd></big>, <big><kbd>get</kbd></big>, and <big><kbd>flush</kbd></big>)
-are transformative, in
-that they add or remove a page within the object.
-This lock is generally taken whenever an
-object lookup occurs and released when the tmem operation is complete.
-<P>
-Next, the per-client and global ephemeral lists are
-protected by a single global spinlock (<i>eph_lists_spinlock</i>)
-and the per-client persistent lists are also protected by a single global
-spinlock (<i>pers_list_spinlock</i>).
-And to complete the description of
-implementation-independent locks, if page deduplication is enabled, all pages
-for which the first byte matches are contained in one of 256 trees that are
-protected by one of 256 corresponding read-write locks
-(<i>pcd_tree_rwlocks</i>).
-<P>
-In the Xen-specific code (tmem_xen.c), page frames (e.g. struct page_info)
-that have been released are kept in a list (<i>tmh_page_list</i>) that
-is protected by a spinlock (<i>tmh_page_list_lock</i>).
-There is also an "implied" lock
-associated with compression, which is likely the most time-consuming operation
-in all of tmem (of course, only when compression is enabled): a compression
-buffer is allocated per physical CPU early in Xen boot and a pointer to
-this buffer is returned to implementation-independent code and used without a
-lock.
-<P>
-The proper method to avoid deadlocks is to take and release
-locks in a very specific predetermined order.
-Unfortunately, since tmem data structures must simultaneously be
-accessed "top-down"
-(<big><kbd>put</kbd></big> and <big><kbd>get</kbd></big>)
-and "bottom-up"
-(memory reclamation), more complex methods must be employed:
-a <i>trylock</i> mechanism is used (cf. <i>tmem_try_to_evict_pgp()</i>),
-which takes the lock if it is available but returns immediately (rather than
-spinning and waiting) if the lock is not available.
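The trylock-based reclamation walk can be sketched roughly as follows. The structures and names are hypothetical stand-ins (the real entry point named in the text is <i>tmem_try_to_evict_pgp()</i>); the point is that reclamation never waits on a lock:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-ins for the real structures; illustrative only. */
struct obj  { bool locked; int nr_pages; struct pool *pool; };
struct pool { bool write_locked; };
struct pgp  { struct obj *obj; struct pgp *next; };

/* Non-atomic stand-ins for a real trylock/unlock pair. */
static bool spin_trylock(bool *l) { if (*l) return false; *l = true; return true; }
static void spin_unlock(bool *l)  { *l = false; }

/* Walk the ephemeral LRU list from the tail; return the first page we
 * may legally evict, or NULL.  Skip any page whose object lock cannot
 * be taken, and skip the last page of an object when the pool's write
 * lock is held, rather than spinning and risking deadlock. */
static struct pgp *find_evictable(struct pgp *lru_tail)
{
    for (struct pgp *p = lru_tail; p != NULL; p = p->next) {
        struct obj *o = p->obj;
        if (!spin_trylock(&o->locked))
            continue;                 /* object busy: skip, don't spin */
        if (o->nr_pages == 1 && o->pool->write_locked) {
            spin_unlock(&o->locked);  /* would need the pool rwlock: skip */
            continue;
        }
        return p;                     /* caller evicts, then unlocks */
    }
    return NULL;
}
```

As the text notes, skipping locked objects slightly distorts the LRU order but avoids deadlock between the top-down and bottom-up paths.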
-When walking the ephemeral list to identify -pages to free, any page that belongs to an object that is locked is simply -skipped. Further, if the page is the -last page belonging to an object, and the pool read-write lock for the pool the -object belongs to is not available (for writing), that object is skipped. -These constraints modify the LRU algorithm -somewhat, but avoid the potential for deadlock. -<P> -Unfortunately, a livelock was still discovered in this approach: -When memory is scarce and each client is -<kbd>put</kbd></big>ting a large number of pages -for exactly one object (and thus holding the object spinlock for that object), -memory reclamation takes a very long time to determine that it is unable to -free any pages, and so the time to do a -<kbd>put</kbd></big> (which eventually fails) becomes linear to the -number of pages in the object! To avoid -this situation, a workaround was added to always ensure a minimum amount of -memory (1MB) is available before any object lock is taken for the client -invoking tmem (see <i>tmem_ensure_avail_pages()</i>). -Other such livelocks (and perhaps deadlocks) -may be lurking. -<P> -A last issue related to concurrency is atomicity of counters. -Tmem gathers a large number of -statistics. Some of these counters are -informational only, while some are critical to tmem operation and must be -incremented and decremented atomically to ensure, for example, that the number -of pages in a tree never goes negative if two concurrent tmem operations access -the counter exactly simultaneously. Some -of the atomic counters are used for debugging (in assertions) and perhaps need -not be atomic; fixing these may increase performance slightly by reducing -cache-coherency traffic. Similarly, some -of the non-atomic counters may yield strange results to management tools, such -as showing the total number of successful -<kbd>put</kbd></big>s as being higher than the number of -<kbd>put</kbd></big>s attempted. 
-These are left as exercises for future tmem implementors. - -<h2>Control and Manageability</h2> - -<P> -Tmem has a control interface to, for example, set various -parameters and obtain statistics. All -tmem control operations funnel through <i>do_tmem_control()</i> -and other functions supporting tmem control operations are prefixed -with <i>tmemc_</i>. - -<P> -During normal operation, even if only one tmem-aware guest -is running, tmem may absorb nearly all free memory in the system for its own -use. Then if a management tool wishes to -create a new guest (or migrate a guest from another system to this one), it may -notice that there is insufficient "free" memory and fail the creation -(or migration). For this reason, tmem -introduces a new tool-visible class of memory -- <i>freeable</i> memory -- -and provides a control interface to access -it. All ephemeral memory and all pages on the <i>tmh_page_list</i> -are freeable. To properly access freeable -memory, a management tool must follow a sequence of steps: -<ul> -<li> -<i>freeze</i> -tmem:When tmem is frozen, all -<kbd>put</kbd></big>s fail, which ensures that no -additional memory may be absorbed by tmem. -(See <i>tmemc_freeze_pools()</i>, and -note that individual clients may be frozen, though this functionality may be -used only rarely.) -<li> -<i>query freeable MB: </i>If all freeable memory were released to the Xen -heap, this is the amount of memory (in MB) that would be freed. -See <i>tmh_freeable_pages()</i>. -<li> -<i>flush</i>: -Tmem may be requested to flush, or relinquish, a certain amount of memory, e.g. -back to the Xen heap. This amount is -specified in KB. See <i ->tmemc_flush_mem()</i> and <i ->tmem_relinquish_npages()</i>. -<li> -At this point the management tool may allocate -the memory, e.g. using Xen's published interfaces. -<li> -<i>thaw</i> -tmem: This terminates the freeze, allowing tmem to accept -<kbd>put</kbd></big>s again. 
-</ul> -<P> -Extensive tmem statistics are available through tmem's -control interface (see <i>tmemc_list </i>and -the separate source for the "xm tmem-list" command and the -xen-tmem-list-parse tool). To maximize -forward/backward compatibility with future tmem and tools versions, statistical -information is passed via an ASCII interface where each individual counter is -identified by an easily parseable two-letter ASCII sequence. - -<h2>Save/Restore/Migrate</h2> - -<P> -Another piece of functionality that has a major impact on -the tmem code is support for save/restore of a tmem client and, highly related, -live migration of a tmem client. -Ephemeral pages, by definition, do not need to be saved or -live-migrated, but persistent pages are part of the state of a running VM and -so must be properly preserved. -<P> -When a save (or live-migrate) of a tmem-enabled VM is initiated, the first step -is for the tmem client to be frozen (see the manageability section). -Next, tmem API version information is -recorded (to avoid possible incompatibility issues as the tmem spec evolves in -the future). Then, certain high-level -tmem structural information specific to the client is recorded, including -information about the existing pools. -Finally, the contents of all persistent pages are recorded. -<P> -For live-migration, the process is somewhat more complicated. -Ignoring tmem for a moment, recall that in -live migration, the vast majority of the VM's memory is transferred while the -VM is still fully operational. During -each phase, memory pages belonging to the VM that are changed are marked and -then retransmitted during a later phase. -Eventually only a small amount of memory remains, the VM is paused, the -remaining memory is transmitted, and the VM is unpaused on the target machine. 
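Both the reclamation sequence from the manageability section (freeze, query freeable, flush, thaw) and a save or live-migrate begin by freezing the client. A rough tool-side sketch follows, with hypothetical wrapper names standing in for the control operations the text names (<i>tmemc_freeze_pools()</i>, <i>tmh_freeable_pages()</i>, <i>tmemc_flush_mem()</i>):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the host-side tmem state; illustrative only. */
struct tmem_host { bool frozen; long freeable_mb; };

static void tmem_freeze(struct tmem_host *h) { h->frozen = true; }
static void tmem_thaw(struct tmem_host *h)   { h->frozen = false; }
static long tmem_query_freeable_mb(struct tmem_host *h) { return h->freeable_mb; }

/* Flush up to `kb` kilobytes of freeable memory back to the Xen heap. */
static long tmem_flush_kb(struct tmem_host *h, long kb)
{
    long avail_kb = h->freeable_mb * 1024;
    long got = kb < avail_kb ? kb : avail_kb;
    h->freeable_mb -= got / 1024;
    return got;
}

/* The sequence a management tool follows to reclaim freeable memory
 * for a new guest: freeze, query, flush, allocate, thaw. */
static bool make_room_for_guest(struct tmem_host *h, long need_mb)
{
    tmem_freeze(h);                        /* no further puts absorb memory */
    if (tmem_query_freeable_mb(h) < need_mb) {
        tmem_thaw(h);
        return false;                      /* fail fast, as with claim_mode */
    }
    (void)tmem_flush_kb(h, need_mb * 1024); /* release to the Xen heap */
    /* ... allocate the guest's memory via Xen's normal interfaces ... */
    tmem_thaw(h);                          /* tmem may absorb memory again */
    return true;
}
```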
-<P> -The number of persistent tmem pages may be quite large, -possibly even larger than all the other memory used by the VM; so it is -unacceptable to transmit persistent tmem pages during the "paused" -phase of live migration. But if the VM -is still operational, it may be making calls to tmem: -A frozen tmem client will reject any -<big><kbd>put</kbd></big> operations, but tmem must -still correctly process <big><kbd>flush</kbd></big>es -(page and object), including implicit flushes due to duplicate -<big><kbd>put</kbd></big>s. -Fortunately, these operations can only -invalidate tmem pages, not overwrite tmem pages or create new pages. -So, when a live-migrate has been initiated, -the client is frozen. Then during the -"live" phase, tmem transmits all persistent pages, but also records -the handle of all persistent pages that are invalidated. -Then, during the "paused" phase, -only the handles of invalidated persistent pages are transmitted, resulting in -the invalidation on the target machine of any matching pages that were -previously transmitted during the "live" phase. -<P> -For restore (and on the target machine of a live migration), -tmem must be capable of reconstructing the internal state of the client from -the saved/migrated data. However, it is -not the client itself that is <big><kbd>put</kbd></big>'ing -the pages but the management tools conducting the restore/migration. -This slightly complicates tmem by requiring -new API calls and new functions in the implementation, but the code is -structured so that duplication is minimized. -Once all tmem data structures for the client are reconstructed, all -persistent pages are recreated and, in the case of live-migration, all -invalidations have been processed and the client has been thawed, the restored -client can be resumed. -<P> -Finally, tmem's data structures must be cluttered a bit to -support save/restore/migration. 
Notably, -a per-pool list of persistent pages must be maintained and, during live -migration, a per-client list of invalidated pages must be logged. -A reader of the code will note that these -lists are overlaid into space-sensitive data structures as a union, which may -be more error-prone but eliminates significant space waste. - -<h2>Miscellaneous Tmem Topics</h2> - -<P> -<i><b>Duplicate <big><kbd>puts</kbd></big></b></i>. -One interesting corner case that -significantly complicates the tmem source code is the possibility -of a <i>duplicate</i> -<big><kbd>put</kbd></big>, -which occurs when two -<big><kbd>put</kbd></big>s -are requested with the same handle but with possibly different data. -The tmem API addresses -<i> -<big><kbd>put</kbd></big>-<big><kbd>put</kbd></big>-<big><kbd>get</kbd></big> -coherence</i> explicitly: When a duplicate -<big><kbd>put</kbd></big> occurs, tmem may react one of two ways: (1) The -<big><kbd>put</kbd></big> may succeed with the old -data overwritten by the new data, or (2) the -<big><kbd>put</kbd></big> may be failed with the original data flushed and -neither the old nor the new data accessible. -Tmem may <i>not</i> fail the -<big><kbd>put</kbd></big> and leave the old data accessible. -<P> -When tmem has been actively working for an extended period, -system memory may be in short supply and it is possible for a memory allocation -for a page (or even a data structure such as a <i>pgd_t</i>) to fail. Thus, -for a duplicate -<big><kbd>put</kbd></big>, it may be impossible for tmem to temporarily -simultaneously maintain data structures and data for both the original -<big><kbd>put</kbd></big> and the duplicate -<big><kbd>put</kbd></big>. -When the space required for the data is -identical, tmem may be able to overwrite <i>in place </i>the old data with -the new data (option 1). But in some circumstances, such as when data -is being compressed, overwriting is not always possible and option 2 must be -performed. 
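The put-put-get coherence rule for duplicate puts can be sketched as follows, assuming a simplified store where option 1 (overwrite in place) is taken when the sizes match and option 2 (flush the old data and fail the put) is taken when allocation for the new data fails. All names are illustrative:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* A stored page's data; data == NULL means "no data accessible". */
struct stored { unsigned char *data; size_t len; };

/* Handle a duplicate put per the coherence rule in the text: either
 * (1) the put succeeds and the old data is overwritten, or (2) the put
 * fails AND the old data is flushed.  A failed put must never leave
 * the old data accessible.  Returns true if the put succeeded. */
static bool duplicate_put(struct stored *s,
                          const unsigned char *new_data, size_t new_len)
{
    if (new_len == s->len) {              /* option 1: overwrite in place */
        memcpy(s->data, new_data, new_len);
        return true;
    }
    /* Sizes differ (e.g. compressed lengths differ): need a new buffer. */
    unsigned char *buf = malloc(new_len);
    if (buf == NULL) {                    /* option 2: flush old, fail put */
        free(s->data);
        s->data = NULL;
        s->len = 0;
        return false;
    }
    memcpy(buf, new_data, new_len);
    free(s->data);
    s->data = buf;
    s->len = new_len;
    return true;
}
```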
-<P> -<i><b>Page deduplication and trailing-zero elimination.</b></i> -When page deduplication is enabled -("tmem_dedup" option to Xen), ephemeral pages for which the contents -are identical -- whether the pages belong -to the same client or different clients -- utilize the same pageframe of -memory. In Xen environments where -multiple domains have a highly similar workload, this can save a substantial -amount of memory, allowing a much larger number of ephemeral pages to be -used. Tmem page deduplication uses -methods similar to the KSM implementation in Linux [ref], but differences between -the two are sufficiently great that tmem does not directly leverage the -code. In particular, ephemeral pages in -tmem are never dirtied, so need never be <i>copied-on-write</i>. -Like KSM, however, tmem avoids hashing, -instead employing <i>red-black trees</i> -that use the entire page contents as the <i>lookup -key</i>. There may be better ways to implement this. -<P> -Dedup'ed pages may optionally be compressed -("tmem_compress" and "tmem_dedup" Xen options specified), -to save even more space, at the cost of more time. -Additionally, <i>trailing zero elimination (tze)</i> may be applied to dedup'ed -pages. With tze, pages that contain a -significant number of zeroes at the end of the page are saved without the trailing -zeroes; an all-zero page requires no data to be saved at all. -In certain workloads that utilize a large number -of small files (and for which the last partial page of a file is padded with -zeroes), a significant space savings can be realized without the high cost of -compression/decompression. -<P> -Both compression and tze significantly complicate memory -allocation. This will be discussed more below. -<P> -<b><i>Memory accounting</i>.</b> -Accounting is boring, but poor accounting may -result in some interesting problems. In -the implementation-independent code of tmem, most data structures, page frames, -and partial pages (e.g. 
for compression) are <i>billed</i> to a pool,
-and thus to a client. Some <i>infrastructure</i> data structures, such as
-pools and clients, are allocated with <i>tmh_alloc_infra()</i>, which does not
-require a pool to be specified. Two other
-exceptions are page content descriptors (<i>pcd_t</i>)
-and sharelists (<i>sharelist_t</i>), which
-are explicitly not associated with a pool/client by specifying NULL instead of
-a <i>pool_t</i>.
-(Note to self:
-These should probably just use the <i>tmh_alloc_infra()</i> interface too.)
-As we shall see, persistent pool pages and
-data structures may need to be handled a bit differently, so the
-implementation-independent layer calls a different allocation/free routine for
-persistent pages (e.g. <i>tmh_alloc_page_thispool()</i>)
-than for ephemeral pages (e.g. <i>tmh_alloc_page()</i>).
-<P>
-In the Xen-specific layer, we
-disregard the <i>pool_t</i> for ephemeral
-pages, as we use the generic Xen heap for all ephemeral pages and data
-structures. (Denial-of-service attacks
-can be handled in the implementation-independent layer because ephemeral pages
-are kept in per-client queues, each with a counted length.
-See the discussion on weights and caps below.)
-However, we explicitly bill persistent pages
-and data structures against the client/domain that is using them.
-(See the calls to the Xen routine <i>alloc_domheap_pages()</i> in tmem_xen.h; if
-the first argument is a domain, the pages allocated are billed by Xen to that
-domain.) This means that a Xen domain
-cannot allocate even a single tmem persistent page when it is currently utilizing
-its maximum assigned memory allocation!
-This is reasonable for persistent pages because, even though the data is
-not directly accessible by the domain, the data is permanently saved until
-either the domain flushes it or the domain dies.
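The billing asymmetry between ephemeral and persistent allocations can be modeled roughly as below. The real routines are <i>tmh_alloc_page()</i> and <i>tmh_alloc_page_thispool()</i> (the latter billed via <i>alloc_domheap_pages()</i>); everything here is an illustrative stand-in:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model only: ephemeral pages come from the shared heap
 * and are billed to no one domain; persistent pages are billed to the
 * owning domain and must fail once it is at its maximum allocation. */
struct domain { long pages_billed; long max_pages; };

static bool alloc_ephemeral_page(long *heap_pages)
{
    if (*heap_pages == 0)
        return false;       /* heap exhausted; eviction might help */
    (*heap_pages)--;
    return true;            /* nothing billed to any one domain */
}

static bool alloc_persistent_page(struct domain *d, long *heap_pages)
{
    if (d->pages_billed >= d->max_pages)
        return false;       /* domain at its cap: persistent put fails */
    if (*heap_pages == 0)
        return false;
    (*heap_pages)--;
    d->pages_billed++;      /* billed as with alloc_domheap_pages(d, ...) */
    return true;
}
```

This is why, as the text says, a domain already at its maximum allocation cannot store even one persistent tmem page.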
-<P> -Note that proper accounting requires (even for ephemeral pools) that the same -pool is referenced when memory is freed as when it was allocated, even if the -ownership of a pool has been moved from one client to another (c.f. <i ->shared_pool_reassign()</i>). -The underlying Xen-specific information may -not always enforce this for ephemeral pools, but incorrect alloc/free matching -can cause some difficult-to-find memory leaks and bent pointers. -<P> -Page deduplication is not possible for persistent pools for -accounting reasons: Imagine a page that is created by persistent pool A, which -belongs to a domain that is currently well under its maximum allocation. -Then the <i>pcd_t</i>is matched by persistent pool B, which is -currently at its maximum. -Then the domain owning pool A is destroyed. -Is B beyond its maximum? -(There may be a clever way around this -problem. Exercise for the reader!) -<P> -<b><i>Memory allocation.</i></b> The implementation-independent layer assumes -there is a good fast general-purpose dynamic memory allocator with bounded -response time and efficient use of memory for a very large number of sub-page -allocations. The old xmalloc memory -allocator in Xen was not a good match for this purpose, so was replaced by the -TLSF allocator. Note that the TLSF -allocator is used only for allocations smaller than a page (and, more -precisely, no larger than <i>tmem_subpage_maxsize()</i>); -full pages are allocated by Xen's normal heap allocator. -<P> -After the TLSF allocator was integrated into Xen, more work -was required so that each client could allocate memory from a separate -independent pool. (See the call to <i>xmem_pool_create()</i>in -<i>tmh_client_init()</i>.) -This allows the data structures allocated for the -purpose of supporting persistent pages to be billed to the same client as the -pages themselves. It also allows partial -(e.g. compressed) pages to be properly billed. 
-Further, when partial page allocations cause internal fragmentation, -this fragmentation can be isolated per-client. -And, when a domain dies, full pages can be freed, rather than only -partial pages. One other change was -required in the TLSF allocator: In the original version, when a TLSF memory -pool was allocated, the first page of memory was also allocated. -Since, for a persistent pool, this page would -be billed to the client, the allocation of the first page failed if the domain -was started at its maximum memory, and this resulted in a failure to create the -memory pool. To avoid this, the code was -changed to delay the allocation of the first page until first use of the memory -pool. -<P> -<b><i>Memory allocation interdependency.</i></b> -As previously described, -pages of memory must be moveable back and forth between the Xen heap and the -tmem ephemeral lists (and page lists). -When tmem needs a page but doesn't have one, it requests one from the -Xen heap (either indirectly via xmalloc, or directly via Xen's <i ->alloc_domheap_pages()</i>). -And when Xen needs a page but doesn't have -one, it requests one from tmem (via a call to <i ->tmem_relinquish_pages()</i> in Xen's <i ->alloc_heap_pages() </i>in page_alloc.c). -This leads to a potential infinite loop! -To break this loop, a new memory flag (<i>MEMF_tmem</i>) was added to Xen -to flag and disallow the loop. -See <i>tmh_called_from_tmem()</i> -in <i>tmem_relinquish_pages()</i>. -Note that the <i ->tmem_relinquish_pages()</i> interface allows for memory requests of -order > 0 (multiple contiguous pages), but the tmem implementation disallows -any requests larger than a single page. -<P> -<b><i>LRU page reclamation</i></b>. -Ephemeral pages generally <i>age </i>in -a queue, and the space associated with the oldest -- or <i ->least-recently-used -- </i>page is reclaimed when tmem needs more -memory. But there are a few exceptions -to strict LRU queuing. 
First is when -removal from a queue is constrained by locks, as previously described above. -Second, when an ephemeral pool is <i>shared,</i> unlike a private ephemeral -pool, a -<big><kbd>get</kbd></big> -does not imply a -<big><kbd>flush</kbd></big> -Instead, in a shared pool, a -results in the page being promoted to the front of the queue. -Third, when a page that is deduplicated (i.e. -is referenced by more than one <i>pgp_</i>t) -reaches the end of the LRU queue, it is marked as <i ->eviction attempted</i> and promoted to the front of the queue; if it -reaches the end of the queue a second time, eviction occurs. -Note that only the <i ->pgp_</i>t is evicted; the actual data is only reclaimed if there is no -other <i>pgp_t </i>pointing to the data. -<P> -All of these modified- LRU algorithms deserve to be studied -carefully against a broad range of workloads. -<P> -<b><i>Internal fragmentation</i>.</b> -When -compression or tze is enabled, allocations between a half-page and a full-page -in size are very common and this places a great deal of pressure on even the -best memory allocator. Additionally, -problems may be caused for memory reclamation: When one tmem ephemeral page is -evicted, only a fragment of a physical page of memory might be reclaimed. -As a result, when compression or tze is -enabled, it may take a very large number of eviction attempts to free up a full -contiguous page of memory and so, to avoid near-infinite loops and livelocks, eviction -must be assumed to be able to fail. -While all memory allocation paths in tmem are resilient to failure, very -complex corner cases may eventually occur. -As a result, compression and tze are disabled by default and should be -used with caution until they have been tested with a much broader set of -workloads.(Note to self: The -code needs work.) 
-<P> -<b><i>Weights and caps</i>.</b> -Because -of the just-discussed LRU-based eviction algorithms, a client that uses tmem at -a very high frequency can quickly swamp tmem so that it provides little benefit -to a client that uses it less frequently. -To reduce the possibility of this denial-of-service, limits can be -specified via management tools that are enforced internally by tmem. -On Xen, the "xm tmem-set" command -can specify "weight=<weight>" or "cap=<cap>" -for any client. If weight is non-zero -for a client and the current percentage of ephemeral pages in use by the client -exceeds its share (as measured by the sum of weights of all clients), the next -page chosen for eviction is selected from the requesting client's ephemeral -queue, instead of the global ephemeral queue that contains pages from all -clients.(See <i>client_over_quota().</i>) -Setting a cap for a client is currently a no-op. -<P> -<b><i>Shared pools and authentication.</i></b> -When tmem was first proposed to the linux kernel mailing list -(LKML), there was concern expressed about security of shared ephemeral -pools. The initial tmem implementation only -required a client to provide a 128-bit UUID to identify a shared pool, and the -linux-side tmem implementation obtained this UUID from the superblock of the -shared filesystem (in ocfs2). It was -pointed out on LKML that the UUID was essentially a security key and any -malicious domain that guessed it would have access to any data from the shared -filesystem that found its way into tmem. -Ocfs2 has only very limited security; it is assumed that anyone who can -access the filesystem bits on the shared disk can mount the filesystem and use -it. But in a virtualized data center, -higher isolation requirements may apply. -As a result, management tools must explicitly authenticate (or may -explicitly deny) shared pool access to any client. -On Xen, this is done with the "xl -tmem-shared-auth" command. 
-<P> -<b><i>32-bit implementation</i>.</b> -There was some effort put into getting tmem working on a 32-bit Xen. -However, the Xen heap is limited in size on -32-bit Xen so tmem did not work very well. -There are still 32-bit ifdefs in some places in the code, but things may -have bit-rotted so using tmem on a 32-bit Xen is not recommended. - -<h2>Known Issues</h2> - -<p><b><i>Fragmentation.</i></b>When tmem -is active, all physically memory becomes <i>fragmented</i> -into individual pages. However, the Xen -memory allocator allows memory to be requested in multi-page contiguous -quantities, called order>0 allocations. -(e.g. 2<sup>order</sup> so -order==4 is sixteen contiguous pages.) -In some cases, a request for a larger order will fail gracefully if no -matching contiguous allocation is available from Xen. -As of Xen 4.0, however, there are several -critical order>0 allocation requests that do not fail gracefully. -Notably, when a domain is created, and -order==4 structure is required or the domain creation will fail. -And shadow paging requires many order==2 -allocations; if these fail, a PV live-migration may fail. -There are likely other such issues. -<P> -But, fragmentation can occur even without tmem if any domU does -any extensive ballooning; tmem just accelerates the fragmentation. -So the fragmentation problem must be solved -anyway. The best solution is to disallow -order>0 allocations altogether in Xen -- or at least ensure that any attempt -to allocate order>0 can fail gracefully, e.g. by falling back to a sequence -of single page allocations. However this restriction may require a major rewrite -in some of Xen's most sensitive code. -(Note that order>0 allocations during Xen boot and early in domain0 -launch are safe and, if dom0 does not enable tmem, any order>0 allocation by -dom0 is safe, until the first domU is created.) 
-<P> -Until Xen can be rewritten to be <i>fragmentation-safe</i>, a small hack -was added in the Xen page -allocator.(See the comment " -memory is scarce" in <i>alloc_heap_pages()</i>.) -Briefly, a portion of memory is pre-reserved -for allocations where order>0 and order<9. -(Domain creation uses 2MB pages, but fails -gracefully, and there are no other known order==9 allocations or order>9 -allocations currently in Xen.) -<P> -<b><i>NUMA</i></b>. Tmem assumes that -all memory pages are equal and any RAM page can store a page of data for any -client. This has potential performance -consequences in any NUMA machine where access to <i ->far memory</i> is significantly slower than access to <i ->near memory</i>. -On nearly all of today's servers, however, -access times to <i>far memory</i> is still -much faster than access to disk or network-based storage, and tmem's primary performance -advantage comes from the fact that paging and swapping are reduced. -So, the current tmem implementation ignores -NUMA-ness; future tmem design for NUMA machines is an exercise left for the -reader. - -<h2>Bibliography</h2> - -<P> -(needs work)<b style='mso-bidi-font-weight:> -<P><a href="http://oss.oracle.com/projects/tmem">http://oss.oracle.com/projects/tmem</a> diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown index 9028bcde2e..fe891ef074 100644 --- a/docs/misc/xen-command-line.markdown +++ b/docs/misc/xen-command-line.markdown @@ -1993,12 +1993,6 @@ pages) must also be specified via the tbuf\_size parameter. 
### timer\_slop > `= <integer>` -### tmem -> `= <boolean>` - -### tmem\_compress -> `= <boolean>` - ### tsc (x86) > `= unstable | skewed | stable:socket` diff --git a/docs/misc/xsm-flask.txt b/docs/misc/xsm-flask.txt index 62f15dde84..40e5fc845e 100644 --- a/docs/misc/xsm-flask.txt +++ b/docs/misc/xsm-flask.txt @@ -81,42 +81,6 @@ __HYPERVISOR_memory_op (xen/include/public/memory.h) * XENMEM_get_pod_target * XENMEM_claim_pages -__HYPERVISOR_tmem_op (xen/include/public/tmem.h) - - The following tmem control ops, that is the sub-subops of - TMEM_CONTROL, are covered by this statement. - - Note that TMEM is also subject to a similar policy arising from - XSA-15 http://lists.xen.org/archives/html/xen-announce/2012-09/msg00006.html. - Due to this existing policy all TMEM Ops are already subject to - reduced security support. - - * TMEMC_THAW - * TMEMC_FREEZE - * TMEMC_FLUSH - * TMEMC_DESTROY - * TMEMC_LIST - * TMEMC_SET_WEIGHT - * TMEMC_SET_CAP - * TMEMC_SET_COMPRESS - * TMEMC_QUERY_FREEABLE_MB - * TMEMC_SAVE_BEGIN - * TMEMC_SAVE_GET_VERSION - * TMEMC_SAVE_GET_MAXPOOLS - * TMEMC_SAVE_GET_CLIENT_WEIGHT - * TMEMC_SAVE_GET_CLIENT_CAP - * TMEMC_SAVE_GET_CLIENT_FLAGS - * TMEMC_SAVE_GET_POOL_FLAGS - * TMEMC_SAVE_GET_POOL_NPAGES - * TMEMC_SAVE_GET_POOL_UUID - * TMEMC_SAVE_GET_NEXT_PAGE - * TMEMC_SAVE_GET_NEXT_INV - * TMEMC_SAVE_END - * TMEMC_RESTORE_BEGIN - * TMEMC_RESTORE_PUT_PAGE - * TMEMC_RESTORE_FLUSH_PAGE - - Setting up FLASK ---------------- -- 2.11.0 _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxxxxxxxxx https://lists.xenproject.org/mailman/listinfo/xen-devel