[Xen-devel] [RFC] design/API for plugging tmem into existing xen physical memory management code
Keir (and xen physical memory management experts) --

Alright, I think I am ready for the final step of plugging tmem into the existing xen physical memory management code. ** This is a bit long, but I'd appreciate some design feedback before I proceed with this. And that requires a bit of background explanation... if this isn't enough background, I'll be happy to answer any questions.

(Note that tmem is not intended to be deployed on a 32-bit hypervisor -- due to xenheap constraints -- and should port easily (though hasn't been ported yet) to ia64. It is currently controlled by a xen command-line option, default off; and it requires tmem-modified guests.)

Tmem absorbs essentially all free memory on the machine for its use, but the vast majority of that memory can be easily freed, synchronously and on demand, for other uses. Tmem now maintains its own page list, tmem_page_list, which holds tmem pages when they (temporarily) don't contain data. (There's no sense scrubbing and freeing these to xenheap or domheap when tmem is just going to grab them again and overwrite them anyway.)

So tmem holds three types of memory:

(1) machine pages (4K) on the tmem_page_list
(2) pages containing "ephemeral" data managed by tmem
(3) pages containing "persistent" data managed by tmem

Pages regularly move back and forth between ((2) or (3)) and (1) as part of tmem's normal operation. When a page is moved "involuntarily" from (2) to (1), we call this an "eviction". Note that, due to compression, evicting a tmem ephemeral data page does not necessarily free up a raw machine page (4K) of memory... partial pages are kept in a tmem-specific tlsf pool, and tlsf frees up the machine page when all allocations on it are freed. (tlsf is the mechanism underlying the new highly-efficient xmalloc added to xen-unstable late last year.)

Now let's assume that Xen has need of memory but tmem has absorbed it all. Xen's demand is always one of the following (here, a page is a raw machine page (4K)):

(A) a page
(B) a large number of individual non-consecutive pages
(C) a block of 2**N consecutive pages (order N > 0)

Of these:

(A) eventually finds its way to alloc_heap_pages()
(B) happens in (at least) two circumstances: (i) when a new domain is created, and (ii) when a domain makes a balloon request
(C) happens mostly at system startup and then rarely after that (when? why? see below)

Tmem will export this API:

(a) struct page_info *tmem_relinquish_page(void)
(b) struct page_info *tmem_relinquish_pageblock(int order)
(c) uint32_t tmem_evict_npages(uint32_t npages)
(d) uint32_t tmem_relinquish_pages(uint32_t npages)

(a) and (b) are internal to the hypervisor. (c) and (d) are internal and also accessible via privileged hypercall.

(a) is fairly straightforward and synchronous, though it may be a bit slow since it has to scrub the page before returning. If there is a page on tmem_page_list, it will (scrub and) return it. If not, it will evict tmem ephemeral data until a page is freed to tmem_page_list, and then it will (scrub and) return that. If tmem has no more ephemeral pages to evict and there's nothing on tmem_page_list, it will return NULL. (a) can be used in, for example, alloc_heap_pages() when "No suitable memory blocks" can be found, so as to avoid failing the request.
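To make the intended semantics of (a) concrete, here's a rough sketch -- not real code; the helpers (tmem_page_list_get, tmem_evict_one_ephemeral_page, tmem_scrub_page) are just placeholders for tmem internals, not final names:

    /* Placeholder declarations, standing in for tmem internals: */
    struct page_info *tmem_page_list_get(void);  /* pop a page, or NULL if empty */
    int tmem_evict_one_ephemeral_page(void);     /* whole pages freed, <0 if none left */
    void tmem_scrub_page(struct page_info *pg);  /* zero the page contents */

    struct page_info *tmem_relinquish_page(void)
    {
        struct page_info *pg;

        /* Evict ephemeral data until a whole machine page lands on
         * tmem_page_list; due to compression, a single eviction may
         * free no whole page at all, hence the loop. */
        while ( (pg = tmem_page_list_get()) == NULL )
            if ( tmem_evict_one_ephemeral_page() < 0 )
                return NULL;  /* nothing left to evict; caller must cope */

        /* The page may still hold another guest's data, so scrub it
         * before handing it back to the rest of Xen. */
        tmem_scrub_page(pg);
        return pg;
    }

The natural call site would be the failure path of alloc_heap_pages() (for order 0), as a last try before returning failure -- but exactly where such hooks belong is part of what I'm asking about in the questions below.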
(b) is similar but is used if order > 0 (i.e. a bigger chunk of pages is needed). It works the same way except that, due to fragmentation, it may have to evict MANY pages, in fact possibly ALL ephemeral data. Even then it still may not find enough consecutive pages to satisfy the request. Further, tmem doesn't use a buddy allocator... because it uses nothing larger than a machine page, it never needs one internally. So all of those pages need to be scrubbed and freed to the xen heap before it can even be determined whether the request can be satisfied. As a result, this is potentially VERY slow and still has a high probability of failure. Fortunately, requests for order > 0 are, I think, rare.

(c) and (d) are intentionally not combined. (c) evicts tmem ephemeral pages until it has added at least npages (machine pages) to tmem_page_list; this may be slow. For (d), I'm thinking it will transfer npages from tmem_page_list to the scrub_list, where the existing page_scrub_timer will eventually scrub them and free them to xen's heap. (c) will return the number of pages it successfully added to tmem_page_list, and (d) will return the number of pages it successfully moved from tmem_page_list to the scrub_list. (A rough sketch of this pair is appended at the end of this mail.)

So this leaves some design questions:

1) Does this design make sense?
2) Are there places other than alloc_heap_pages() in Xen where I need to add "hooks" for tmem to relinquish a page or a block of pages?
3) Are there any other circumstances I've forgotten where large npages are requested?
4) Does anybody have a list of alloc requests of order > 0 that occur after xen startup (e.g. when launching a new domain) and the consequences of failing such a request? I'd consider not providing interface (b) at all if this never happens, or if multi-page requests always fail gracefully (e.g. get broken into smaller-order requests). I'm thinking for now that I may not implement it -- just fail it, printk, and see if any bad things happen.

Thanks for taking the time to read through this... any feedback is appreciated.

Dan

** tmem has been working for months, but the code has until now allocated (and freed) to (and from) xenheap and domheap. This has been a security hole, as the pages were released unscrubbed and so data could easily leak between domains. Obviously this needed to be fixed :-) And scrubbing at every transfer from tmem to domheap/xenheap would be a huge waste of CPU cycles, especially since the most likely next consumer of that same page is tmem again.
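As promised above, here is a rough sketch of the intended semantics of (c) and (d). Again, the helpers are placeholders for tmem internals; in particular the hand-off to the scrub_list is shown as a made-up tmem_queue_page_for_scrub() rather than whatever the real interaction with scrub_list/page_scrub_timer ends up looking like:

    /* Placeholder declarations, standing in for tmem internals: */
    struct page_info *tmem_page_list_get(void);            /* pop a page, or NULL if empty */
    int tmem_evict_one_ephemeral_page(void);                /* whole pages freed, <0 if none left */
    void tmem_queue_page_for_scrub(struct page_info *pg);   /* hand off to Xen's scrub_list */

    /* (c): evict ephemeral data until at least npages machine pages have
     * been returned to tmem_page_list (or nothing is left to evict).
     * Returns the number of pages actually added. */
    uint32_t tmem_evict_npages(uint32_t npages)
    {
        uint32_t freed = 0;
        int n;

        while ( freed < npages )
        {
            /* Each eviction may free zero or more whole machine pages,
             * since compressed data shares pages via the tlsf pool. */
            if ( (n = tmem_evict_one_ephemeral_page()) < 0 )
                break;  /* no ephemeral data left to evict */
            freed += n;
        }
        return freed;
    }

    /* (d): move up to npages pages from tmem_page_list to the scrub_list,
     * where the existing page_scrub_timer will scrub them and free them
     * to xen's heap.  Returns the number of pages actually moved. */
    uint32_t tmem_relinquish_pages(uint32_t npages)
    {
        uint32_t moved;
        struct page_info *pg;

        for ( moved = 0; moved < npages; moved++ )
        {
            if ( (pg = tmem_page_list_get()) == NULL )
                break;  /* tmem_page_list exhausted */
            tmem_queue_page_for_scrub(pg);
        }
        return moved;
    }

A privileged caller would typically invoke (c) first and then call (d) with (at most) the count that (c) returned.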