[Xen-devel] vNUMA and automatic numa balancing
Hi,

During work on the vNUMA enabling patchset and PV domain test runs, I found out that once NUMA is enabled for a PV guest, occasional oopses appear. This was caused by a missing set_pmd_at function in pv_mmu_ops. The problem does not appear if automatic NUMA balancing is turned off for the PV kernel, since set_pmd_at is then never used. I have added a set_pmd_at which sets the correct flags for NUMA autobalancing to work (sketch 1 at the end of this mail). Background on automatic NUMA balancing is here: http://lwn.net/Articles/528881/.

After set_pmd_at was added, an issue with the rss reference count appeared. The problem is related to exit_mmap not correctly releasing vma areas whose pages had _PAGE_PRESENT cleared and _PAGE_NUMA set, but were never actually migrated. See http://pastebin.com/eFP5zc62. The test that shows it is the following command:

dd if=/dev/xvda1 of=/tee bs=4096 count=1000056 &

Depending on the configuration and the command, the rss count is not correctly set at xen_exit_mmap time for different MM counters. In the example above you can see that this is the rss count for MM_ANONPAGES, but in other cases it can be MM_FILEPAGES. Maybe this rss count for the mm is off for some other reason.

Another bit was added when I forced the PV kernel to substitute the _PAGE_NUMA bit right before issuing mmu_update. Why? _PAGE_NUMA = _PAGE_PROTNONE = 0x100 in Linux, while 0x100 is _PAGE_GLOBAL in Xen, which is used for user mappings and is allowed to be set on ptes. So instead of 0x100 (_PAGE_NUMA) I decided to hand over to Xen the unused bit _PAGE_AVAIL2 = 0x800. All mmu pvops were updated to translate that bit, so Xen sees the 0x800 bit set for NUMA pte/pmd entries (sketch 2 below). I also had to make a brutal hack (as a proof of concept) for the pmd_numa/pte_numa checks, because this new flag is not set back to Linux's _PAGE_NUMA after the page fault trap in Xen. Initially the plan was to flip it back in the Xen page fault trap, but I was unable to reliably identify where exactly in the page fault handler this check should go and how to handle it. I am not sure this is the correct way, so all suggestions are welcome.

With all of that in place, Linux basically takes page faults on NUMA pages (with the corrected pmd_numa check, do_page_fault runs do_numa_page) and launches page migration once the accumulated number of faults on _PAGE_NUMA pages exceeds whatever threshold is configured. I see this recursive fault when running a vNUMA PV domain: http://pastebin.com/cxy1j4u1

On the stack in the first oops there are two values which are page fault exception codes:

[ 2.275054] ffffffff810f5756 ffffffff81639ad0 0000000000000010 0000000000000001

The last one would mean absence of the page, and the first means _PAGE_WRITE and _PAGE_PRESENT are not set (that is, the page is resident, but not accessible)? Does anybody have any pointers on this, in case I am missing anything? It looks like the second exception (page not present) is also not handled properly in this case. I welcome any comments and questions, and I will provide additional details if some parts are not clear.

A couple of questions about the Xen page fault trap:

a) In spurious_page_fault/__page_fault_type in traps.c a page walk is performed and the page table entries are compared with the required_flags field, which includes _PAGE_PRESENT for all levels of page table entries from l4 to l1 (sketch 3 below shows the check I mean). Does this flag have a different interpretation for l4 than for l2 or l1? Or is it the same interpretation, meaning that the next level of page table entries should be checked as the one that caused the page fault?
b) In the spurious_page_fault routine the page fault is not supposed to be fixed, as I understand it, but only to determine its type (real, smep, ...). If spurious_page_fault detects a real fault, it may then be transparently fixed later or returned to the guest handler as is. Is this correct?

Thank you!

--
Elena
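P.S. Appending the sketches referenced above for completeness. They are simplified, and the exact names and wiring are my assumptions here rather than verbatim patch content.

1) The set_pmd_at hook. A minimal sketch, assuming it lives in arch/x86/xen/mmu.c and simply forwards to the existing xen_set_pmd() path, so that the pmd written by automatic NUMA balancing goes through the normal PV update path:

    /* Sketch only: route set_pmd_at() through the PV pmd update path. */
    static void xen_set_pmd_at(struct mm_struct *mm, unsigned long addr,
                               pmd_t *pmdp, pmd_t pmd)
    {
            /* mm and addr are not needed for the update itself;
             * xen_set_pmd() batches or issues the mmu_update hypercall. */
            xen_set_pmd(pmdp, pmd);
    }

    /* hooked up in the xen_mmu_ops template, roughly: */
    /*      .set_pmd_at = xen_set_pmd_at,           */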
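2) The _PAGE_NUMA <-> _PAGE_AVAIL2 translation. A minimal sketch of the idea, assuming small helpers called from the pte/pmd value conversion pvops (the xen_pte_val()/xen_make_pte() family); the helper names are illustrative only:

    /* Linux uses 0x100 (_PAGE_PROTNONE) as _PAGE_NUMA, but 0x100 is
     * _PAGE_GLOBAL on the Xen side, so park the hint in the unused
     * software bit _PAGE_AVAIL2 (0x800) before mmu_update and restore
     * it when reading the entry back. */
    #define XEN_PTE_NUMA    0x800UL         /* _PAGE_AVAIL2 */

    static pteval_t xen_numa_to_avail2(pteval_t val)    /* Linux -> Xen */
    {
            if (val & _PAGE_NUMA) {
                    val &= ~(pteval_t)_PAGE_NUMA;
                    val |= XEN_PTE_NUMA;
            }
            return val;
    }

    static pteval_t xen_avail2_to_numa(pteval_t val)    /* Xen -> Linux */
    {
            if (val & XEN_PTE_NUMA) {
                    val &= ~XEN_PTE_NUMA;
                    val |= _PAGE_NUMA;
            }
            return val;
    }

The brutal hack mentioned above sits on top of this: the pmd_numa()/pte_numa() checks are made to accept the 0x800 form as well, since the bit does not get flipped back to 0x100 after the Xen page fault trap.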
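3) The check from question a), roughly as I read it in xen/arch/x86/traps.c (paraphrased, not a verbatim quote):

    /* __page_fault_type() builds one mask from the error code ... */
    required_flags = _PAGE_PRESENT;
    if ( error_code & PFEC_write_access )
        required_flags |= _PAGE_RW;
    if ( error_code & PFEC_user_mode )
        required_flags |= _PAGE_USER;

    /* ... and applies the same test at every level of the walk,
     * e.g. for the l4 entry (likewise for l3/l2/l1): */
    if ( (l4e_get_flags(l4e) & required_flags) != required_flags )
        return real_fault;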