Re: [Xen-devel] Assertion 'l1e_get_pfn(MAPCACHE_L1ENT(hashent->idx)) == hashent->mfn' failed at domain_page.c:203
>>> On 02.12.13 at 21:33, Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx> wrote:
> (XEN) ----[ Xen-4.4-unstable x86_64 debug=y Not tainted ]----
> (XEN) CPU: 6
> (XEN) RIP: e008:[<ffff82d08016187b>] map_domain_page+0x1fb/0x4af
> (XEN) RFLAGS: 0000000000010087 CONTEXT: hypervisor
> (XEN) rax: 0000000000244dbd rbx: ffff83042cb59000 rcx: ffff810000000000
> (XEN) rdx: 000000f820060006 rsi: 0000004100200090 rdi: 0000000000000000
> (XEN) rbp: ffff83042cb67db8 rsp: ffff83042cb67d78 r8: 00000000deadbeef
> (XEN) r9: 00000000deadbeef r10: ffff82d08023d160 r11: 0000000000000246
> (XEN) r12: ffff8300ba712000 r13: 0000000000244dbd r14: 0000000000000012
> (XEN) r15: 0000000000000005 cr0: 0000000080050033 cr4: 00000000000406f0
> (XEN) cr3: 00000002e03c2000 cr2: 000000370d4de180
> (XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: e010 cs: e008
> (XEN) Xen stack trace from rsp=ffff83042cb67d78:
> (XEN) 0000000000000f2a 0000000000000286 0000003c6c3d8dea 0000000000244dbd
> (XEN) ffff82e00489b7a0 0000000000000000 ffff880026625c60 0000000000000000
> (XEN) ffff83042cb67ef8 ffff82d08017b69f ffff83042cb67dd8 ffff82d08015cc0b
> (XEN) ffff83042cb67e38 ffff82d080160a8b 0000000000000000 0000000000000000
> (XEN) 0000000000000000 ffff83042cb67ea8 0000000000000000 0000000000244dbd
> (XEN) ffff8300ba712000 0000000000000000 0000000000000000 ffff820040069240
> (XEN) 00007ff000000000 0000000000000000 ffff82e00489b7a0 ffff83042cb59000
> (XEN) ffff83042cb67eb8 ffff83042cb60000 ffff83042cb60000 0000000500000000
> (XEN) ffff83042cb59000 ffff8300ba712000 ffff83042cb59000 0000000500000001
> (XEN) ffff83042cb67f08 0000000000000000 ffff83042cb67f18 00000000ba712000
> (XEN) 0000000244dbd6f0 0000000417a0e025 ffff83042cb67f08 ffff8300ba712000
> (XEN) ffff88011a98f6f0 0000000417a0e025 0000000000000000 0000000417a0e025
> (XEN) 00007cfbd34980c7 ffff82d0802248db ffffffff8100102a 0000000000000001
> (XEN) 0000000001e097f8 0000000001dc2010 0000000001dc77e0 0000000000000000
> (XEN) ffff880026625c98 00000000000006f0 0000000000000246 0000000000007ff0
> (XEN) ffffea00044a41dc 0000000000000000 0000000000000001 ffffffff8100102a
> (XEN) 0000000000000000 0000000000000001 ffff880026625c60 0001010000000000
> (XEN) ffffffff8100102a 000000000000e033 0000000000000246 ffff880026625c48
> (XEN) 000000000000e02b ffffffffffffbeef ffffffffffffbeef ffffffffffffbeef
> (XEN) ffffffffffffbeef ffffffff00000006 ffff8300ba712000 00000033ac85d080
> (XEN) Xen call trace:
> (XEN) [<ffff82d08016187b>] map_domain_page+0x1fb/0x4af
> (XEN) [<ffff82d08017b69f>] do_mmu_update+0x6cb/0x19aa
> (XEN) [<ffff82d0802248db>] syscall_enter+0xeb/0x145
> (XEN)
> (XEN)
> (XEN) ****************************************
> (XEN) Panic on CPU 6:
> (XEN) Assertion 'l1e_get_pfn(MAPCACHE_L1ENT(idx)) == mfn' failed at
> domain_page.c:94
> (XEN) ****************************************
This second report provides more information than the first, and
makes clear that the assertion did indeed catch some (earlier)
corruption. The relevant piece of code from map_domain_page(),
annotated with the actual register values, is:
FFFF82D08016183A  mov esi, r14d             ; R14 = 00000012
FFFF82D08016183D  shl rsi, 0C               ; RSI = 00012000
FFFF82D080161841  mov rdx, FFFF820040000000
FFFF82D08016184B  add rsi, rdx              ; RSI = FFFF820040012000
FFFF82D08016184E  shl rsi, 10               ; RSI = 8200400120000000
FFFF82D080161852  shr rsi, 19               ; RSI = 4100200090
FFFF82D080161856  mov rdx, 000FFFFFFFFFF000
FFFF82D080161860  mov rcx, FFFF810000000000 ; LINEAR_PT_VIRT_START
FFFF82D08016186A  and rdx, [rsi+rcx]        ; RSI = 4100200090,
                                            ; RCX = ffff810000000000 -> ffff814100200090
FFFF82D08016186E  shr rdx, 0C
FFFF82D080161872  cmp rax, rdx              ; RAX = 00244dbd, RDX = f820060006
                                            ; (dcache->garbage = FFFF820060006000)
FFFF82D080161875  je  FFFF82D080161AF5
FFFF82D08016187B  *** ud2
In other words, something copied dcache->garbage (a linear
address) into __linear_l1_table[]. Since there is only a single
l1e_write() in domain_page.c that writes anything other than
l1e_empty(), and since that code (judging by the disassembly)
clearly uses nothing but the passed-in value, I cannot see how
this could happen. Yet, with the value being one only ever used
in domain_page.c, it is almost certain that the code here does
something wrong under some specific condition.
The first crash, being on a different CPU, performs an unmap of
the exact same MFN that is being mapped above, but - due to
being on a different CPU - necessarily uses a different entry
and hence a different slot in the linear L1 table. With _both_
slots corrupted, there must have been more than one bogus
write earlier on.
The only debugging I see possible right now would be to
sanity check the whole involved linear L1 table range on both
entry to and exit from {,un}map_domain_page(). But that
would likely have a severe performance impact, possibly hiding
the problem...
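As a rough illustration of what such a sanity check could look like (a
sketch only: l1e_looks_corrupt(), scan_mapcache_l1() and the comparison
against a max_page bound are assumptions for this example, not Xen code):
a linear address copied into a PTE decodes to a "frame number" far beyond
any real MFN, so a simple bounds check on present entries would catch
exactly the corruption seen here.

```c
#include <stddef.h>
#include <stdint.h>

#define _PAGE_PRESENT 0x1ULL

static uint64_t l1e_get_pfn(uint64_t pte)
{
    return (pte & 0x000FFFFFFFFFF000ULL) >> 12;
}

/* A PTE that had a linear address written into it decodes to a frame
 * number far above any plausible max_page (f820060006 in this crash),
 * so a present entry with an out-of-range pfn is suspicious. */
static int l1e_looks_corrupt(uint64_t pte, uint64_t max_page)
{
    return (pte & _PAGE_PRESENT) && l1e_get_pfn(pte) >= max_page;
}

/* Scan a mapcache L1 entry range; return the first suspicious index,
 * or -1 if every entry looks sane. */
static long scan_mapcache_l1(const uint64_t *l1tab, size_t n,
                             uint64_t max_page)
{
    for (size_t i = 0; i < n; i++)
        if (l1e_looks_corrupt(l1tab[i], max_page))
            return (long)i;
    return -1;
}
```

Running such a scan on entry and exit of {,un}map_domain_page() would
pinpoint the first bogus write, at the performance cost noted above.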
Jan
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel