Re: [Xen-devel] Assertion 'l1e_get_pfn(MAPCACHE_L1ENT(hashent->idx)) == hashent->mfn' failed at domain_page.c:203
>>> On 02.12.13 at 21:33, Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx> wrote:
> (XEN) ----[ Xen-4.4-unstable x86_64 debug=y Not tainted ]----
> (XEN) CPU: 6
> (XEN) RIP: e008:[<ffff82d08016187b>] map_domain_page+0x1fb/0x4af
> (XEN) RFLAGS: 0000000000010087 CONTEXT: hypervisor
> (XEN) rax: 0000000000244dbd rbx: ffff83042cb59000 rcx: ffff810000000000
> (XEN) rdx: 000000f820060006 rsi: 0000004100200090 rdi: 0000000000000000
> (XEN) rbp: ffff83042cb67db8 rsp: ffff83042cb67d78 r8: 00000000deadbeef
> (XEN) r9: 00000000deadbeef r10: ffff82d08023d160 r11: 0000000000000246
> (XEN) r12: ffff8300ba712000 r13: 0000000000244dbd r14: 0000000000000012
> (XEN) r15: 0000000000000005 cr0: 0000000080050033 cr4: 00000000000406f0
> (XEN) cr3: 00000002e03c2000 cr2: 000000370d4de180
> (XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: e010 cs: e008
> (XEN) Xen stack trace from rsp=ffff83042cb67d78:
> (XEN) 0000000000000f2a 0000000000000286 0000003c6c3d8dea 0000000000244dbd
> (XEN) ffff82e00489b7a0 0000000000000000 ffff880026625c60 0000000000000000
> (XEN) ffff83042cb67ef8 ffff82d08017b69f ffff83042cb67dd8 ffff82d08015cc0b
> (XEN) ffff83042cb67e38 ffff82d080160a8b 0000000000000000 0000000000000000
> (XEN) 0000000000000000 ffff83042cb67ea8 0000000000000000 0000000000244dbd
> (XEN) ffff8300ba712000 0000000000000000 0000000000000000 ffff820040069240
> (XEN) 00007ff000000000 0000000000000000 ffff82e00489b7a0 ffff83042cb59000
> (XEN) ffff83042cb67eb8 ffff83042cb60000 ffff83042cb60000 0000000500000000
> (XEN) ffff83042cb59000 ffff8300ba712000 ffff83042cb59000 0000000500000001
> (XEN) ffff83042cb67f08 0000000000000000 ffff83042cb67f18 00000000ba712000
> (XEN) 0000000244dbd6f0 0000000417a0e025 ffff83042cb67f08 ffff8300ba712000
> (XEN) ffff88011a98f6f0 0000000417a0e025 0000000000000000 0000000417a0e025
> (XEN) 00007cfbd34980c7 ffff82d0802248db ffffffff8100102a 0000000000000001
> (XEN) 0000000001e097f8 0000000001dc2010 0000000001dc77e0 0000000000000000
> (XEN) ffff880026625c98 00000000000006f0 0000000000000246 0000000000007ff0
> (XEN) ffffea00044a41dc 0000000000000000 0000000000000001 ffffffff8100102a
> (XEN) 0000000000000000 0000000000000001 ffff880026625c60 0001010000000000
> (XEN) ffffffff8100102a 000000000000e033 0000000000000246 ffff880026625c48
> (XEN) 000000000000e02b ffffffffffffbeef ffffffffffffbeef ffffffffffffbeef
> (XEN) ffffffffffffbeef ffffffff00000006 ffff8300ba712000 00000033ac85d080
> (XEN) Xen call trace:
> (XEN) [<ffff82d08016187b>] map_domain_page+0x1fb/0x4af
> (XEN) [<ffff82d08017b69f>] do_mmu_update+0x6cb/0x19aa
> (XEN) [<ffff82d0802248db>] syscall_enter+0xeb/0x145
> (XEN)
> (XEN)
> (XEN) ****************************************
> (XEN) Panic on CPU 6:
> (XEN) Assertion 'l1e_get_pfn(MAPCACHE_L1ENT(idx)) == mfn' failed at
> domain_page.c:94
> (XEN) ****************************************
This second report provides more information than the first, and
makes clear that the assertion did indeed catch some (earlier)
corruption. The relevant piece of code from map_domain_page(),
annotated with the actual register values, is:
FFFF82D08016183A  mov esi, r14d             ; R14 = 00000012
FFFF82D08016183D  shl rsi, 0C               ; RSI = 00012000
FFFF82D080161841  mov rdx, FFFF820040000000
FFFF82D08016184B  add rsi, rdx              ; RSI = FFFF820040012000
FFFF82D08016184E  shl rsi, 10               ; RSI = 8200400120000000
FFFF82D080161852  shr rsi, 19               ; RSI = 4100200090
FFFF82D080161856  mov rdx, 000FFFFFFFFFF000
FFFF82D080161860  mov rcx, FFFF810000000000 ; LINEAR_PT_VIRT_START
FFFF82D08016186A  and rdx, [rsi+rcx]        ; RSI = 4100200090,
                                            ; RCX = ffff810000000000 -> ffff814100200090
FFFF82D08016186E  shr rdx, 0C
FFFF82D080161872  cmp rax, rdx              ; RAX = 00244dbd, RDX = f820060006
                                            ; (dcache->garbage = FFFF820060006000)
FFFF82D080161875  je  FFFF82D080161AF5
FFFF82D08016187B  *** ud2
In other words, something copied dcache->garbage (a linear
address) into __linear_l1_table[]. Since there is only a single
l1e_write() in domain_page.c that writes anything other than
l1e_empty(), and since that code (judging by the disassembly)
clearly uses nothing but the passed-in value, I cannot see how
this could happen. Yet, with the value being one only ever used
in domain_page.c, it is almost certain that the code here does
something wrong under some specific condition.
The first crash, being on a different CPU, performs an unmap of
the exact same MFN that is being mapped above, but - due to
being on a different CPU - necessarily uses a different entry
and hence a different slot in the linear L1 table. With _both_
slots corrupted, there must have been more than one bogus
write earlier on.
The only debugging I see possible right now would be to
sanity check the whole involved linear L1 table range on both
entry to and exit from {,un}map_domain_page(). But that
would likely have a severe performance impact, possibly hiding
the problem...
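As a rough illustration of what such a sanity check could look like (a
sketch only: l1e_looks_corrupt(), scan_mapcache_l1() and the comparison
against a max_page bound are assumptions for this example, not Xen code):
a linear address copied into a PTE decodes to a "frame number" far beyond
any real MFN, so a simple bounds check on present entries would catch
exactly the corruption seen here.

```c
#include <stddef.h>
#include <stdint.h>

#define _PAGE_PRESENT 0x1ULL

static uint64_t l1e_get_pfn(uint64_t pte)
{
    return (pte & 0x000FFFFFFFFFF000ULL) >> 12;
}

/* A PTE that had a linear address written into it decodes to a frame
 * number far above any plausible max_page (f820060006 in this crash),
 * so a present entry with an out-of-range pfn is suspicious. */
static int l1e_looks_corrupt(uint64_t pte, uint64_t max_page)
{
    return (pte & _PAGE_PRESENT) && l1e_get_pfn(pte) >= max_page;
}

/* Scan a mapcache L1 entry range; return the first suspicious index,
 * or -1 if every entry looks sane. */
static long scan_mapcache_l1(const uint64_t *l1tab, size_t n,
                             uint64_t max_page)
{
    for (size_t i = 0; i < n; i++)
        if (l1e_looks_corrupt(l1tab[i], max_page))
            return (long)i;
    return -1;
}
```

Running such a scan on entry and exit of {,un}map_domain_page() would
pinpoint the first bogus write, at the performance cost noted above.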
Jan
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel