[Xen-devel] L1[0x1fb] = 0000000000000000 which faults on one type of machine but on another works?
I am troubleshooting an issue where the Linux kernel tries to dereference a not-present entry. I have a fix for this in for-2.6.32/bug-fixes .. but please read on.

Specifically, it tries to dereference the fixmapped value of APIC_BASE. The fixmapped value of APIC_BASE is actually not set, due to git commit a1d8e2fa8325064338b2da1bcf0d7a0473883c284, which adds this in arch/x86/kernel/acpi/boot.c:

    static void __init acpi_register_lapic_address(unsigned long address)
    {
            /* Xen dom0 doesn't have usable lapics */
            if (xen_initial_domain())
                    return;

            mp_lapic_addr = address;
            set_fixmap_nocache(FIX_APIC_BASE, address);

Later on we use 'native_apic_read', which tries to use APIC_BASE as the address (it is presumed to be at slot FIX_APIC_BASE of the fixmap API), and it fails (on some machines). Since we never call 'set_fixmap_nocache(FIX_APIC_BASE)', this is what we get if we walk the pagetable:

    [ 0.000000] SMP: Allowing 1 CPUs, 0 hotplug CPUs
    [ 0.000000] mapped APIC to ffffffffff5fb000 (00000000)
    (XEN) d0:v0: unhandled page fault (ec=0000)
    (XEN) Pagetable walk from ffffffffff5fb020:
    (XEN)  L4[0x1ff] = 0000000221003067 0000000000001003
    (XEN)  L3[0x1ff] = 0000000221004067 0000000000001004
    (XEN)  L2[0x1fa] = 0000000221771067 0000000000001771
    (XEN)  L1[0x1fb] = 0000000000000000 ffffffffffffffff
    (XEN) domain_crash_sync called from entry.S
    (XEN) Domain 0 (vcpu#0) crashed on cpu#0:
    (XEN) ----[ Xen-4.1-110309 x86_64 debug=y Tainted: C ]----
    (XEN) CPU: 0
    (XEN) RIP: e033:[<ffffffff8102b5d1>]
    (XEN) RFLAGS: 0000000000000292 EM: 1 CONTEXT: pv guest
    (XEN) rax: ffffffff8164cf50 rbx: 000000026ec00000 rcx: 00000000ffffdd85
    (XEN) rdx: 00000000ffffffff rsi: 0000000000000000 rdi: 0000000000000020
    (XEN) rbp: ffffffff81643ea8 rsp: ffffffff81643e50 r8: 0000000000000002
    (XEN) r9: 0000000000000000 r10: 0000000000000000 r11: 0000000000000000
    (XEN) r12: ffff880013671800 r13: 00000000bff66000 r14: ffffffffffffffff
    (XEN) r15: 0000000000000000 cr0: 000000008005003b cr4: 00000000000006f0
    (XEN) cr3: 0000000221001000 cr2: ffffffffff5fb020
    (XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: e02b cs: e033
    (XEN) Guest stack trace from rsp=ffffffff81643e50:

Which is to say that the L1 has this:

    0000000115771fa0: 00000000 00000000 00000000 00000000
    0000000115771fb0: 00000000 00000000 00000000 00000000
    0000000115771fc0: 00000000 00000000 15770067 80100001
    0000000115771fd0: 15770067 80100001 00000000 00000000
    0000000115771fe0: 00000000 00000000 00000000 00000000
    0000000115771ff0: 00000000 00000000 00000000 00000000

L1[0x1fb] is machine address 115771fd8, which has nothing in it.

OK, so I've come up with a fix that is a back-port of how 2.6.38 does it: it removes the check I mentioned above, and in xen_set_fixmap we set FIX_APIC_BASE to actually point to a dummy ioapic_mapping. It is 7cb068cf1ba90425e12f3a7b3caed9d018fa9b8c in for-2.6.32/bug-fixes.

Gianni, you might want to check this out in case it fixes the problem you are experiencing.

But one thing I can't understand is why I get this crash on one machine (IBM x3850), while on another one with the same pagetable contents (L1 has nothing for 0x1fb) it works just fine. I added a panic and used the Xen hypervisor kdb to manually inspect the pagetable on that second machine: it has the same contents as on the IBM x3850, but it boots fine with this invalid value. Any ideas?

FYI, it seems another user (Sven Sübert), on an IBM x3650, hits the same bug. With this fix he is able to boot.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
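
For anyone following the pagetable walk by hand, the indices Xen prints can be reproduced from the faulting address in cr2. Below is a minimal sketch in C, assuming standard x86-64 4-level paging with 4 KiB pages (9 index bits per level, 8-byte entries). The helper names are made up for illustration; they are not kernel or Xen functions.

```c
#include <stdint.h>

/*
 * Hypothetical helpers for illustration only (not kernel or Xen APIs).
 * Assumes x86-64 4-level paging with 4 KiB pages.
 */
#define PT_INDEX_BITS  9        /* 512 entries per table */
#define PT_INDEX_MASK  0x1ffULL
#define PT_PAGE_SHIFT  12       /* 4 KiB pages */
#define PT_ENTRY_SIZE  8        /* bytes per pagetable entry */

/* Index into the table at the given level (1 = L1 ... 4 = L4). */
static unsigned pt_index(uint64_t va, int level)
{
    int shift = PT_PAGE_SHIFT + PT_INDEX_BITS * (level - 1);
    return (unsigned)((va >> shift) & PT_INDEX_MASK);
}

/* Byte offset of entry 'index' inside its 4 KiB table. */
static uint64_t pt_entry_offset(unsigned index)
{
    return (uint64_t)index * PT_ENTRY_SIZE;
}

/* Bit 0 of an entry is the Present bit; an all-zero entry is not present. */
static int pte_present(uint64_t pte)
{
    return (int)(pte & 1);
}
```

For va = 0xffffffffff5fb020 this gives L4 = 0x1ff, L3 = 0x1ff, L2 = 0x1fa and L1 = 0x1fb, matching the walk in the log. pt_entry_offset(0x1fb) is 0xfd8, which is where the 115771fd8 address in the L1 dump comes from, and pte_present() on the zero entry found there returns 0.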