
[Xen-devel] L1[0x1fb] = 0000000000000000 which faults on one type of machine but on another works?



I am troubleshooting an issue where the Linux kernel tries
to dereference a not-present page table entry. I have a fix for
this in for-2.6.32/bug-fixes, but please read on.

Specifically, it tries to dereference the fixmapped address of
APIC_BASE. The fixmap entry for APIC_BASE is actually not set
due to git commit a1d8e2fa8325064338b2da1bcf0d7a0473883c284
which adds this in arch/x86/kernel/acpi/boot.c:

static void __init acpi_register_lapic_address(unsigned long address)
{
        /* Xen dom0 doesn't have usable lapics */
        if (xen_initial_domain())
                return;

        mp_lapic_addr = address;

        set_fixmap_nocache(FIX_APIC_BASE, address);

Later on we use 'native_apic_read', which reads through APIC_BASE as
an address (it is presumed to be at slot FIX_APIC_BASE of the fixmap),
and it fails (on some machines).
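
To make the failing access concrete: a memory-mapped APIC register read
amounts to a load from the mapped base plus the register offset. Below is
a minimal userspace sketch of that address arithmetic only, not the kernel
code. 'apic_reg_addr' and 'APIC_ID_OFFSET' are hypothetical names I made up
for illustration; 0x20 is the architectural offset of the local APIC ID
register, and the base I plug in is the one from the boot log further down.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural offset of the local APIC ID register (name is hypothetical). */
#define APIC_ID_OFFSET 0x20u

/*
 * Hypothetical helper: the virtual address a register read would touch,
 * given the virtual address the APIC page is assumed to be mapped at.
 * The APIC register window is a single 4 KiB page.
 */
static uint64_t apic_reg_addr(uint64_t apic_base_virt, uint32_t reg)
{
    assert(reg < 0x1000u);
    return apic_base_virt + reg;
}
```

With base ffffffffff5fb000 and offset 0x20 this gives ffffffffff5fb020,
which is the cr2 value in the crash below.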

Since we don't call 'set_fixmap_nocache(FIX_APIC_BASE)', this is
what we get if one walks the pagetable:


[    0.000000] SMP: Allowing 1 CPUs, 0 hotplug CPUs
[    0.000000] mapped APIC to ffffffffff5fb000 (00000000)
(XEN) d0:v0: unhandled page fault (ec=0000)
(XEN) Pagetable walk from ffffffffff5fb020:
(XEN)  L4[0x1ff] = 0000000221003067 0000000000001003
(XEN)  L3[0x1ff] = 0000000221004067 0000000000001004
(XEN)  L2[0x1fa] = 0000000221771067 0000000000001771 
(XEN)  L1[0x1fb] = 0000000000000000 ffffffffffffffff
(XEN) domain_crash_sync called from entry.S
(XEN) Domain 0 (vcpu#0) crashed on cpu#0:
(XEN) ----[ Xen-4.1-110309  x86_64  debug=y  Tainted:    C ]----
(XEN) CPU:    0
(XEN) RIP:    e033:[<ffffffff8102b5d1>]
(XEN) RFLAGS: 0000000000000292   EM: 1   CONTEXT: pv guest
(XEN) rax: ffffffff8164cf50   rbx: 000000026ec00000   rcx: 00000000ffffdd85
(XEN) rdx: 00000000ffffffff   rsi: 0000000000000000   rdi: 0000000000000020
(XEN) rbp: ffffffff81643ea8   rsp: ffffffff81643e50   r8:  0000000000000002
(XEN) r9:  0000000000000000   r10: 0000000000000000   r11: 0000000000000000
(XEN) r12: ffff880013671800   r13: 00000000bff66000   r14: ffffffffffffffff
(XEN) r15: 0000000000000000   cr0: 000000008005003b   cr4: 00000000000006f0
(XEN) cr3: 0000000221001000   cr2: ffffffffff5fb020
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e02b   cs: e033
(XEN) Guest stack trace from rsp=ffffffff81643e50:

Which is to say that the L1 has this:
0000000115771fa0:  00000000 00000000 00000000 00000000
0000000115771fb0:  00000000 00000000 00000000 00000000
0000000115771fc0:  00000000 00000000 15770067 80100001
0000000115771fd0:  15770067 80100001 00000000 00000000
0000000115771fe0:  00000000 00000000 00000000 00000000
0000000115771ff0:  00000000 00000000 00000000 00000000

L1[0x1fb] is machine address 115771fd8, which has nothing in it.
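
For anyone who wants to double check the walk by hand, here is a small
sketch of the x86_64 4-level index arithmetic (9 bits per level, 4 KiB
pages, 8-byte entries, bit 0 = present). The function and macro names are
mine and purely illustrative; they are not taken from Xen or Linux.

```c
#include <assert.h>
#include <stdint.h>

#define PT_ENTRY_BITS 9       /* 512 entries per table level */
#define PT_PAGE_SHIFT 12      /* 4 KiB pages */
#define PT_PRESENT    0x1ull  /* bit 0 of an entry: present */

/* Index into the level-N table (1 = L1 ... 4 = L4) for a virtual address. */
static unsigned pt_index(uint64_t vaddr, int level)
{
    assert(level >= 1 && level <= 4);
    return (unsigned)((vaddr >> (PT_PAGE_SHIFT + PT_ENTRY_BITS * (level - 1)))
                      & 0x1ff);
}

/* Byte offset of an entry within its table; entries are 8 bytes wide. */
static uint64_t pt_entry_offset(unsigned index)
{
    return (uint64_t)index * 8;
}

/* An entry with bit 0 clear is not present; an access through it faults. */
static int pt_entry_present(uint64_t entry)
{
    return (entry & PT_PRESENT) != 0;
}
```

For ffffffffff5fb020 this gives L4=0x1ff, L3=0x1ff, L2=0x1fa, L1=0x1fb,
the same indexes Xen prints. The L1 entry sits at byte offset 0xfd8 in its
table, which is where the ...fd8 address above comes from, and an entry of
0000000000000000 has the present bit clear.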

OK, so I've come up with a fix that is a back-port of how 2.6.38 does it:
it removes the check I mentioned above, and in xen_set_fixmap we set
FIX_APIC_BASE to actually point to a dummy ioapic_mapping.
It is 7cb068cf1ba90425e12f3a7b3caed9d018fa9b8c in for-2.6.32/bug-fixes.

Gianni, you might want to check this out in case it fixes the problem you
are experiencing.

But one thing I can't understand is why I get this crash on one machine
(IBM x3850), while on another one with the same pagetable contents
(L1 has nothing for 0x1fb) it works just fine. I added a panic and used
the Xen hypervisor kdb to manually inspect the pagetable on the working
machine, and it has the same contents as on the IBM x3850, yet it boots
fine with this invalid entry.
Any ideas?


FYI, it seems another user (Sven Sübert) with an IBM x3650 hits the same
bug, and with this fix he is able to boot.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 

