
Re: [Xen-devel] Xen-4.3 and -unstable regression from changeset "numa-sched: leave node-affinity alone if not in 'auto' mode"



On 02/12/13 14:36, Jan Beulich wrote:
>>>> On 02.12.13 at 15:01, Andrew Cooper <andrew.cooper3@xxxxxxxxxx> wrote:
>> After some more investigation, this is not a regression at all, although
>> the patch is directly relevant to identifying the problem.
>>
>> PXELINUX 4.04 2011-04-18  Copyright (C) 1994-2011 H. Peter Anvin et al
>> boot:
>> Loading xenrt/xen-minnow.gz... ok
>> Loading xenrt/vmlinuz... ok
>> After multiboot magic check
>> Opcode from 0x105fef: 97 0e 00 00 49 8d be b0
>> Before lret into trampoline
>> Opcode from 0x105fef: 97 0e 00 00 49 8d be b0
>> After (failed) conditional jmp to start_secondary
>> Opcode from 0xffff830000105fef: 97 0e 86 00 49 8d be b0
>>  __  __            _  _    _____  _
>>  \ \/ /___ _ __   | || |  |___ / / |
>>   \  // _ \ '_ \  | || |_   |_ \ | |
>>
>>
>> Something between entering the trampoline and emerging in 64bit mode is
>> corrupting a single byte at phys 0x105ff1 from its correct value to a
>> value of 0x86.
>>
>> The corruption disappears if the "no-real-mode" option is used.
> And I'd say the primary suspect is
>
>         /*
>          * Declare that our target operating mode is long mode.
>          * Initialise 32-bit registers since some buggy BIOSes depend on it.
>          */
>         movl    $0xec00,%eax      # declare target operating mode
>         movl    $0x0002,%ebx      # long mode
>         int     $0x15
>
> considering that 0x86 is a relatively common "function not
> implemented" return code for BIOS functions, notably INT 15 ones.
>
> As a possible workaround I'd consider trying
> a) zeroing %esp rather than just %sp a few lines up from the
> above quoted code
> b) zeroing the high halves of all registers
>
> Jan
>

Your suspicion is entirely correct.  I have positively identified
this `int $0x15` call as corrupting the memory.  The byte is fine
immediately before the call and bad immediately afterwards.

I have further confirmed that zeroing all 32 bits of the GPRs before
entering the interrupt fixes the issue.

In an attempt to understand what is going on, I added more debugging
to dump the entire register/selector state before and after the call,
to see whether anything looked like a smoking gun.

(XEN) Pre-state:
(XEN) eax 00007600 ebx 00000000 ecx 00000000 edx 00007600
(XEN) esi 0028b0c4 edi 00078a80 esp 00080000 ebp 00000000
(XEN) cs 7600 ds 7600 es 7600 fs 0028 gs 0028 ss 7600

If the GPRs are left as they are, the post-state looks like:

(XEN) Post-state:
(XEN) eax 00008600 ebx 00000000 ecx 00000000 edx 00007600
(XEN) esi 0028b0c4 edi 00078a70 esp 00080000 ebp 00000000
(XEN) cs 7600 ds 7600 es 7600 fs 0028 gs 0028 ss 7600

If the GPRs are zeroed as much as possible, the post-state looks like:

(XEN) Post-state:
(XEN) eax 00008600 ebx 00000000 ecx 00000000 edx 00000000
(XEN) esi 00000000 edi 00000000 esp 00000000 ebp 00000000
(XEN) cs 7600 ds 7600 es 7600 fs 0028 gs 0028 ss 7600

In both cases, the carry flag is set, which is consistent with the
return value of 0x86 in %ah.


I iterated through the registers and proved that %esp specifically
was the problem.

I shall submit a patch against trampoline.S shortly.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

