|
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] [Xen-devel] Possible issue with x86_emulate when writing results back to memory
We've been seeing very infrequent crashes in Windows VMs for a while where it
appears that the top-byte in a longword that is supposed to hold a pointer is
being set to zero - for example:
BUG CHECK 0000000A (00B49A40 00000002 00000001 8261092E)
The first parameter to keBugCheck is the faulting address - 00B49A40 in this
case.
When we look at the dump, we are in a routine releasing a queued spinlock and
the correct address that should have been used was 0xA3B49A40 and indeed memory
contents in the windows dump have this value. Looking around some more, we see
that the failing processor is executing the release queued spinlock code and
another processor is executing the code to acquire the same queued spinlock and
has recently written the 0xA3B49A40 value to the location where the failing
instruction stream read it from.
If we look at the disassembly for the two code paths, the writing code does:
mov dword ptr [edx],eax
and the reading code does the following to read this same value:
mov ebx,dword ptr [eax]
On a hunch that this might be a problem with the x86_emulate code, I took a
look at how the mov instruction would be emulated - in both cases where
emulation can be done (hvm/emulate.c and mm/shadow/multi.c), the routines that
write instruction results back to memory use memcpy() to actually copy the
data. Looking at the implementation of memcpy in Xen, I see that, in a 64-bit
build as ours is, it will use 'rep movsq' to move the data in quadwords and
then uses 'rep movsb' to move the last 1-7 bytes -- so the instructions above
will, I think, always use byte instructions for the four bytes.
Now, according to the X86 arch, 32-bit mov's are supposed to be atomic but
based on the above they will not be and I am speculating that this is the cause
of our occasional crash - the code path unlocking the spin lock on the other
processor sees a partially written value in memory.
Does this seem a reasonable explanation of the issue?
On the assumption that this is correct, I developed the attached patch (against
4.3.1) which updates all the code paths that are used to read and writeback the
results of instruction emulation to use a simple assignment if the length is 2
or 4 bytes -- this doesn't fix the general case where you have a length > 8 but
it does handle emulation of MOV instructions. Unfortunately, the use of
emulation in the HVM code uses a generic routine for copying memory to the
guest so every single place that guest memory is read or written will pay the
penalty of the extra check for length - not sure if that is terrible or not.
Since doing this we have not seen a single instance of the crash - but it's
only been a month!
The attached patch is for discussion purposes only - if it is deemed acceptable
I'll resubmit a proper patch request against unstable.
Simon Graham
Citrix Systems, Inc
Attachment:
fix-memcpy-in-x86-emulate _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx http://lists.xen.org/xen-devel
|
![]() |
Lists.xenproject.org is hosted with RackSpace, monitoring our |