
Re: [Xen-devel] Issues with the xen-access.c model



On 04/05/2016 07:10 PM, Andrew Cooper wrote:
> On 05/04/16 16:55, Razvan Cojocaru wrote:
>> On 04/05/16 18:35, Tamas K Lengyel wrote:
>>>
>>> On Tue, Apr 5, 2016 at 6:13 AM, Andrew Cooper <andrew.cooper3@xxxxxxxxxx
>>> <mailto:andrew.cooper3@xxxxxxxxxx>> wrote:
>>>
>>>     On 05/04/16 11:55, Razvan Cojocaru wrote:
>>>     > Hello,
>>>     >
>>>     > xen-access.c does roughly this:
>>>     >
>>>     > for (;;) {
>>>     >       poll_evt_channel();
>>>     >
>>>     >       if (new_events) {
>>>     >               while (ring_buffer_has_requests) {
>>>     >                       pull_request();
>>>     >                       process_request();
>>>     >                       put_response();
>>>     >               }
>>>     >
>>>     >               signal_evt_channel();
>>>     >       }
>>>     > }
>>>     >
>>>     > The problems are these:
>>>     >
>>>     > 1. vm_event_put_request() does notify_via_xen_event_channel(d,
>>>     > ved->xen_port); at the very end, and we seem to be using a FIFO event
>>>     > channel (if I'm reading the code right, it's generic event channel
>>>     > code that pre-dates the vm_event code that sits on top of it).
>>>     >
>>>     > This means that for a guest that's running two VCPUs, the following
>>>     > scenario is possible:
>>>     >
>>>     > * VCPU1 puts an event in the ring buffer
>>>     > * VCPU1 signals the event channel
>>>     > * poll_evt_channel() wakes up
>>>     > * VCPU2 puts an event in the ring buffer
>>>     > * the loop processes the first event
>>>     > * VCPU2 signals the event channel
>>>     > * the loop processes the second event (since there _are_ more requests
>>>     > in the ring buffer)
>>>     > * the loop signals the event channel that requests have been processed
>>>     > * poll_evt_channel() wakes up again, because of the second event
>>>     > channel signal, but this is pointless since all events have already
>>>     > been processed
>>>     >
>>>     > This can be avoided by counting the requests processed vs. the
>>>     > number of successful poll()s, but it still does not solve the
>>>     > second problem:
>>>     >
>>>     > 2. Race conditions. At least in theory, there's nothing to stop the
>>>     > second iteration of the request processing loop from reading a
>>>     > partially written request from the ring buffer, or garbage left over
>>>     > from a previous request, if it hits it "just right" (i.e. the "VCPU2
>>>     > puts an event in the ring buffer" part can run on top of the "the
>>>     > loop processes the second event" part).
>>>     >
>>>     > None of this really shows up unless we're in heavy processing
>>>     > scenarios, but these problems can occur, especially in synced
>>>     > vm_event scenarios.
>>>     >
>>>     > The problems seems to be alleviated by changing the loop to this:
>>>     >
>>>     > for (;;) {
>>>     >       poll_evt_channel();
>>>     >
>>>     >       if (new_events) {
>>>     >               while (ring_buffer_has_requests) {
>>>     >                       pull_request();
>>>     >                       process_request();
>>>     >                       put_response();
>>>     >
>>>     >                       /* Moved here. */
>>>     >                       signal_evt_channel();
>>>
>>>
>>> That was the original setup; I changed it a while back by moving it
>>> outside the loop. My reasoning was that there is no point in going
>>> lockstep with events and issuing a hypercall after each is processed.
>>> With only a handful of events on the ring the context switch is probably
>>> more taxing than having all requests and responses processed in one
>>> swipe. Of course, depending on the application this may not be the
>>> optimal case and the signal can be moved back to the loop (with lots of
>>> vCPUs it may be the case that the first vCPU remains paused while all
>>> events are processed for the other vCPUs, which of course would be bad).
>>>  
>>>
>>>     >               }
>>>     >       }
>>>     > }
>>>     >
>>>     > but it's obviously not foolproof; the main remaining problem is the
>>>     > synchronization of the ring buffer between the hypervisor side and
>>>     > userspace.
>>>     >
>>>     > How should we proceed to fix this?
>>>
>>>     The vm_event code in Xen must increment the producer index as the final
>>>     action of putting a request on the ring.  If it doesn't then it is
>>>     simply broken and needs fixing.
>>>
>>>     This will ensure that the consumer never sees a partial request on the
>>>     ring; by the time it observes a newer producer index, all data is
>>>     already written.
>>>
>>>
>>> It may be worth double-checking, but I'm pretty sure that's how it works
>>> right now, as that part of the code has pretty much just been recycled
>>> from the other Xen ring implementations.
>> I understand. We shouldn't forget that the consumer also modifies the
>> ring in userspace, and the hypervisor then becomes a consumer of
>> responses, and the same issues apply to it (would compiler optimizations
>> be a problem here?).
> 
> Quite possibly.  See XSA-155
> 
>> But unless that's the case, I'm probably seeing a different issue and
>> Andrew is right.
>>
>> Thank you both for the prompt replies!
> 
> You need to make certain that for both producer/consumer pairs, the
> following is true:
> 
> The producer and consumer indices are read using some form of atomic
> operation (ACCESS_ONCE() most likely, or read_atomic()), followed by an
> smp_rmb() to guarantee that the read completes and is in a local
> variable before any calculation is performed based on the value.  (This
> prevents the compiler "optimising" to multiple reads of the shared ring).
> 
> Then, memcpy() (or suitable alternative) to get the contents of the
> request in/out of the ring.
> 
> For the producer side, an smp_wmb() between the memcpy() into the ring,
> and the write to the producer index.
> For the consumer side, an smp_mb() between the memcpy() out of the ring,
> and the write to the consumer index.

Did all of that, to no avail - but the issue may indeed be somewhere else.
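
For reference, here is a rough sketch of the ordering described above, as
applied on the userspace side. This is not the actual code: the ring layout,
ACCESS_ONCE() and the barrier macros below are simplified stand-ins for the
real vm_event back ring and the Xen barrier primitives, and overflow and
event-notification handling are omitted.

/* Simplified stand-ins for the real vm_event back ring and barriers. */
#include <stdint.h>
#include <string.h>

#define RING_SIZE 64                                /* assumed power of two */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))
#define smp_rmb() __atomic_thread_fence(__ATOMIC_ACQUIRE)
#define smp_wmb() __atomic_thread_fence(__ATOMIC_RELEASE)
#define smp_mb()  __atomic_thread_fence(__ATOMIC_SEQ_CST)

typedef struct {
    uint32_t req_prod;                 /* written by Xen (request producer) */
    uint32_t req_cons;                 /* written by us (request consumer)  */
    uint32_t rsp_prod;                 /* written by us (response producer) */
    uint8_t  req[RING_SIZE][64];       /* request slots, contents opaque    */
    uint8_t  rsp[RING_SIZE][64];       /* response slots, contents opaque   */
} shared_ring_t;

/* Consumer side: pull one request out of the shared ring, if any. */
static int get_request(shared_ring_t *ring, void *req, size_t len)
{
    /* Read the producer index exactly once ... */
    uint32_t prod = ACCESS_ONCE(ring->req_prod);
    uint32_t cons = ring->req_cons;

    /* ... and make sure it sits in a local variable before it is used. */
    smp_rmb();

    if ( cons == prod )
        return 0;                                   /* ring is empty */

    /* Copy the request out of the shared page into private memory. */
    memcpy(req, ring->req[cons % RING_SIZE], len);

    /* Only advance the consumer index once the copy is done, so the
     * hypervisor cannot overwrite the slot while we're still reading it. */
    smp_mb();
    ACCESS_ONCE(ring->req_cons) = cons + 1;

    return 1;
}

/* Producer side: put one response on the shared ring (no overflow check). */
static void put_response(shared_ring_t *ring, const void *rsp, size_t len)
{
    uint32_t prod = ring->rsp_prod;

    memcpy(ring->rsp[prod % RING_SIZE], rsp, len);

    /* Publish the contents before the new producer index, so Xen can
     * never observe a partially written response. */
    smp_wmb();
    ACCESS_ONCE(ring->rsp_prod) = prod + 1;
}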

What I'm seeing is that on Windows 7 x86 boot, at the point of loading
some drivers (most often around usbport.sys), if I get page faults in
quick succession and try to emulate the offending instruction with code
similar to the xen-access.c "outside-the-loop resume" model, the domU
appears to freeze (although it's still running and nothing crashes). x64
guests seem unaffected by this, and x86 guests almost never hang if I
put the resume code inside the loop.
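
To be clear about what I mean by the resume being "inside" vs. "outside" the
loop, the event loop looks roughly like the sketch below. The libxenevtchn
names are used here (depending on the tree, xen-access.c uses these or the
older xc_evtchn_* equivalents), and the ring helpers are placeholders for the
real vm_event code.

#include <poll.h>
#include <xenevtchn.h>

/* Placeholders for the real vm_event ring handling. */
extern int  ring_has_requests(void);
extern void pull_request(void);
extern void process_request(void);
extern void put_response(void);

static void event_loop(xenevtchn_handle *xce, evtchn_port_t local_port)
{
    struct pollfd fd = { .fd = xenevtchn_fd(xce), .events = POLLIN | POLLERR };

    for ( ;; )
    {
        if ( poll(&fd, 1, 100) <= 0 )
            continue;

        /* Acknowledge the pending event before draining the ring. */
        int port = xenevtchn_pending(xce);
        if ( port < 0 || xenevtchn_unmask(xce, port) )
            break;

        while ( ring_has_requests() )
        {
            pull_request();
            process_request();
            put_response();

            /* "Inside the loop": kick Xen after every response, so the
             * vCPU that produced this request is resumed right away. */
            xenevtchn_notify(xce, local_port);
        }

        /* "Outside the loop" variant: drop the notify above and issue a
         * single xenevtchn_notify() here instead, resuming the waiting
         * vCPUs only after the whole batch has been processed. */
    }
}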

This behaviour is always accompanied by qemu-dm going to 100% CPU for a
few minutes. It eventually subsides, but the guest remains unusable
although it's not dead.

This goes away if I just run the instruction (with a single-step-like
model) inside the hypervisor instead of trying to emulate it, this being
the only change.

Unfortunately this is not trivial to demonstrate with small
modifications to the Xen in-tree code (see the CLI emulation issue, for
example).

Could trying to emulate an instruction from a driver that interacts with
qemu-emulated hardware cause this? No emulation failures are reported, but
the attempt apparently leads to a(n IO?) loop of some kind.


Thanks,
Razvan


 

