[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH V4 15/24] xen/arm: Stick around in leave_hypervisor_to_guest until I/O has completed



Hi Oleksandr,

On 12/01/2021 21:52, Oleksandr Tyshchenko wrote:
From: Oleksandr Tyshchenko <oleksandr_tyshchenko@xxxxxxxx>

This patch adds proper handling of return value of
vcpu_ioreq_handle_completion() which involves using a loop in
leave_hypervisor_to_guest().

The reason to use an unbounded loop here is the fact that vCPU shouldn't
continue until the I/O has completed.

The IOREQ code is using wait_on_xen_event_channel(). Yet, this can
still "exit" early if an event has been received. But this doesn't mean
the I/O has completed (in can be just a spurious wake-up).

While I agree we need the loop, I don't think the reason is correct here. If you receive a spurious event, then the loop in wait_for_io() will catch it.

The only way to get out of that loop is if the I/O has been handled or the state in the IOREQ page is invalid.

In addition to that, handle_hvm_io_completion(), will only return false if the state is invalid or there is vCPI work to do.

So we need
to check if the I/O has completed and wait again if it hasn't (we will
block the vCPU again until an event is received). This loop makes sure
that all the vCPU works are done before we return to the guest.

The call chain below:
check_for_vcpu_work -> vcpu_ioreq_handle_completion -> wait_for_io ->
wait_on_xen_event_channel

The worse that can happen here if the vCPU will never run again
(the I/O will never complete). But, in Xen case, if the I/O never
completes then it most likely means that something went horribly
wrong with the Device Emulator. And it is most likely not safe
to continue. So letting the vCPU to spin forever if the I/O never
completes is a safer action than letting it continue and leaving
the guest in unclear state and is the best what we can do for now.

Please note, using this loop we will not spin forever on a pCPU,
preventing any other vCPUs from being scheduled. At every loop
we will call check_for_pcpu_work() that will process pending
softirqs. In case of failure, the guest will crash and the vCPU
will be unscheduled. In normal case, if the rescheduling is necessary
(might be set by a timer or by a caller in check_for_vcpu_work(),
where wait_for_io() is a preemption point) the vCPU will be rescheduled
to give place to someone else.

What you describe here is a bug that was introduced by this series. If you think the code requires a separate patch, then please split off patch #14 so the code callling vcpu_ioreq_handle_completion() happen here.

Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@xxxxxxxx>


CC: Julien Grall <julien.grall@xxxxxxx>
[On Arm only]
Tested-by: Wei Chen <Wei.Chen@xxxxxxx>

---
Please note, this is a split/cleanup/hardening of Julien's PoC:
"Add support for Guest IO forwarding to a device emulator"

Changes V1 -> V2:
    - new patch, changes were derived from (+ new explanation):
      arm/ioreq: Introduce arch specific bits for IOREQ/DM features

Changes V2 -> V3:
    - update patch description

Changes V3 -> V4:
    - update patch description and comment in code
---
  xen/arch/arm/traps.c | 38 +++++++++++++++++++++++++++++++++-----
  1 file changed, 33 insertions(+), 5 deletions(-)

diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c
index 036b13f..4a83e1e 100644
--- a/xen/arch/arm/traps.c
+++ b/xen/arch/arm/traps.c
@@ -2257,18 +2257,23 @@ static void check_for_pcpu_work(void)
   * Process pending work for the vCPU. Any call should be fast or
   * implement preemption.
   */
-static void check_for_vcpu_work(void)
+static bool check_for_vcpu_work(void)
  {
      struct vcpu *v = current;
#ifdef CONFIG_IOREQ_SERVER
+    bool handled;
+
      local_irq_enable();
-    vcpu_ioreq_handle_completion(v);
+    handled = vcpu_ioreq_handle_completion(v);
      local_irq_disable();
+
+    if ( !handled )
+        return true;
  #endif
if ( likely(!v->arch.need_flush_to_ram) )
-        return;
+        return false;
/*
       * Give a chance for the pCPU to process work before handling the vCPU
@@ -2279,6 +2284,8 @@ static void check_for_vcpu_work(void)
      local_irq_enable();
      p2m_flush_vm(v);
      local_irq_disable();
+
+    return false;
  }
/*
@@ -2291,8 +2298,29 @@ void leave_hypervisor_to_guest(void)
  {
      local_irq_disable();
- check_for_vcpu_work();
-    check_for_pcpu_work();
+    /*
+     * The reason to use an unbounded loop here is the fact that vCPU
+     * shouldn't continue until the I/O has completed.
+     *
+     * The worse that can happen here if the vCPU will never run again
+     * (the I/O will never complete). But, in Xen case, if the I/O never
+     * completes then it most likely means that something went horribly
+     * wrong with the Device Emulator. And it is most likely not safe
+     * to continue. So letting the vCPU to spin forever if the I/O never
+     * completes is a safer action than letting it continue and leaving
+     * the guest in unclear state and is the best what we can do for now.
+     *
+     * Please note, using this loop we will not spin forever on a pCPU,
+     * preventing any other vCPUs from being scheduled. At every loop
+     * we will call check_for_pcpu_work() that will process pending
+     * softirqs. In case of failure, the guest will crash and the vCPU
+     * will be unscheduled. In normal case, if the rescheduling is necessary
+     * (might be set by a timer or by a caller in check_for_vcpu_work(),
+     * the vCPU will be rescheduled to give place to someone else.

TBH, I think this comment is a bit too much and sort of out of context because this describing the inner implementation of check_for_vcpu_work().

How about the following:

/*
 * check_for_vcpu_work() may return true if there are more work to
 * before the vCPU can safely resume. This gives us an opportunity
 * to deschedule the vCPU if needed.
 */

+     */
+    do {
+        check_for_pcpu_work();
+    } while ( check_for_vcpu_work() );

So there are two important changes in this new implementation:
1) Without CONFIG_IOREQ_SERVER=y, we will call check_for_pcpu_work() twice in a row when handling set/way. 2) After handling the pCPU work, we will now return to the guest directly. Before, we gave another opportunity for Xen to schedule a different work. This means, we may return to the vCPU for a very short time and will introduce more overhead.

So I would rework the loop to write it as:

while ( check_for_pcpu_work() )
   check_for_pcpu_work();
check_for_pcpu_work();

vgic_sync_to_lrs();

Cheers,

--
Julien Grall



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.