[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v3 for-4.14] x86/monitor: revert default behavior when monitoring register write events



On Tue, Jun 02, 2020 at 07:49:09AM -0600, Tamas K Lengyel wrote:
> For the last couple years we have received numerous reports from users of
> monitor vm_events of spurious guest crashes when using events. In particular,
> it has observed that the problem occurs when vm_events are being disabled. The
> nature of the guest crash varied widely and has only occured occasionally. 
> This
> made debugging the issue particularly hard. We had discussions about this 
> issue
> even here on the xen-devel mailinglist with no luck figuring it out.
> 
> The bug has now been identified as a race-condition between register event
> handling and disabling the monitor vm_event interface.
> 
> Patch 96760e2fba100d694300a81baddb5740e0f8c0ee, "vm_event: deny register 
> writes
> if refused by  vm_event reply" is the patch that introduced the error. In this

FWIW, we use the 'Fixes:' tag in order to make it easier for
maintainers of stable trees to know which bugfixes to pick. This
should have:

Fixes: 96760e2fba10 ('vm_event: deny register writes if refused by vm_event 
reply')

Before the SoB.

> patch the default behavior regarding emulation of register write events is
> changed so that they get postponed until the corresponding vm_event handler
> decides whether to allow such write to take place. Unfortunately this can only
> be implemented by performing the deny/allow step when the vCPU gets scheduled.
> Due to that postponed emulation of the event if the user decides to pause the
> VM in the vm_event handler and then disable events, the entire emulation step
> is skipped the next time the vCPU is resumed. Even if the user doesn't pause
> during the vm_event handling but exits immediately and disables vm_event, the
> situation becomes racey as disabling vm_event may succeed before the guest's
> vCPUs get scheduled with the pending emulation task. This has been 
> particularly
> the case with VMS that have several vCPUs as after the VM is unpaused it may
> actually take a long time before all vCPUs get scheduled.
> 
> In this patch we are reverting the default behavior to always perform 
> emulation
> of register write events when the event occurs. To postpone them can be turned
> on as an option. In that case the user of the interface still has to take care
> of only disabling the interface when its safe as it remains buggy.
> 
> Signed-off-by: Tamas K Lengyel <tamas@xxxxxxxxxxxxx>

Thanks for taking care of this!

> ---
>  xen/arch/x86/hvm/hvm.c            | 14 ++++++++------
>  xen/arch/x86/hvm/monitor.c        | 13 ++++++++-----
>  xen/arch/x86/monitor.c            | 10 +++++++++-
>  xen/include/asm-x86/domain.h      |  1 +
>  xen/include/asm-x86/hvm/monitor.h |  7 +++----
>  xen/include/public/domctl.h       |  1 +
>  6 files changed, 30 insertions(+), 16 deletions(-)
> 
> diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> index 74c9f84462..5bb47583b3 100644
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -3601,13 +3601,15 @@ int hvm_msr_write_intercept(unsigned int msr, 
> uint64_t msr_content,
>  
>          ASSERT(v->arch.vm_event);
>  
> -        /* The actual write will occur in hvm_do_resume() (if permitted). */
> -        v->arch.vm_event->write_data.do_write.msr = 1;
> -        v->arch.vm_event->write_data.msr = msr;
> -        v->arch.vm_event->write_data.value = msr_content;
> +        if ( hvm_monitor_msr(msr, msr_content, msr_old_content) )
> +        {
> +            /* The actual write will occur in hvm_do_resume(), if permitted. 
> */
> +            v->arch.vm_event->write_data.do_write.msr = 1;
> +            v->arch.vm_event->write_data.msr = msr;
> +            v->arch.vm_event->write_data.value = msr_content;
>  
> -        hvm_monitor_msr(msr, msr_content, msr_old_content);
> -        return X86EMUL_OKAY;
> +            return X86EMUL_OKAY;
> +        }
>      }
>  
>      if ( (ret = guest_wrmsr(v, msr, msr_content)) != X86EMUL_UNHANDLEABLE )
> diff --git a/xen/arch/x86/hvm/monitor.c b/xen/arch/x86/hvm/monitor.c
> index 8aa14137e2..36894b33a4 100644
> --- a/xen/arch/x86/hvm/monitor.c
> +++ b/xen/arch/x86/hvm/monitor.c
> @@ -53,11 +53,11 @@ bool hvm_monitor_cr(unsigned int index, unsigned long 
> value, unsigned long old)
>              .u.write_ctrlreg.old_value = old
>          };
>  
> -        if ( monitor_traps(curr, sync, &req) >= 0 )
> -            return 1;
> +        return monitor_traps(curr, sync, &req) >= 0 &&
> +            curr->domain->arch.monitor.control_register_values;

Nit (it can be fixed while committing IMO): curr should be aligned to
monitor.

>      }
>  
> -    return 0;
> +    return false;
>  }
>  
>  bool hvm_monitor_emul_unimplemented(void)
> @@ -77,7 +77,7 @@ bool hvm_monitor_emul_unimplemented(void)
>          monitor_traps(curr, true, &req) == 1;
>  }
>  
> -void hvm_monitor_msr(unsigned int msr, uint64_t new_value, uint64_t 
> old_value)
> +bool hvm_monitor_msr(unsigned int msr, uint64_t new_value, uint64_t 
> old_value)
>  {
>      struct vcpu *curr = current;
>  
> @@ -92,8 +92,11 @@ void hvm_monitor_msr(unsigned int msr, uint64_t new_value, 
> uint64_t old_value)
>              .u.mov_to_msr.old_value = old_value
>          };
>  
> -        monitor_traps(curr, 1, &req);
> +        return monitor_traps(curr, 1, &req) >= 0 &&
> +            curr->domain->arch.monitor.control_register_values;

Same here.

>      }
> +
> +    return false;
>  }
>  
>  void hvm_monitor_descriptor_access(uint64_t exit_info,
> diff --git a/xen/arch/x86/monitor.c b/xen/arch/x86/monitor.c
> index bbcb7536c7..1517a97f50 100644
> --- a/xen/arch/x86/monitor.c
> +++ b/xen/arch/x86/monitor.c
> @@ -144,7 +144,15 @@ int arch_monitor_domctl_event(struct domain *d,
>                                struct xen_domctl_monitor_op *mop)
>  {
>      struct arch_domain *ad = &d->arch;
> -    bool requested_status = (XEN_DOMCTL_MONITOR_OP_ENABLE == mop->op);
> +    bool requested_status;
> +
> +    if ( XEN_DOMCTL_MONITOR_OP_CONTROL_REGISTERS == mop->op )
> +    {
> +        ad->monitor.control_register_values = true;
> +        return 0;

I think this would be better implemented in arch_monitor_domctl_op
which already handles other XEN_DOMCTL_MONITOR_OP_* options, and also
skips the arch_monitor_domctl_event call?

> +    }
> +
> +    requested_status = (XEN_DOMCTL_MONITOR_OP_ENABLE == mop->op);
>  
>      switch ( mop->event )
>      {
> diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
> index e8cee3d5e5..6fd94c2e14 100644
> --- a/xen/include/asm-x86/domain.h
> +++ b/xen/include/asm-x86/domain.h
> @@ -418,6 +418,7 @@ struct arch_domain
>           * This is used to filter out pagefaults.
>           */
>          unsigned int inguest_pagefault_disabled                            : 
> 1;
> +        unsigned int control_register_values                               : 
> 1;
>          struct monitor_msr_bitmap *msr_bitmap;
>          uint64_t write_ctrlreg_mask[4];
>      } monitor;
> diff --git a/xen/include/asm-x86/hvm/monitor.h 
> b/xen/include/asm-x86/hvm/monitor.h
> index 66de24cb75..a75cd8545c 100644
> --- a/xen/include/asm-x86/hvm/monitor.h
> +++ b/xen/include/asm-x86/hvm/monitor.h
> @@ -29,15 +29,14 @@ enum hvm_monitor_debug_type
>  };
>  
>  /*
> - * Called for current VCPU on crX/MSR changes by guest.
> - * The event might not fire if the client has subscribed to it in 
> onchangeonly
> - * mode, hence the bool return type for control register write events.
> + * Called for current VCPU on crX/MSR changes by guest. Bool return signals
> + * whether emulation should be postponed.
>   */
>  bool hvm_monitor_cr(unsigned int index, unsigned long value,
>                      unsigned long old);
>  #define hvm_monitor_crX(cr, new, old) \
>                          hvm_monitor_cr(VM_EVENT_X86_##cr, new, old)
> -void hvm_monitor_msr(unsigned int msr, uint64_t value, uint64_t old_value);
> +bool hvm_monitor_msr(unsigned int msr, uint64_t value, uint64_t old_value);
>  void hvm_monitor_descriptor_access(uint64_t exit_info,
>                                     uint64_t vmx_exit_qualification,
>                                     uint8_t descriptor, bool is_write);
> diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
> index 1ad34c35eb..cbcd25f12c 100644
> --- a/xen/include/public/domctl.h
> +++ b/xen/include/public/domctl.h
> @@ -1025,6 +1025,7 @@ struct xen_domctl_psr_cmt_op {
>  #define XEN_DOMCTL_MONITOR_OP_DISABLE           1
>  #define XEN_DOMCTL_MONITOR_OP_GET_CAPABILITIES  2
>  #define XEN_DOMCTL_MONITOR_OP_EMULATE_EACH_REP  3
> +#define XEN_DOMCTL_MONITOR_OP_CONTROL_REGISTERS 4

Could you please add a note that this is broken?

Thanks, Roger.



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.