Xen project Mailing List

Re: [Xen-devel] RFC: HVM de-privileged mode scheduling considerations

To: George Dunlap <George.Dunlap@xxxxxxxxxxxxx>, Ian Campbell <ian.campbell@xxxxxxxxxx>

From: Ben Catterall <Ben.Catterall@xxxxxxxxxx>

Date: Tue, 11 Aug 2015 11:40:02 +0100

Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Dario Faggioli <dario.faggioli@xxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxx>

Delivery-date: Tue, 11 Aug 2015 10:40:21 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

On 04/08/15 14:46, George Dunlap wrote:

On Mon, Aug 3, 2015 at 3:34 PM, Ian Campbell <ian.campbell@xxxxxxxxxx> wrote:

On Mon, 2015-08-03 at 14:54 +0100, Andrew Cooper wrote:

On 03/08/15 14:35, Ben Catterall wrote:

Hi all,

I am working on an x86 proof-of-concept to evaluate if it is feasible
to move device models and x86 emulation code for HVM guests into a
de-privileged context.

I was hoping to get feedback from relevant maintainers on scheduling
considerations for this system to mitigate potential DoS attacks.

Many thanks in advance,
Ben

This is intended as a proof-of-concept, with the aim of determining if
this idea is feasible within performance constraints.

Motivation
----------
The motivation for moving the device models and x86 emulation code
into ring 3 is to mitigate a system compromise due to a bug in any of
these systems. These systems are currently part of the hypervisor and,
consequently, a bug in any of these could allow an attacker to gain
control of (or perform a DoS on) Xen and/or guests.

Migrating between PCPUs
-----------------------
There is a need to support migration between pcpus so that the
scheduler can still perform this operation. However, there is an issue
to resolve. Currently, I have a per-vcpu copy of the Xen ring 0 stack
up to the point of entering the de-privileged mode. This allows us to
restore this stack and then continue from the entry point when we have
finished in de-privileged mode. There will be per-pcpu data on these
per-vcpu stacks such as saved stack frame pointers for the per-pcpu
stack, smp_processor_id() responses etc.

Therefore, it will be necessary to lock the vcpu to the current pcpu
when it enters this user mode so that it does not wake up on a
different pcpu where such pointers and other data are invalid. We can
do this by setting a hard affinity to the pcpu that the vcpu is
executing on. See common/wait.c which does something similar to what I
am doing.
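
As a rough illustration of the pinning step described above, here is a toy
model in C. It is only a sketch under stated assumptions: struct toy_vcpu,
toy_depriv_enter(), toy_depriv_exit() and toy_can_run_on() are hypothetical
names invented for this example, and the 64-bit mask stands in for a real
cpumask. It is not the code in common/wait.c or any existing Xen interface.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a vcpu; NOT Xen's real struct vcpu. */
struct toy_vcpu {
    unsigned int processor;   /* pcpu the vcpu is currently running on */
    uint64_t hard_affinity;   /* bit n set => vcpu may run on pcpu n */
    uint64_t saved_affinity;  /* affinity to restore when leaving depriv mode */
    int in_depriv;            /* non-zero while in de-privileged mode */
};

/* Pin the vcpu to its current pcpu before entering de-privileged mode. */
static void toy_depriv_enter(struct toy_vcpu *v)
{
    assert(!v->in_depriv && v->processor < 64);
    v->saved_affinity = v->hard_affinity;
    v->hard_affinity = UINT64_C(1) << v->processor;
    v->in_depriv = 1;
}

/* Restore the original affinity once the saved ring 0 stack is back in use. */
static void toy_depriv_exit(struct toy_vcpu *v)
{
    assert(v->in_depriv);
    v->hard_affinity = v->saved_affinity;
    v->in_depriv = 0;
}

/* Scheduler-side check: may this vcpu be placed on the given pcpu? */
static int toy_can_run_on(const struct toy_vcpu *v, unsigned int pcpu)
{
    return pcpu < 64 && ((v->hard_affinity >> pcpu) & 1);
}
```

While the vcpu is between toy_depriv_enter() and toy_depriv_exit(), the
scheduler-side check only accepts the pcpu it entered on, so the per-pcpu
pointers captured on the saved stack stay valid.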

However, needing to have hard affinity to a pcpu leads to the
following problem:
- An attacker could lock multiple vcpus to a single pcpu, leading to a
DoS. This could be achieved by spinning in a loop in Xen
de-privileged mode (assuming a bug in this mode) and performing this
operation on multiple vcpus at once. The attacker could wait until all
of their vcpus were on the same pcpu and then execute this attack.
This could cause the pcpu to, effectively, lock up, as it will be
under heavy load, and we would be unable to move work elsewhere.

A solution to the DoS would be to force migration to another pcpu once,
say, 100 quanta have passed with the vcpu still in de-privileged mode.
This forcing of migration would require us to
forcibly complete the de-privileged operation, and then, just before
returning into the guest, force a cpu change. We could not just force
a migration at the schedule call point as the Xen stack needs to
unwind to free up resources. We would reset this count each time we
completed a de-privileged mode operation.

A legitimate long-running de-privileged operation would trigger this
forced migration mechanism. However, it is unlikely that such
operations will be needed and the count can be adjusted appropriately
to mitigate this.

Any suggestions or feedback would be appreciated!


I don't see why any scheduling support is needed.

Currently all operations like this are run synchronously in the vmexit
context of the vcpu.  Any current DoS is already a real issue.


The point is that this work is supposed to mitigate (or eliminate) such
issues, so we would like to remove this existing real issue.

IOW while it might be expected that an in-Xen DM can DoS the system, an
in-Xen-ring3 DM should not be able to do so.

In any reasonable situation, emulation of a device is a small state
mutation and occasionally kicking off a further action to perform.  (The
far bigger risk from this kind of emulation is following bad
pointers/etc, rather than long loops.)

I think it would be entirely reasonable to have a deadline for a single
execution of depriv mode, after which the domain is declared malicious
and killed.


I think this could make sense, it's essentially a harsher variant of Ben's
suggestion to abort an attempt to process the MMIO in order to migrate to
another pcpu, but it has the benefit of being easier to implement and
easier to reason about in terms of interactions with other aspects of the
system (i.e. it seems to remove the need to think of ways an attacker might
game that other system).

We already have this for host pcpus - the watchdog defaults to 5
seconds.  Having a similar cutoff for depriv mode should be fine.


That's a reasonable analogy.

Perhaps we would want the depriv-watchdog to be some 1/N fraction of the
pcpu-watchdog, for a smallish N, to avoid the risk of any slop in the
timing allowing the pcpu watchdog to fire. N=3 for example (on the grounds
that N=2 is probably sufficient, so N=3 must be awesome).
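
To make the arithmetic concrete, a minimal sketch of such a deadline check
is below. Everything named toy_* or TOY_* is hypothetical and exists only
for this example; the 5 second figure is the pcpu watchdog default mentioned
above and N=3 is the value suggested here. Times are plain millisecond
counters rather than any real Xen time source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PCPU_WATCHDOG_MS 5000u  /* pcpu watchdog default quoted above */
#define TOY_DEPRIV_DIVISOR   3u     /* the "N" in the 1/N fraction */

/* Absolute deadline for a depriv-mode execution entered at entry_ms. */
static uint64_t toy_depriv_deadline_ms(uint64_t entry_ms)
{
    return entry_ms + TOY_PCPU_WATCHDOG_MS / TOY_DEPRIV_DIVISOR;
}

/* True once a single depriv-mode execution has overrun its deadline. */
static bool toy_depriv_expired(uint64_t entry_ms, uint64_t now_ms)
{
    assert(now_ms >= entry_ms);
    return now_ms >= toy_depriv_deadline_ms(entry_ms);
}
```

With these values the depriv budget is 5000 / 3 = 1666 ms (integer
division), which leaves well over three seconds of slack before the pcpu
watchdog itself would fire.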


+1

  -George

Thanks all! I'll do this then. Appreciate the feedback!

Ben

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
