Re: [PATCH 22/22] x86/mm: zero stack on stack switch or reset
On Mon, Jul 29, 2024 at 04:40:24PM +0100, Andrew Cooper wrote:
> On 26/07/2024 4:22 pm, Roger Pau Monne wrote:
> > With the stack mapped on a per-CPU basis there's no risk of other CPUs being
> > able to read the stack contents, but vCPUs running on the current pCPU could
> > read stack rubble from operations of previous vCPUs.
> >
> > The #DF stack is not zeroed because handling of #DF results in a panic.
> >
> > Signed-off-by: Roger Pau Monné <roger.pau@xxxxxxxxxx>
> > ---
> > xen/arch/x86/include/asm/current.h | 30 +++++++++++++++++++++++++++++-
> > 1 file changed, 29 insertions(+), 1 deletion(-)
> >
> > diff --git a/xen/arch/x86/include/asm/current.h b/xen/arch/x86/include/asm/current.h
> > index 75b9a341f814..02b4118b03ef 100644
> > --- a/xen/arch/x86/include/asm/current.h
> > +++ b/xen/arch/x86/include/asm/current.h
> > @@ -177,6 +177,14 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
> > # define SHADOW_STACK_WORK ""
> > #endif
> >
> > +#define ZERO_STACK \
> > + "test %[stk_size], %[stk_size];" \
> > + "jz .L_skip_zeroing.%=;" \
> > + "std;" \
> > + "rep stosb;" \
> > + "cld;" \
> > + ".L_skip_zeroing.%=:"
> > +
> > #if __GNUC__ >= 9
> > # define ssaj_has_attr_noreturn(fn) __builtin_has_attribute(fn, __noreturn__)
> > #else
> > @@ -187,10 +195,24 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
> > #define switch_stack_and_jump(fn, instr, constr) \
> > ({ \
> > unsigned int tmp; \
> > + bool zero_stack = current->domain->arch.asi; \
> > BUILD_BUG_ON(!ssaj_has_attr_noreturn(fn)); \
> > + ASSERT(IS_ALIGNED((unsigned long)guest_cpu_user_regs() - \
> > + PRIMARY_STACK_SIZE + \
> > + sizeof(struct cpu_info), PAGE_SIZE)); \
> > + if ( zero_stack ) \
> > + { \
> > + unsigned long stack_top = get_stack_bottom() & \
> > + ~(STACK_SIZE - 1); \
> > + \
> > + clear_page((void *)stack_top + IST_MCE * PAGE_SIZE); \
> > + clear_page((void *)stack_top + IST_NMI * PAGE_SIZE); \
> > + clear_page((void *)stack_top + IST_DB * PAGE_SIZE); \
> > + } \
> > __asm__ __volatile__ ( \
> > SHADOW_STACK_WORK \
> > "mov %[stk], %%rsp;" \
> > + ZERO_STACK \
> > CHECK_FOR_LIVEPATCH_WORK \
> > instr "[fun]" \
> > : [val] "=&r" (tmp), \
> > @@ -201,7 +223,13 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
> > ((PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - 8), \
> > [stack_mask] "i" (STACK_SIZE - 1), \
> > _ASM_BUGFRAME_INFO(BUGFRAME_bug, __LINE__, \
> > - __FILE__, NULL) \
> > + __FILE__, NULL), \
> > + /* For stack zeroing. */ \
> > + "D" ((void *)guest_cpu_user_regs() - 1), \
> > + [stk_size] "c" \
> > + (zero_stack ? PRIMARY_STACK_SIZE - sizeof(struct cpu_info)\
> > + : 0), \
> > + "a" (0) \
> > : "memory" ); \
> > unreachable(); \
> > })
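(Aside: ZERO_STACK relies on "std" making "rep stosb" walk downwards:
%rdi is primed with the byte just below the cpu_info block, via
"D" ((void *)guest_cpu_user_regs() - 1), and %rcx with the byte count,
which is 0 when zeroing is disabled and which the "jz" guard then
skips.  A minimal user-space sketch of the same idiom, with nothing
Xen-specific assumed:

    #include <stddef.h>
    #include <stdint.h>

    /* Zero a region from its top byte downwards, as ZERO_STACK does. */
    static void zero_region_downwards(void *base, size_t len)
    {
        uint8_t *top;

        if ( !len )            /* mirrors the "jz .L_skip_zeroing" guard */
            return;

        top = (uint8_t *)base + len - 1;

        asm volatile ( "std; rep stosb; cld"
                       : "+D" (top), "+c" (len)
                       : "a" (0)
                       : "memory" );
    }
)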
>
> This looks very expensive.
>
> For starters, switch_stack_and_jump() is used twice in a typical context
> switch; once in the schedule tail, and again out of hvm_do_resume().
Right, it's the reset_stack_and_call_ind() at the end of context
switch, and then the reset_stack_and_jump() in the HVM tail context
switch handlers.

One option would be to only do the stack zeroing from the
reset_stack_and_call_ind() call in context_switch().

I've got no idea how expensive this is; I might try to run some
benchmarks to get some figures. I was planning on running two VMs
with 1 vCPU each, both pinned to the same pCPU.
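(For a very rough idea of the raw zeroing cost, here is a user-space
sketch one could start from; the 32KiB buffer is an illustrative guess
at the amount ZERO_STACK clears, not a figure from this thread:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <x86intrin.h>

    /* Illustrative guess: roughly a primary stack's worth of bytes. */
    #define ZERO_BYTES (32u * 1024)

    static unsigned char buf[ZERO_BYTES];

    int main(void)
    {
        uint64_t best = UINT64_MAX;

        for ( unsigned int i = 0; i < 1000; i++ )
        {
            uint64_t t0 = __rdtsc();

            memset(buf, 0, sizeof(buf));
            /* Compiler barrier so the memset isn't optimised away. */
            asm volatile ( "" : : "r" (buf) : "memory" );

            uint64_t t1 = __rdtsc();

            if ( t1 - t0 < best )
                best = t1 - t0;
        }

        printf("best: %llu cycles to zero %u bytes\n",
               (unsigned long long)best, (unsigned int)ZERO_BYTES);
        return 0;
    }
)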
>
> Furthermore, #MC happen never (to many many significant figures), #DB
> happens never for HVM guests (but does happen for PV), and NMIs are
> either ~never, or 2Hz which is far less often than the 30ms default
> timeslice.
>
> So, the overwhelming majority of the time, those 3 calls to clear_page()
> will be re-zeroing blocks of zeroes.
>
> This can probably be avoided by making use of ist_exit (held in %r12) to
> only zero an IST stack when leaving it. This leaves the IRET frame able
> to be recovered, but with e.g. RFDS, you can do that irrespective, and
> it's not terribly sensitive.
I could look into that. TBH I was borderline about clearing the IST
stacks at all: I wasn't convinced there could be anything sensitive
there, but equally I couldn't convince myself that there's nothing
sensitive there now, or that there never will be in the future.
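(A hypothetical shape of that approach; the identifiers below are
invented here, ist_exit being the flag Andrew mentions and
maybe_zero_ist not a real Xen function:

    #include <stdbool.h>
    #include <string.h>

    #define PAGE_SIZE 4096

    /* Clear an IST stack only on the rare path that actually leaves it,
     * instead of unconditionally on every context switch. */
    static inline void maybe_zero_ist(void *ist_page, bool ist_exit)
    {
        if ( ist_exit )
            memset(ist_page, 0, PAGE_SIZE); /* clear_page() in Xen proper */
    }
)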
> What about shadow stacks? You're not zeroing those, and while they're
> less sensitive than the data stack, there ought to be some reasoning
> about them.
I've assumed that shadow stacks only contain the expected return
addresses, and hence wouldn't be considered sensitive information, but
maybe I was too lax.

An attacker could get execution traces of the previous vCPU, and that
might be useful for some exploits?
Thanks, Roger.