|
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [PATCH 3/3] xen/arm: stack check instrumentation
On 7/29/24 14:36, Julien Grall wrote:
> Hi Stewart,
>
> On 29/07/2024 15:24, Stewart Hildebrand wrote:
>> Use the -finstrument-functions option to check that the stack pointer is
>
> Is the feature supported by both clang and GCC?
Yes, I tested both.
> If so, which from versions?
gcc since at least 1998, not sure what version specifically.
clang since v2.8 (2010)
>
> From README, this is what we currently support.
>
> - For ARM 32-bit:
> - GCC 4.9 or later
> - GNU Binutils 2.24 or later
> - For ARM 64-bit:
> - GCC 5.1 or later
> - GNU Binutils 2.24 or later
>
> We don't mention Clang, but I would expect a clang from 4-5 years should
> still be Arm (not cross-compile as we never fully added the support).
There is a way cross compile with clang today, but still using gnu
linker and such. Here's my clang build rune:
make -j $(nproc) \
clang=y \
CC="clang --target=aarch64-none-linux-gnu -march=armv8a+nocrypto" \
CXX="clang++ --target=aarch64-none-linux-gnu -march=armv8a+nocrypto" \
HOSTCC=clang \
HOSTCXX=clang++ \
XEN_TARGET_ARCH=arm64 \
CROSS_COMPILE=aarch64-none-linux-gnu- \
dist-xen
>
>
>> valid upon each function entry. Only enable it for debug builds due to
>> the overhead introduced.
>
> How much overhead is it? We use debug builds during the dev cycle, so we need
> to make sure this is not noticeable.
Xen binary size without instrumentation: 1.3M
Xen binary size with instrumentation: 1.9M
Subjectively, the Xen boot time on ZCU102 hardware seems comparable.
On qemu-system-aarch64, unfortunately, I'm seeing about a 7x slowdown in
Xen boot time. It seems page allocation operations are particularly
affected.
>
> In any case, it would be better if this new option is under its own kconfig
> options. We can separately decide whether it is on or off by default when
> CONFIG_DEBUG=y.
Will do. Given the significant overhead when running on qemu, I think I
must reluctantly suggest that it be off by default. For CI runs with
real hardware, we could turn it on.
>
>>
>> Use per-cpu variables to store the stack base and nesting level. The
>> nesting level is used to avoid invoking __cyg_profile_func_enter
>> recursively.
>>
>> A check is added for whether per-cpu data has been initialized. Since
>> TPIDR_EL2 resets to an unknown value, initialize it to an invalid value.
>>
>> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@xxxxxxx>
>> ---
>> RFC: I only tested this on arm64. I still need to test with arm32.
>>
>> RFC: Should we put this under its own config option instead of reusing
>> CONFIG_DEBUG?
>
> See above.
>
>>
>> RFC: Is there a better value for INVALID_PER_CPU_OFFSET? We can't use 0
>> because 0 is a valid value for get_per_cpu_offset().
>
> I would consider to use ~0 because this is very unlikely to be used as an
> offset (at least on arm64).
Yep, ~0 works, will do.
>
>> ---
>> xen/arch/arm/arch.mk | 1 +
>> xen/arch/arm/arm64/head.S | 4 +++
>> xen/arch/arm/domain.c | 3 +++
>> xen/arch/arm/include/asm/page.h | 2 ++
>> xen/arch/arm/include/asm/traps.h | 8 ++++++
>> xen/arch/arm/setup.c | 4 +++
>> xen/arch/arm/smpboot.c | 3 +++
>> xen/arch/arm/traps.c | 45 ++++++++++++++++++++++++++++++++
>> 8 files changed, 70 insertions(+)
>>
>> diff --git a/xen/arch/arm/arch.mk b/xen/arch/arm/arch.mk
>> index 022dcda19224..c39cb65d183d 100644
>> --- a/xen/arch/arm/arch.mk
>> +++ b/xen/arch/arm/arch.mk
>> @@ -12,6 +12,7 @@ CFLAGS-$(CONFIG_ARM_32) += -mno-unaligned-access
>> CFLAGS-$(CONFIG_ARM_64) += -mcpu=generic
>> CFLAGS-$(CONFIG_ARM_64) += -mgeneral-regs-only # No fp registers etc
>> $(call cc-option-add,CFLAGS-$(CONFIG_ARM_64),CC,-mno-outline-atomics)
>> +CFLAGS-$(CONFIG_DEBUG) += -finstrument-functions
>> ifneq ($(filter command line environment,$(origin CONFIG_EARLY_PRINTK)),)
>> $(error You must use 'make menuconfig' to enable/disable early printk
>> now)
>> diff --git a/xen/arch/arm/arm64/head.S b/xen/arch/arm/arm64/head.S
>> index 2fa07dc3a04b..4a332b9cbc7c 100644
>> --- a/xen/arch/arm/arm64/head.S
>> +++ b/xen/arch/arm/arm64/head.S
>> @@ -399,6 +399,10 @@ FUNC_LOCAL(cpu_init)
>> * than SP_EL0.
>> */
>> msr spsel, #1
>> +
>
> I would add a comment explaining what we are doing.
Will do.
>
>> + ldr x0, =INVALID_PER_CPU_OFFSET
>> + msr tpidr_el2, x0
>> +
>> ret
>> END(cpu_init)
>> diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
>> index 7cfcefd27944..93ebe6e5f8af 100644
>> --- a/xen/arch/arm/domain.c
>> +++ b/xen/arch/arm/domain.c
>> @@ -26,6 +26,7 @@
>> #include <asm/procinfo.h>
>> #include <asm/regs.h>
>> #include <asm/tee/tee.h>
>> +#include <asm/traps.h>
>> #include <asm/vfp.h>
>> #include <asm/vgic.h>
>> #include <asm/vtimer.h>
>> @@ -328,6 +329,8 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
>> set_current(next);
>> + stack_set(next->arch.stack);
>> +
>> prev = __context_switch(prev, next);
>> schedule_tail(prev);
>> diff --git a/xen/arch/arm/include/asm/page.h
>> b/xen/arch/arm/include/asm/page.h
>> index 69f817d1e68a..6b5aaf1eb3ff 100644
>> --- a/xen/arch/arm/include/asm/page.h
>> +++ b/xen/arch/arm/include/asm/page.h
>> @@ -7,6 +7,8 @@
>> #include <asm/lpae.h>
>> #include <asm/sysregs.h>
>> +#define INVALID_PER_CPU_OFFSET _AC(0x1, UL)
>> +
>> /* Shareability values for the LPAE entries */
>> #define LPAE_SH_NON_SHAREABLE 0x0
>> #define LPAE_SH_UNPREDICTALE 0x1
>> diff --git a/xen/arch/arm/include/asm/traps.h
>> b/xen/arch/arm/include/asm/traps.h
>> index 9a60dbf70e5b..13e6997c2643 100644
>> --- a/xen/arch/arm/include/asm/traps.h
>> +++ b/xen/arch/arm/include/asm/traps.h
>> @@ -118,6 +118,14 @@ static inline register_t sign_extend(const struct
>> hsr_dabt dabt, register_t r)
>> void finalize_instr_emulation(const struct instr_details *instr);
>> +#ifdef CONFIG_DEBUG
>> +void stack_check_init(void);
>> +void stack_set(unsigned char *base);
>> +#else
>> +static inline void stack_check_init(void) { }
>> +static inline void stack_set(unsigned char *base) { }
>> +#endif
>> +
>> #endif /* __ASM_ARM_TRAPS__ */
>> /*
>> * Local variables:
>> diff --git a/xen/arch/arm/setup.c b/xen/arch/arm/setup.c
>> index 0c2fdaceaf21..07d07feff602 100644
>> --- a/xen/arch/arm/setup.c
>> +++ b/xen/arch/arm/setup.c
>> @@ -47,6 +47,7 @@
>> #include <asm/setup.h>
>> #include <xsm/xsm.h>
>> #include <asm/acpi.h>
>> +#include <asm/traps.h>
>> struct bootinfo __initdata bootinfo = BOOTINFO_INIT;
>> @@ -733,6 +734,8 @@ void asmlinkage __init start_xen(unsigned long
>> boot_phys_offset,
>> percpu_init_areas();
>> set_processor_id(0); /* needed early, for smp_processor_id() */
>> + stack_check_init();
>> +
>> /* Initialize traps early allow us to get backtrace when an error
>> occurred */
>> init_traps();
>> @@ -924,6 +927,7 @@ void asmlinkage __init start_xen(unsigned long
>> boot_phys_offset,
>> * since the static one we're running on is about to be freed. */
>> memcpy(idle_vcpu[0]->arch.cpu_info, get_cpu_info(),
>> sizeof(struct cpu_info));
>> + stack_set(idle_vcpu[0]->arch.stack);
>> switch_stack_and_jump(idle_vcpu[0]->arch.cpu_info, init_done);
>> }
>> diff --git a/xen/arch/arm/smpboot.c b/xen/arch/arm/smpboot.c
>> index 04e363088d60..1c689f2caed7 100644
>> --- a/xen/arch/arm/smpboot.c
>> +++ b/xen/arch/arm/smpboot.c
>> @@ -30,6 +30,7 @@
>> #include <asm/psci.h>
>> #include <asm/acpi.h>
>> #include <asm/tee/tee.h>
>> +#include <asm/traps.h>
>> /* Override macros from asm/page.h to make them work with mfn_t */
>> #undef virt_to_mfn
>> @@ -329,6 +330,8 @@ void asmlinkage start_secondary(void)
>> set_processor_id(cpuid);
>> + stack_check_init();
>> +
>> identify_cpu(¤t_cpu_data);
>> processor_setup();
>> diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c
>> index aac6c599f878..b4890eec7ec4 100644
>> --- a/xen/arch/arm/traps.c
>> +++ b/xen/arch/arm/traps.c
>> @@ -2325,6 +2325,51 @@ void asmlinkage leave_hypervisor_to_guest(void)
>> arm_smccc_1_1_smc(ARM_SMCCC_ARCH_WORKAROUND_2_FID, 0, NULL);
>> }
>> +#ifdef CONFIG_DEBUG
>
> Loooking at the code below. Aside the stack pointer part, nothing seems
> specific to Arm. So maybe this could be moved to common code?
Yes, I suppose so.
>
>> +DEFINE_PER_CPU(unsigned int, stack_check_nesting);
>> +DEFINE_PER_CPU(unsigned char *, stack_base);
>
> I think this could be "const unsigned char *" as the stack should not be
> modified directly.
Every time there's a vcpu context switch we will have a new stack.
>
>> +
>> +void __attribute__((no_instrument_function)) stack_set(unsigned char *base)
>> +{
>> + this_cpu(stack_base) = base;
>> +}
>> +
>> +void __init __attribute__((no_instrument_function)) stack_check_init(void)
>> +{
>> + this_cpu(stack_check_nesting) = 0;
>> + stack_set(init_data.stack);
>> +}
>> +
>> +__attribute__((no_instrument_function))
>> +void __cyg_profile_func_enter(void *this_fn, void *call_site)
>> +{
>> + unsigned char *sp;
>> +
>> + if ( get_per_cpu_offset() == INVALID_PER_CPU_OFFSET )
>> + return;
>> +
>> + asm volatile ("mov %0, sp" : "=r" (sp) );
>> +
>> + if ( sp < this_cpu(stack_base) ||
>> + sp > (this_cpu(stack_base) + STACK_SIZE) )
>
> The top of the stack is used to store struct cpu_info. So you want to
> substract its size (see arch_vcpu_create()).
Will do.
>
>> + {
>> + if ( this_cpu(stack_check_nesting) )
>> + return;
>> +
>> + this_cpu(stack_check_nesting)++;
>
> Looking at the use, it only seems to be used as "print/panic once". So maybe
> use a bool to avoid any overflow.
It will only ever be incremented once. I'll still change it to a bool,
this should make it more obvious.
>
>> + printk("CPU %d stack pointer out of bounds (sp %#lx, stack base
>> %#lx)\n",
>> + smp_processor_id(), (uintptr_t)sp,
>> + (uintptr_t)this_cpu(stack_base));
>> + BUG();
>
> I would consider to call panic().
panic() alone doesn't show the stack trace / call trace.
> But is it safe to call any of this if we blew the stack?
Nope, it sure isn't!
> IOW, should we have a buffer?
Yes. After some experimentation, I found that this printk and a WARN
(similar to BUG, but resumes execution and allows me to collect these
metrics) uses approximately 1632 bytes of stack. Assuming BUG uses a
similar amount of stack as WARN, and adding in a comfortable margin for
error, I'll add a 4096 byte buffer (i.e. invoke the print/BUG with 4096
bytes remaining on the stack).
|
![]() |
Lists.xenproject.org is hosted with RackSpace, monitoring our |