Xen project Mailing List

Re: [Minios-devel] [UNIKRAFT PATCH] plat/kvm: Fix current thread retrieval in interrupt context on x86_64

To: Simon Kuenzer <simon.kuenzer@xxxxxxxxx>, minios-devel@xxxxxxxxxxxxx

From: Costin Lupu <costin.lup@xxxxxxxxx>

Date: Wed, 28 Aug 2019 17:40:22 +0300

Cc: felipe.huici@xxxxxxxxx, sharan.santhanam@xxxxxxxxx

Delivery-date: Wed, 28 Aug 2019 14:40:30 +0000

List-id: Mini-os development list <minios-devel.lists.xenproject.org>

On 8/28/19 1:28 PM, Simon Kuenzer wrote: > > > On 20.06.19 12:30, Costin Lupu wrote: >> Hi Simon, >> >> Please see inline. >> >> On 6/19/19 12:35 AM, Simon Kuenzer wrote: >>> Hey Costin, >>> >>> On 17.06.19 21:15, Costin Lupu wrote: >>>> On 6/17/19 5:15 PM, Simon Kuenzer wrote: >>>>> Hey Costin, >>>>> >>>>> thanks a lot for the patch. I have a couple of questions which I need >>>>> for my understanding. I put those inline... >>>>> >>>>> Thanks, >>>>> >>>>> Simon >>>>> >>>>> On 11.06.19 21:33, Costin Lupu wrote: >>>>>> Commit 017fffd5 introduced support for setting the current thread >>>>>> pointer on top >>>>>> of interrupt stacks in order to retrieve the current thread in >>>>>> interrupt context >>>>>> as well. Unfortunately, the wrong stack was picked for KVM platform. >>>>>> This patch >>>>>> fixes that and sets the thread on cpu_intr_stack instead. >>>>>> >>>>>> cpu_intr_stack was resized to STACK_SIZE because this is a mandatory >>>>>> condition >>>>>> when saving threads on top. However, given that it also needs a >>>>>> STACK_SIZE >>>>>> alignment, a new section was created for it, .intrstack, in order to >>>>>> avoid >>>>>> breaking the entire binary image layout. Without this new section, >>>>>> the >>>>>> entire >>>>>> .text section would have a STACK_SIZE alignment (i.e. 64KB) and this >>>>>> would imply >>>>>> that the multiboot header, which is included in .text section, would >>>>>> also be >>>>>> moved at an address higher than STACK_SIZE, even though it must stay >>>>>> in the >>>>>> first 8KB of the binary. >>>>> >>>>> Maybe, this is a stupid question: What if we take the pre-allocated >>>>> bootstack also for the interrupts? As soon as we enable scheduling the >>>>> bootstack is not used anymore by any thread. Except the case where >>>>> we do >>>>> not have scheduling, the bootstack will be still used during life >>>>> time, >>>>> would this be an issue? >>>>> >>>> >>>> In theory this should work. But what about the case when we don't have >>>> scheduling? Wouldn't we still need a different stack? >>>> >>> >>> Hum... this should be fine, I expect the CPU can handle this. >> >> What do you mean? The CPU will handle it properly only if you provide it >> a different stack on which it will save the registers of the interrupted >> context. >> >>> But we have actually also stacks for the NMI and the trap handler. So >>> maybe this suggestion is not generic enough. >> >> Of course it's not generic enough. One way would be to align NMI and >> trap stacks as well, but that would just bloat the binary. Fortunately, >> (1) the NMI stack isn't used and (2) current thread is not needed when >> traps are handled, so we can skip updating these 2 stacks for now. >> >>>>>> >>>>>> Signed-off-by: Costin Lupu <costin.lupu@xxxxxxxxx> >>>>>> --- >>>>>> plat/kvm/Makefile.uk | 1 + >>>>>> plat/kvm/memory.c | 6 ------ >>>>>> plat/kvm/x86/link64.lds.S | 10 ++++++++++ >>>>>> plat/kvm/x86/memory.c | 44 >>>>>> ++++++++++++++++++++++++++++++++++++++++++++ >>>>>> plat/kvm/x86/traps.c | 5 ++++- >>>>>> 5 files changed, 59 insertions(+), 7 deletions(-) >>>>>> create mode 100644 plat/kvm/x86/memory.c >>>>>> >>>>>> diff --git a/plat/kvm/Makefile.uk b/plat/kvm/Makefile.uk >>>>>> index 71c4c419..8eb162d4 100644 >>>>>> --- a/plat/kvm/Makefile.uk >>>>>> +++ b/plat/kvm/Makefile.uk >>>>>> @@ -46,6 +46,7 @@ LIBKVMPLAT_SRCS-$(CONFIG_ARCH_X86_64) += >>>>>> $(LIBKVMPLAT_BASE)/x86/lcpu.c >>>>>> LIBKVMPLAT_SRCS-$(CONFIG_ARCH_X86_64) += >>>>>> $(LIBKVMPLAT_BASE)/x86/intctrl.c >>>>>> LIBKVMPLAT_SRCS-$(CONFIG_ARCH_X86_64) += >>>>>> $(LIBKVMPLAT_BASE)/x86/tscclock.c >>>>>> LIBKVMPLAT_SRCS-$(CONFIG_ARCH_X86_64) += >>>>>> $(LIBKVMPLAT_BASE)/x86/time.c >>>>>> +LIBKVMPLAT_SRCS-$(CONFIG_ARCH_X86_64) += >>>>>> $(LIBKVMPLAT_BASE)/x86/memory.c|x86 >>>>>> ifeq ($(findstring y,$(CONFIG_KVM_KERNEL_VGA_CONSOLE) >>>>>> $(CONFIG_KVM_DEBUG_VGA_CONSOLE)),y) >>>>>> LIBKVMPLAT_SRCS-$(CONFIG_ARCH_X86_64) += >>>>>> $(LIBKVMPLAT_BASE)/x86/vga_console.c >>>>>> endif >>>>>> diff --git a/plat/kvm/memory.c b/plat/kvm/memory.c >>>>>> index e96bad2c..7f2fb46a 100644 >>>>>> --- a/plat/kvm/memory.c >>>>>> +++ b/plat/kvm/memory.c >>>>>> @@ -178,9 +178,3 @@ int _ukplat_mem_mappings_init(void) >>>>>> { >>>>>> return 0; >>>>>> } >>>>>> - >>>>>> -void ukplat_stack_set_current_thread(void *thread_addr) >>>>>> -{ >>>>>> - *((unsigned long *) _libkvmplat_cfg.bstack.end) = >>>>>> - (unsigned long) thread_addr; >>>>>> -} >>>>>> diff --git a/plat/kvm/x86/link64.lds.S b/plat/kvm/x86/link64.lds.S >>>>>> index 362ba3e6..6103fc2d 100644 >>>>>> --- a/plat/kvm/x86/link64.lds.S >>>>>> +++ b/plat/kvm/x86/link64.lds.S >>>>>> @@ -99,6 +99,16 @@ SECTIONS >>>>>> . = ALIGN(__PAGE_SIZE); >>>>>> } >>>>>> + /* We keep the interrupt stack on a different section >>>>>> + * given that it may have a big alignment and it would >>>>>> + * change the entire binary layout >>>>>> + */ >>>>>> + .intrstack : >>>>>> + { >>>>>> + *(.intrstack) >>>>>> + . = ALIGN(__PAGE_SIZE); >>>>>> + } >>>>>> + >>>>> >>>>> Would every platform need to do this? >>>>> >>>> >>>> On Xen, the interrupt is already aligned to STACK_SIZE. On linuxu, the >>>> signals use the current process stack, we should change it to using an >>>> alternative stack if we want to use the same approach. >>>> >>>>>> _end = .; >>>>>> .comment 0 : { *(.comment) } >>>>>> diff --git a/plat/kvm/x86/memory.c b/plat/kvm/x86/memory.c >>>>>> new file mode 100644 >>>>>> index 00000000..b8c7c7e7 >>>>>> --- /dev/null >>>>>> +++ b/plat/kvm/x86/memory.c >>>>>> @@ -0,0 +1,44 @@ >>>>>> +/* SPDX-License-Identifier: BSD-3-Clause */ >>>>>> +/* >>>>>> + * Authors: Costin Lupu <costin.lupu@xxxxxxxxx> >>>>>> + * >>>>>> + * Copyright (c) 2019, University Politehnica of Bucharest. All >>>>>> rights reserved. >>>>>> + * >>>>>> + * Redistribution and use in source and binary forms, with or >>>>>> without >>>>>> + * modification, are permitted provided that the following >>>>>> conditions >>>>>> + * are met: >>>>>> + * >>>>>> + * 1. Redistributions of source code must retain the above copyright >>>>>> + * notice, this list of conditions and the following disclaimer. >>>>>> + * 2. Redistributions in binary form must reproduce the above >>>>>> copyright >>>>>> + * notice, this list of conditions and the following >>>>>> disclaimer in >>>>>> the >>>>>> + * documentation and/or other materials provided with the >>>>>> distribution. >>>>>> + * 3. Neither the name of the copyright holder nor the names of its >>>>>> + * contributors may be used to endorse or promote products >>>>>> derived >>>>>> from >>>>>> + * this software without specific prior written permission. >>>>>> + * >>>>>> + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND >>>>>> CONTRIBUTORS "AS IS" >>>>>> + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED >>>>>> TO, THE >>>>>> + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A >>>>>> PARTICULAR >>>>>> PURPOSE >>>>>> + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR >>>>>> CONTRIBUTORS BE >>>>>> + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, >>>>>> EXEMPLARY, OR >>>>>> + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, >>>>>> PROCUREMENT OF >>>>>> + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR >>>>>> BUSINESS >>>>>> + * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, >>>>>> WHETHER IN >>>>>> + * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR >>>>>> OTHERWISE) >>>>>> + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF >>>>>> ADVISED OF THE >>>>>> + * POSSIBILITY OF SUCH DAMAGE. >>>>>> + * >>>>>> + * THIS HEADER MAY NOT BE EXTRACTED OR MODIFIED IN ANY WAY. >>>>>> + */ >>>>>> + >>>>>> +#include <uk/plat/memory.h> >>>>>> + >>>>>> + >>>>>> +extern char cpu_intr_stack[]; >>>>>> + >>>>>> +void ukplat_stack_set_current_thread(void *thread_addr) >>>>>> +{ >>>>>> + *((unsigned long *) cpu_intr_stack) = >>>>>> + (unsigned long) thread_addr; >>>>> >>>>> In general, instead of using the stack top, we could save the current >>>>> thread always on the stack bottom. Those bytes would be reserved and >>>>> would only be popped on a misbehaving program. Would that avoid the >>>>> issue that you are facing? Maybe we could get away from the problem >>>>> that >>>>> the stacks have to have a fixed size for just getting the current >>>>> value... >>>>> >>>> >>>> How do you determine the stack bottom when you are using it, e.g. when >>>> retrieving the current thread? I think you would use the same way as >>>> for >>>> getting the current thread from the stack top. So you still need this >>>> alignment constraint. >>>> >>> >>> You are right. >>> >>>>>> +} >>>>>> diff --git a/plat/kvm/x86/traps.c b/plat/kvm/x86/traps.c >>>>>> index 27ef6d93..fe1dd5a4 100644 >>>>>> --- a/plat/kvm/x86/traps.c >>>>>> +++ b/plat/kvm/x86/traps.c >>>>>> @@ -25,7 +25,9 @@ >>>>>> */ >>>>>> #include <string.h> >>>>>> +#include <uk/essentials.h> >>>>>> #include <uk/arch/lcpu.h> >>>>>> +#include <uk/plat/config.h> >>>>>> #include <x86/desc.h> >>>>>> #include <kvm-x86/traps.h> >>>>>> @@ -59,7 +61,8 @@ static void gdt_init(void) >>>>>> static struct tss64 cpu_tss; >>>>>> -static char cpu_intr_stack[4096]; /* IST1 */ >>>>>> + __section(".intrstack") __align(STACK_SIZE) >>>>>> +char cpu_intr_stack[STACK_SIZE]; /* IST1 */ >>>>> >>>>> You removed the stack actually from the .bss section (instead of text) >>>>> and moved it to an own section. What if you keep it on the .bss but >>>>> with >>>>> the bigger size? I expect this should not be a problem for the text >>>>> section. >>>>> >>>> >>>> In the commit message it should have been '.text segment', or the first >>>> segment of the binary, which contains both .text and .bss sections. So, >>>> yeah, it's the segment that gets realigned. If we don't move it to >>>> another section, it would remain in .bss, which is what this fix tries >>>> to avoid. If we move it to a new section, it will also be moved to a >>>> different segment and that's how the segment containing .text will keep >>>> its original alignment. >>>> >>> >>> I am still not getting exactly why the .bss section has influence on the >>> positioning on the .text section. The multiboot header is just one >>> exceptional data section that we put into the beginning of the .text >>> section. The rest should still go to its respective section. What is the >>> bad thing about using the .bss section? >>> >> >> Aligning cpu_intr_stack to 0x10000 also aligns .bss section to 0x10000. >> For reasons I do not know, the linker also decides to move the .text >> section to 0x10000 offset inside the binary file (please see >> attachment). Now this is bad, because the .text section also contains >> the multiboot information which must stay in the first 0x2000 bytes of >> the binary. Moreover, the segment which will contain both sections in >> the end will be aligned to 0x10000. >> >> One solution would be to move multiboot information in a section of its >> own, out of the .text section. But unfortunately, we still cannot >> control the offset where the linker will decide to put it in the binary. >> From what I could find, the only way to fix that would be to put it with >> objcopy at the beginning of the binary. >> >> Therefore, the least intrusive solution would be to just move the >> aligned interrupt stack in a section of its own and the rest of the >> binary would just keep its original layout. >> >>> Anyways, since we need a quick fix, I would suggest something else. The >>> problem I have with the irq_stack section is that it complicates the >>> linked layout. Additionally for being complete, the nmi and trap stack >>> would also need to go in there. The other problem we have is that the >>> stack sizes and/or their alignment need to be fulfilled. >>> - If we put the thread context pointer on top of the stack (low >>> address), all stacks need to be exactly sized and aligned to >>> STACK_SIZE. >>> - If we put this pointer to the stack bottom (high address), we need >>> to make sure that the high address is aligned to STACK_SIZE. >>> STACK_SIZE becomes automatically the maximum stack size but we could >>> have stacks that are smaller sized than this. However, I am not >>> sure how we could teach the alignment to the linker. >>> >>> Although I do not like the alternative but it may solve our bug right >>> now: What if we store the current thread pointer to a platform-internal >>> variable instead of the stack. On every context switch, we update this >>> variable and get_current() is returning its value. In principle, with >>> SMP you would need to introduce this variable CPU-wise but we do not >>> have this now. This way we could keep the stack sizes that we currently >>> have. >>> I think we have to revisit this low-level platform API at some point >>> considering all the lessons we have learned and that longer-term we >>> should also protect the each stacks from over- and underflows (for >>> example with an unmapped page before and after each stack). But this >>> involves a bigger restructuring anyways and this we should not do with >>> this patch. >>> >>> What do you think? >>> >> >> This is something we need to do especially if we want to support >> different stack sizes in the system. I also had this discussion with >> Florian a few weeks ago about how this is also necessary in order to >> avoid some races in the scheduler for which we had to apply a workaround. >> >> However, this needs a more complex analysis and for sure it would have a >> big impact, with potential side effects, for the whole system. I would >> wait to do that until we'll have a proper regression testing system. >> >> This current fix is the simplest one for now and it doesn't bring any >> side effects. I strongly believe this is the best solution for now. > > Hum. Understood. Yeah, I agree we need to revisit this when we are going > re-visiting the API and re-architecture the platform interfaces. As far > as I can see, we are currently using the same technique to save and get > the current thread context also on the other platforms. Because of > consistency reasons, it would not be wise to change this to something > completely different for the KVM platform. > > Could you make sure with a v2 that also the trap stack is properly sized > and aligned? This implies storing the current thread pointer twice, > right? I am not really a fan of this but if we want to keep get_current > in its current form in order to avoid further implication to uksched and > the other platforms for now, we should at least have a TODO comment > there that explains our reason. I expect we will have a better suited > platform API later that will avoid this, so this would be anyways > temporary. Alright, I sent a v2 on the list including amendment according to your comments. Cheers, Costin _______________________________________________ Minios-devel mailing list Minios-devel@xxxxxxxxxxxxxxxxxxxx https://lists.xenproject.org/mailman/listinfo/minios-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.