[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Minios-devel] [UNIKRAFT PATCH] plat/kvm: Fix current thread retrieval in interrupt context on x86_64



On 8/28/19 1:28 PM, Simon Kuenzer wrote:
> 
> 
> On 20.06.19 12:30, Costin Lupu wrote:
>> Hi Simon,
>>
>> Please see inline.
>>
>> On 6/19/19 12:35 AM, Simon Kuenzer wrote:
>>> Hey Costin,
>>>
>>> On 17.06.19 21:15, Costin Lupu wrote:
>>>> On 6/17/19 5:15 PM, Simon Kuenzer wrote:
>>>>> Hey Costin,
>>>>>
>>>>> thanks a lot for the patch. I have a couple of questions which I need
>>>>> for my understanding. I put those inline...
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Simon
>>>>>
>>>>> On 11.06.19 21:33, Costin Lupu wrote:
>>>>>> Commit 017fffd5 introduced support for setting the current thread
>>>>>> pointer on top
>>>>>> of interrupt stacks in order to retrieve the current thread in
>>>>>> interrupt context
>>>>>> as well. Unfortunately, the wrong stack was picked for KVM platform.
>>>>>> This patch
>>>>>> fixes that and sets the thread on cpu_intr_stack instead.
>>>>>>
>>>>>> cpu_intr_stack was resized to STACK_SIZE because this is a mandatory
>>>>>> condition
>>>>>> when saving threads on top. However, given that it also needs a
>>>>>> STACK_SIZE
>>>>>> alignment, a new section was created for it, .intrstack, in order to
>>>>>> avoid
>>>>>> breaking the entire binary image layout. Without this new section,
>>>>>> the
>>>>>> entire
>>>>>> .text section would have a STACK_SIZE alignment (i.e. 64KB) and this
>>>>>> would imply
>>>>>> that the multiboot header, which is included in .text section, would
>>>>>> also be
>>>>>> moved at an address higher than STACK_SIZE, even though it must stay
>>>>>> in the
>>>>>> first 8KB of the binary.
>>>>>
>>>>> Maybe, this is a stupid question: What if we take the pre-allocated
>>>>> bootstack also for the interrupts? As soon as we enable scheduling the
>>>>> bootstack is not used anymore by any thread. Except the case where
>>>>> we do
>>>>> not have scheduling, the bootstack will be still used during life
>>>>> time,
>>>>> would this be an issue?
>>>>>
>>>>
>>>> In theory this should work. But what about the case when we don't have
>>>> scheduling? Wouldn't we still need a different stack?
>>>>
>>>
>>> Hum... this should be fine, I expect the CPU can handle this.
>>
>> What do you mean? The CPU will handle it properly only if you provide it
>> a different stack on which it will save the registers of the interrupted
>> context.
>>
>>> But we have actually also stacks for the NMI and the trap handler. So
>>> maybe this suggestion is not generic enough.
>>
>> Of course it's not generic enough. One way would be to align NMI and
>> trap stacks as well, but that would just bloat the binary. Fortunately,
>> (1) the NMI stack isn't used and (2) current thread is not needed when
>> traps are handled, so we can skip updating these 2 stacks for now.
>>
>>>>>>
>>>>>> Signed-off-by: Costin Lupu <costin.lupu@xxxxxxxxx>
>>>>>> ---
>>>>>>     plat/kvm/Makefile.uk      |  1 +
>>>>>>     plat/kvm/memory.c         |  6 ------
>>>>>>     plat/kvm/x86/link64.lds.S | 10 ++++++++++
>>>>>>     plat/kvm/x86/memory.c     | 44
>>>>>> ++++++++++++++++++++++++++++++++++++++++++++
>>>>>>     plat/kvm/x86/traps.c      |  5 ++++-
>>>>>>     5 files changed, 59 insertions(+), 7 deletions(-)
>>>>>>     create mode 100644 plat/kvm/x86/memory.c
>>>>>>
>>>>>> diff --git a/plat/kvm/Makefile.uk b/plat/kvm/Makefile.uk
>>>>>> index 71c4c419..8eb162d4 100644
>>>>>> --- a/plat/kvm/Makefile.uk
>>>>>> +++ b/plat/kvm/Makefile.uk
>>>>>> @@ -46,6 +46,7 @@ LIBKVMPLAT_SRCS-$(CONFIG_ARCH_X86_64) +=
>>>>>> $(LIBKVMPLAT_BASE)/x86/lcpu.c
>>>>>>     LIBKVMPLAT_SRCS-$(CONFIG_ARCH_X86_64) +=
>>>>>> $(LIBKVMPLAT_BASE)/x86/intctrl.c
>>>>>>     LIBKVMPLAT_SRCS-$(CONFIG_ARCH_X86_64) +=
>>>>>> $(LIBKVMPLAT_BASE)/x86/tscclock.c
>>>>>>     LIBKVMPLAT_SRCS-$(CONFIG_ARCH_X86_64) +=
>>>>>> $(LIBKVMPLAT_BASE)/x86/time.c
>>>>>> +LIBKVMPLAT_SRCS-$(CONFIG_ARCH_X86_64) +=
>>>>>> $(LIBKVMPLAT_BASE)/x86/memory.c|x86
>>>>>>     ifeq ($(findstring y,$(CONFIG_KVM_KERNEL_VGA_CONSOLE)
>>>>>> $(CONFIG_KVM_DEBUG_VGA_CONSOLE)),y)
>>>>>>     LIBKVMPLAT_SRCS-$(CONFIG_ARCH_X86_64) +=
>>>>>> $(LIBKVMPLAT_BASE)/x86/vga_console.c
>>>>>>     endif
>>>>>> diff --git a/plat/kvm/memory.c b/plat/kvm/memory.c
>>>>>> index e96bad2c..7f2fb46a 100644
>>>>>> --- a/plat/kvm/memory.c
>>>>>> +++ b/plat/kvm/memory.c
>>>>>> @@ -178,9 +178,3 @@ int _ukplat_mem_mappings_init(void)
>>>>>>     {
>>>>>>         return 0;
>>>>>>     }
>>>>>> -
>>>>>> -void ukplat_stack_set_current_thread(void *thread_addr)
>>>>>> -{
>>>>>> -    *((unsigned long *) _libkvmplat_cfg.bstack.end) =
>>>>>> -        (unsigned long) thread_addr;
>>>>>> -}
>>>>>> diff --git a/plat/kvm/x86/link64.lds.S b/plat/kvm/x86/link64.lds.S
>>>>>> index 362ba3e6..6103fc2d 100644
>>>>>> --- a/plat/kvm/x86/link64.lds.S
>>>>>> +++ b/plat/kvm/x86/link64.lds.S
>>>>>> @@ -99,6 +99,16 @@ SECTIONS
>>>>>>             . = ALIGN(__PAGE_SIZE);
>>>>>>         }
>>>>>>     +    /* We keep the interrupt stack on a different section
>>>>>> +     * given that it may have a big alignment and it would
>>>>>> +     * change the entire binary layout
>>>>>> +     */
>>>>>> +    .intrstack :
>>>>>> +    {
>>>>>> +        *(.intrstack)
>>>>>> +        . = ALIGN(__PAGE_SIZE);
>>>>>> +    }
>>>>>> +
>>>>>
>>>>> Would every platform need to do this?
>>>>>
>>>>
>>>> On Xen, the interrupt is already aligned to STACK_SIZE. On linuxu, the
>>>> signals use the current process stack, we should change it to using an
>>>> alternative stack if we want to use the same approach.
>>>>
>>>>>>         _end = .;
>>>>>>           .comment       0 : { *(.comment) }
>>>>>> diff --git a/plat/kvm/x86/memory.c b/plat/kvm/x86/memory.c
>>>>>> new file mode 100644
>>>>>> index 00000000..b8c7c7e7
>>>>>> --- /dev/null
>>>>>> +++ b/plat/kvm/x86/memory.c
>>>>>> @@ -0,0 +1,44 @@
>>>>>> +/* SPDX-License-Identifier: BSD-3-Clause */
>>>>>> +/*
>>>>>> + * Authors: Costin Lupu <costin.lupu@xxxxxxxxx>
>>>>>> + *
>>>>>> + * Copyright (c) 2019, University Politehnica of Bucharest. All
>>>>>> rights reserved.
>>>>>> + *
>>>>>> + * Redistribution and use in source and binary forms, with or
>>>>>> without
>>>>>> + * modification, are permitted provided that the following
>>>>>> conditions
>>>>>> + * are met:
>>>>>> + *
>>>>>> + * 1. Redistributions of source code must retain the above copyright
>>>>>> + *    notice, this list of conditions and the following disclaimer.
>>>>>> + * 2. Redistributions in binary form must reproduce the above
>>>>>> copyright
>>>>>> + *    notice, this list of conditions and the following
>>>>>> disclaimer in
>>>>>> the
>>>>>> + *    documentation and/or other materials provided with the
>>>>>> distribution.
>>>>>> + * 3. Neither the name of the copyright holder nor the names of its
>>>>>> + *    contributors may be used to endorse or promote products
>>>>>> derived
>>>>>> from
>>>>>> + *    this software without specific prior written permission.
>>>>>> + *
>>>>>> + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND
>>>>>> CONTRIBUTORS "AS IS"
>>>>>> + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
>>>>>> TO, THE
>>>>>> + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
>>>>>> PARTICULAR
>>>>>> PURPOSE
>>>>>> + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR
>>>>>> CONTRIBUTORS BE
>>>>>> + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
>>>>>> EXEMPLARY, OR
>>>>>> + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
>>>>>> PROCUREMENT OF
>>>>>> + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
>>>>>> BUSINESS
>>>>>> + * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
>>>>>> WHETHER IN
>>>>>> + * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
>>>>>> OTHERWISE)
>>>>>> + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
>>>>>> ADVISED OF THE
>>>>>> + * POSSIBILITY OF SUCH DAMAGE.
>>>>>> + *
>>>>>> + * THIS HEADER MAY NOT BE EXTRACTED OR MODIFIED IN ANY WAY.
>>>>>> + */
>>>>>> +
>>>>>> +#include <uk/plat/memory.h>
>>>>>> +
>>>>>> +
>>>>>> +extern char cpu_intr_stack[];
>>>>>> +
>>>>>> +void ukplat_stack_set_current_thread(void *thread_addr)
>>>>>> +{
>>>>>> +    *((unsigned long *) cpu_intr_stack) =
>>>>>> +        (unsigned long) thread_addr;
>>>>>
>>>>> In general, instead of using the stack top, we could save the current
>>>>> thread always on the stack bottom. Those bytes would be reserved and
>>>>> would only be popped on a misbehaving program. Would that avoid the
>>>>> issue that you are facing? Maybe we could get away from the problem
>>>>> that
>>>>> the stacks have to have a fixed size for just getting the current
>>>>> value...
>>>>>
>>>>
>>>> How do you determine the stack bottom when you are using it, e.g. when
>>>> retrieving the current thread? I think you would use the same way as
>>>> for
>>>> getting the current thread from the stack top. So you still need this
>>>> alignment constraint.
>>>>
>>>
>>> You are right.
>>>
>>>>>> +}
>>>>>> diff --git a/plat/kvm/x86/traps.c b/plat/kvm/x86/traps.c
>>>>>> index 27ef6d93..fe1dd5a4 100644
>>>>>> --- a/plat/kvm/x86/traps.c
>>>>>> +++ b/plat/kvm/x86/traps.c
>>>>>> @@ -25,7 +25,9 @@
>>>>>>      */
>>>>>>       #include <string.h>
>>>>>> +#include <uk/essentials.h>
>>>>>>     #include <uk/arch/lcpu.h>
>>>>>> +#include <uk/plat/config.h>
>>>>>>     #include <x86/desc.h>
>>>>>>     #include <kvm-x86/traps.h>
>>>>>>     @@ -59,7 +61,8 @@ static void gdt_init(void)
>>>>>>       static struct tss64 cpu_tss;
>>>>>>     -static char cpu_intr_stack[4096]; /* IST1 */
>>>>>> + __section(".intrstack")  __align(STACK_SIZE)
>>>>>> +char cpu_intr_stack[STACK_SIZE];  /* IST1 */
>>>>>
>>>>> You removed the stack actually from the .bss section (instead of text)
>>>>> and moved it to an own section. What if you keep it on the .bss but
>>>>> with
>>>>> the bigger size? I expect this should not be a problem for the text
>>>>> section.
>>>>>
>>>>
>>>> In the commit message it should have been '.text segment', or the first
>>>> segment of the binary, which contains both .text and .bss sections. So,
>>>> yeah, it's the segment that gets realigned. If we don't move it to
>>>> another section, it would remain in .bss, which is what this fix tries
>>>> to avoid. If we move it to a new section, it will also be moved to a
>>>> different segment and that's how the segment containing .text will keep
>>>> its original alignment.
>>>>
>>>
>>> I am still not getting exactly why the .bss section has influence on the
>>> positioning on the .text section. The multiboot header is just one
>>> exceptional data section that we put into the beginning of the .text
>>> section. The rest should still go to its respective section. What is the
>>> bad thing about using the .bss section?
>>>
>>
>> Aligning cpu_intr_stack to 0x10000 also aligns .bss section to 0x10000.
>> For reasons I do not know, the linker also decides to move the .text
>> section to 0x10000 offset inside the binary file (please see
>> attachment). Now this is bad, because the .text section also contains
>> the multiboot information which must stay in the first 0x2000 bytes of
>> the binary. Moreover, the segment which will contain both sections in
>> the end will be aligned to 0x10000.
>>
>> One solution would be to move multiboot information in a section of its
>> own, out of the .text section. But unfortunately, we still cannot
>> control the offset where the linker will decide to put it in the binary.
>>  From what I could find, the only way to fix that would be to put it with
>> objcopy at the beginning of the binary.
>>
>> Therefore, the least intrusive solution would be to just move the
>> aligned interrupt stack in a section of its own and the rest of the
>> binary would just keep its original layout.
>>
>>> Anyways, since we need a quick fix, I would suggest something else. The
>>> problem I have with the irq_stack section is that it complicates the
>>> linked layout. Additionally for being complete, the nmi and trap stack
>>> would also need to go in there. The other problem we have is that the
>>> stack sizes and/or their alignment need to be fulfilled.
>>>   - If we put the thread context pointer on top of the stack (low
>>>     address), all stacks need to be exactly sized and aligned to
>>>     STACK_SIZE.
>>>   - If we put this pointer to the stack bottom (high address), we need
>>>     to make sure that the high address is aligned to STACK_SIZE.
>>>     STACK_SIZE becomes automatically the maximum stack size but we could
>>>     have stacks that are smaller sized than this. However, I am not
>>>     sure how we could teach the alignment to the linker.
>>>
>>> Although I do not like the alternative but it may solve our bug right
>>> now: What if we store the current thread pointer to a platform-internal
>>> variable instead of the stack. On every context switch, we update this
>>> variable and get_current() is returning its value. In principle, with
>>> SMP you would need to introduce this variable CPU-wise but we do not
>>> have this now. This way we could keep the stack sizes that we currently
>>> have.
>>> I think we have to revisit this low-level platform API at some point
>>> considering all the lessons we have learned and that longer-term we
>>> should also protect the each stacks from over- and underflows (for
>>> example with an unmapped page before and after each stack). But this
>>> involves a bigger restructuring anyways and this we should not do with
>>> this patch.
>>>
>>> What do you think?
>>>
>>
>> This is something we need to do especially if we want to support
>> different stack sizes in the system. I also had this discussion with
>> Florian a few weeks ago about how this is also necessary in order to
>> avoid some races in the scheduler for which we had to apply a workaround.
>>
>> However, this needs a more complex analysis and for sure it would have a
>> big impact, with potential side effects, for the whole system. I would
>> wait to do that until we'll have a proper regression testing system.
>>
>> This current fix is the simplest one for now and it doesn't bring any
>> side effects. I strongly believe this is the best solution for now.
> 
> Hum. Understood. Yeah, I agree we need to revisit this when we are going
> re-visiting the API and re-architecture the platform interfaces. As far
> as I can see, we are currently using the same technique to save and get
> the current thread context also on the other platforms. Because of
> consistency reasons, it would not be wise to change this to something
> completely different for the KVM platform.
> 
> Could you make sure with a v2 that also the trap stack is properly sized
> and aligned? This implies storing the current thread pointer twice,
> right? I am not really a fan of this but if we want to keep get_current
> in its current form in order to avoid further implication to uksched and
> the other platforms for now, we should at least have a TODO comment
> there that explains our reason. I expect we will have a better suited
> platform API later that will avoid this, so this would be anyways
> temporary.

Alright, I sent a v2 on the list including amendment according to your
comments.

Cheers,
Costin

_______________________________________________
Minios-devel mailing list
Minios-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/minios-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.