Re: [Xen-devel] [ARM] Native application design and discussion (I hope)
Hi Julien,

On 25 April 2017 at 14:43, Julien Grall <julien.grall@xxxxxxx> wrote:
>>>>>> We will also need another type of application: one which is
>>>>>> periodically called by XEN itself, not actually servicing any domain
>>>>>> request. This is needed for a coprocessor sharing framework scheduler
>>>>>> implementation.
>>>>>
>>>>> EL0 apps can be a powerful new tool for us to use, but they are not
>>>>> the solution to everything. This is where I would draw the line: if
>>>>> the workload needs to be scheduled periodically, then it is not a
>>>>> good fit for an EL0 app.
>>>>
>>>> From my last conversation with Volodymyr I've got the feeling that the
>>>> notions "EL0" and "XEN native application" must be pretty orthogonal.
>>>> In [1] Volodymyr got no performance gain from changing the domain's
>>>> exception level from EL1 to EL0. Only when Volodymyr stripped the
>>>> domain's context abstraction (i.e. dropped the GIC context
>>>> store/restore) were some noticeable results reached.
>>>
>>> Do you have numbers for the parts that take time in the save/restore?
>>> You mention the GIC and I am a bit surprised you don't mention the FPU.
>>
>> I did that in the other thread. Check out [1]. The biggest speed-up I
>> got was after removing the vGIC context handling.
>
> Oh, yes. Sorry I forgot this thread. Continuing on that, you said that
> "Now profiler shows that hypervisor spends time in spinlocks and p2m
> code."
>
> Could you expand here? How would the EL0 app spend time in p2m code?

I don't quite remember. It was somewhere around the p2m save/restore
context functions. I'll try to restore that setup and will provide more
details.

> Similarly, why do spinlocks take time? Are they contended?

The problem is that my profiler does not show the stack, so I can't say
which spinlock causes this. But the profiler didn't show the CPU spending
much time in the spinlock wait loop, so it looks like there is no
contention.

>>> I would have a look at optimizing the context switch path. Some ideas:
>>>     - there are a lot of unnecessary isb/dsb. The registers used only
>>>       by the guests will be synchronized by eret.
>>
>> I have removed (almost) all of them. No significant changes in latency.
>>
>>>     - the FPU takes time to save/restore, you could make it lazy
>>
>> This also does not take much time.
>>
>>>     - It might be possible to limit the number of LRs saved/restored
>>>       depending on the number of LRs used by a domain.
>>
>> Excuse me, what is an LR in this context?
>
> Sorry, I meant GIC LRs (see the GIC save/restore code). They are used to
> list the interrupts injected to the guest. Not all of them may be in use
> at the time of the context switch.

As I said, I don't call the GIC save and restore routines, so that should
not be an issue (if I got that right).

>> You can take a look at my context switch routines at [2].
>
> I had a quick look and I am not sure which context switch you exactly
> used, as you split it into 2 helpers but also modify the current one.
>
> Could you briefly describe the context switch you do for the EL0 app
> here?

As I said, I tried to reuse all the existing services. My PoC hosts the
app in a separate domain, and this domain also has its own vCPU. So, at
first I used the plain old ctxt_switch_from()/ctxt_switch_to() pair from
domain.c. As you know, those two functions save/restore almost all of the
vCPU state except pc, sp, lr and the other general purpose registers.
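(For reference, the rough shape of that pair, from memory and heavily
abridged; the real domain.c code does quite a bit more, this just shows
which state is handled there:)

    /* Heavily abridged sketch of the domain.c helpers, from memory; the
     * real functions also handle timers, 32-bit state, more system
     * registers and a number of barriers. */
    static void ctxt_switch_from(struct vcpu *p)
    {
        p2m_save_state(p);       /* stage-2 MMU state */
        /* ... CP15/system registers, arch timer ... */
        vfp_save_state(p);       /* FPU/SIMD registers */
        gic_save_state(p);       /* GIC CPU interface and LRs */
    }

    static void ctxt_switch_to(struct vcpu *n)
    {
        p2m_restore_state(n);
        /* ... CP15/system registers, arch timer ... */
        gic_restore_state(n);
        vfp_restore_state(n);
    }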
The remaining context is saved/restored in entry.S. I just made
v->arch.cpu_info->guest_cpu_user_regs.pc point to the app entry point and
changed the saved cpsr to switch right into EL0. Then I copied
ctxt_switch_from()/ctxt_switch_to() to
ctxt_switch_from_partial()/ctxt_switch_to_partial() and began removing all
the unneeded code (dsb()s/isb()s, GIC context handling, etc.).

So, the overall flow is the following:

0. If it is the first call, I create a 1:1 VM mapping and program the
   ttbr0, ttbcr and mair registers of the app vCPU.
1. I pause the calling vCPU.
2. I program the saved pc of the app vCPU to point to the app entry point,
   sp to point to the top of a stack, and cpsr to enter EL0 mode.
3. I call ctxt_switch_from_partial() to save the context of the calling
   vCPU.
4. I enable the TGE bit.
5. I call ctxt_switch_to_partial() to restore the context of the app vCPU.
6. I call __save_context() to save the rest of the context of the calling
   vCPU (pc, sp, lr, r0-r31).
7. I invoke switch_stack_and_jump() to restore the rest of the context of
   the app vCPU.
8. Now I'm in the EL0 app. Hooray! The app does something, invokes
   syscalls (which are handled in the hypervisor) and so on.
9. The app invokes a syscall named app_exit().
10. I use ctxt_switch_from_partial() to save the app state (actually this
    is not needed, I think).
11. I use ctxt_switch_to_partial() to restore the calling vCPU's state.
12. I unpause the calling vCPU and drop the TGE bit.
13. I call __restore_context() to restore pc, lr and friends. At this
    point the code jumps back to step 6 (because I saved pc there), but it
    checks a flag variable and sees that this is actually an exit from the
    app...
14. ... so it exits back to the calling domain.
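In (very rough) C-like form the flow looks something like the sketch below.
This is NOT the actual PoC code: ctxt_switch_*_partial(), __save_context(),
__restore_context() and app_exit() are the helpers named above, but their
argument lists and everything else here (app_entry_point, app_stack_top,
the mapping/flag helpers, enter_el0_app()) are made up for illustration
only.

    /* "caller" is the vCPU that invoked the app, "app" is the vCPU of the
     * app domain. */
    static void el0_app_call(struct vcpu *caller, struct vcpu *app)
    {
        if ( !app_mappings_ready(app) )
            app_setup_mappings(app);      /* 0: 1:1 mapping, ttbr0/ttbcr/mair */

        vcpu_pause(caller);               /* 1 */

        /* 2: make the app vCPU start at its entry point, in EL0, on its own
         *    stack (field names approximate) */
        app->arch.cpu_info->guest_cpu_user_regs.pc   = app_entry_point;
        app->arch.cpu_info->guest_cpu_user_regs.sp   = app_stack_top;
        app->arch.cpu_info->guest_cpu_user_regs.cpsr = PSR_MODE_EL0t;

        ctxt_switch_from_partial(caller); /* 3: save caller's "heavy" state */
        set_hcr_tge();                    /* 4: set HCR_EL2.TGE */
        ctxt_switch_to_partial(app);      /* 5: load the app's "heavy" state */

        __save_context(caller);           /* 6: pc/sp/lr/GPRs; execution also
                                           *    comes back here on app exit */
        if ( !returning_from_app(caller) )
            /* 7-8: switch_stack_and_jump() onto the app's registers;
             *      after this we are running inside the EL0 app */
            enter_el0_app(app);

        /* 14: the flag says the app has exited: return to the calling domain */
    }

    /* 9-13: the app_exit() "syscall" is handled in the hypervisor roughly so */
    static void el0_app_return(struct vcpu *caller, struct vcpu *app)
    {
        ctxt_switch_from_partial(app);    /* 10: probably not even needed */
        ctxt_switch_to_partial(caller);   /* 11 */
        clear_hcr_tge();                  /* 12: drop HCR_EL2.TGE ...        */
        vcpu_unpause(caller);             /*     ... and unpause the caller  */
        __restore_context(caller);        /* 13: jumps back to step 6 above  */
    }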