Re: [Xen-devel] [PATCH v3 34/47] xen/sched: add fall back to idle vcpu when scheduling unit
On 14.09.2019 10:52, Juergen Gross wrote:
> When scheduling a unit with multiple vcpus there is no guarantee all
> vcpus are available (e.g. above maxvcpus or vcpu offline). Fall back to
> the idle vcpu of the current cpu in that case. This requires storing the
> correct schedule_unit pointer in the idle vcpu as long as it is used as
> fallback vcpu.
>
> In order to modify the runstates of the correct vcpus when switching
> schedule units, merge sched_unit_runstate_change() into
> sched_switch_units() and loop over the affected physical cpus instead
> of the unit's vcpus. This in turn requires an access function to the
> current variable of other cpus.
>
> Today context_saved() is called in case previous and next vcpus differ
> when doing a context switch. With an idle vcpu being capable of being a
> substitute for an offline vcpu this is problematic when switching to
> an idle scheduling unit. An idle previous vcpu leaves us in doubt which
> schedule unit was active previously, so save the previous unit pointer
> in the per-schedule resource area. If it is NULL the unit has not
> changed and we don't have to set the previous unit to be not running.
>
> When running an idle vcpu in a non-idle scheduling unit, use a specific
> guest idle loop not performing any tasklets and livepatching in order
> to avoid populating the cpu caches with memory used by other domains
> (as far as possible). Softirqs are considered to be safe.

As I understand it, "tasklets" here means only the ones run in (idle) vCPU
context, whereas "softirqs" includes tasklets run in softirq context. I
think it would help if the description made this distinction. With this I
then wonder whether the cache related argumentation above still holds:
VT-d's IOMMU fault handling tasklet runs in softirq context, for example,
and hvm_assert_evtchn_irq(), being handed a struct vcpu *, does too. Of
course this could be considered covered by "(as far as possible)" ...

> @@ -172,6 +191,10 @@ void startup_cpu_idle_loop(void)
>
>  static void noreturn continue_idle_domain(struct vcpu *v)
>  {
> +    /* Idle vcpus might be attached to non-idle units! */
> +    if ( !is_idle_domain(v->sched_unit->domain) )
> +        reset_stack_and_jump_nolp(guest_idle_loop);

The construct and comment make me wonder: did you audit all uses of
is_idle_{domain,vcpu}() for being in line with this new usage mode?
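To make the tasklet/softirq distinction concrete, here is a minimal sketch
(my own guess, not the code from this patch) of a guest idle loop along the
lines of the description, presumably living next to idle_loop() in
xen/arch/x86/domain.c so the helpers used are already in scope:

/*
 * Sketch only: idle loop for an idle vcpu standing in for an offline
 * vcpu of a non-idle unit.  Softirqs (including softirq tasklets) are
 * still processed, but neither do_tasklet() nor the livepatch check is
 * run, to avoid touching other domains' memory where possible.
 */
static void noreturn guest_idle_loop(void)
{
    unsigned int cpu = smp_processor_id();

    for ( ; ; )
    {
        /* Deliberately no tasklet_work_to_do() / livepatch handling. */
        if ( !softirq_pending(cpu) )
            pm_idle();      /* assuming the same halt hook as idle_loop() */

        do_softirq();
    }
}

Whether the real implementation halts the CPU this way is one of the
details I had to guess at here.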
> @@ -1767,33 +1774,66 @@ static void sched_switch_units(struct sched_resource *sd,
>                                 struct sched_unit *next, struct sched_unit *prev,
>                                 s_time_t now)
>  {
> -    sd->curr = next;
> -
> -    TRACE_3D(TRC_SCHED_SWITCH_INFPREV, prev->domain->domain_id, prev->unit_id,
> -             now - prev->state_entry_time);
> -    TRACE_4D(TRC_SCHED_SWITCH_INFNEXT, next->domain->domain_id, next->unit_id,
> -             (next->vcpu_list->runstate.state == RUNSTATE_runnable) ?
> -             (now - next->state_entry_time) : 0, prev->next_time);
> +    int cpu;

unsigned int?

>      ASSERT(unit_running(prev));
>
> -    TRACE_4D(TRC_SCHED_SWITCH, prev->domain->domain_id, prev->unit_id,
> -             next->domain->domain_id, next->unit_id);
> +    if ( prev != next )
> +    {
> +        sd->curr = next;
> +        sd->prev = prev;
>
> -    sched_unit_runstate_change(prev, false, now);
> +        TRACE_3D(TRC_SCHED_SWITCH_INFPREV, prev->domain->domain_id,
> +                 prev->unit_id, now - prev->state_entry_time);
> +        TRACE_4D(TRC_SCHED_SWITCH_INFNEXT, next->domain->domain_id,
> +                 next->unit_id,
> +                 (next->vcpu_list->runstate.state == RUNSTATE_runnable) ?
> +                 (now - next->state_entry_time) : 0, prev->next_time);
> +        TRACE_4D(TRC_SCHED_SWITCH, prev->domain->domain_id, prev->unit_id,
> +                 next->domain->domain_id, next->unit_id);
>
> -    ASSERT(!unit_running(next));
> -    sched_unit_runstate_change(next, true, now);
> +        ASSERT(!unit_running(next));
>
> -    /*
> -     * NB. Don't add any trace records from here until the actual context
> -     * switch, else lost_records resume will not work properly.
> -     */
> +        /*
> +         * NB. Don't add any trace records from here until the actual context
> +         * switch, else lost_records resume will not work properly.
> +         */
> +
> +        ASSERT(!next->is_running);
> +        next->is_running = 1;
> +        next->state_entry_time = now;
> +
> +        if ( is_idle_unit(prev) )
> +        {
> +            prev->runstate_cnt[RUNSTATE_running] = 0;
> +            prev->runstate_cnt[RUNSTATE_runnable] = sched_granularity;
> +        }
> +        if ( is_idle_unit(next) )
> +        {
> +            next->runstate_cnt[RUNSTATE_running] = sched_granularity;
> +            next->runstate_cnt[RUNSTATE_runnable] = 0;
> +        }

Is this correct when some of the vCPU-s are substitute idle ones?

> @@ -1849,29 +1889,39 @@ static struct sched_unit *do_schedule(struct sched_unit *prev, s_time_t now,
>      if ( prev->next_time >= 0 ) /* -ve means no limit */
>          set_timer(&sd->s_timer, now + prev->next_time);
>
> -    if ( likely(prev != next) )
> -        sched_switch_units(sd, next, prev, now);
> +    sched_switch_units(sd, next, prev, now);
>
>      return next;
>  }
>
> -static void context_saved(struct vcpu *prev)
> +static void vcpu_context_saved(struct vcpu *vprev, struct vcpu *vnext)
>  {
> -    struct sched_unit *unit = prev->sched_unit;
> -
>      /* Clear running flag /after/ writing context to memory. */
>      smp_wmb();
>
> -    prev->is_running = 0;
> +    if ( vprev != vnext )
> +        vprev->is_running = 0;
> +}

With this, "vnext" could be const qualified, it seems. And "false" instead
of "0" perhaps, as you touch this anyway? (A sketch of what I mean follows
at the end of this mail.)

> @@ -2015,7 +2079,8 @@ static void sched_slave(void)
>
>      pcpu_schedule_unlock_irq(lock, cpu);
>
> -    sched_context_switch(vprev, sched_unit2vcpu_cpu(next, cpu), now);
> +    sched_context_switch(vprev, sched_unit2vcpu_cpu(next, cpu),
> +                         is_idle_unit(next) && !is_idle_unit(prev), now);
>  }
>
>  /*
> @@ -2075,7 +2140,8 @@ static void schedule(void)
>      pcpu_schedule_unlock_irq(lock, cpu);
>
>      vnext = sched_unit2vcpu_cpu(next, cpu);
> -    sched_context_switch(vprev, vnext, now);
> +    sched_context_switch(vprev, vnext,
> +                         !is_idle_unit(prev) && is_idle_unit(next), now);
>  }

As a minor remark, I think between such constructs it would be good if
there was no difference, unless there's a reason to have one. Yet if there
is a reason, it surely would want to be spelled out.

> @@ -124,16 +129,22 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
>  # define CHECK_FOR_LIVEPATCH_WORK ""
>  #endif
>
> -#define reset_stack_and_jump(__fn)                                      \
> +#define switch_stack_and_jump(fn, instr)                                \

Is there any dependency on "instr" to just be a single instruction? I'm
inclined to ask for it to be named "extra" or some such.

> --- a/xen/include/asm-x86/smp.h
> +++ b/xen/include/asm-x86/smp.h
> @@ -76,6 +76,9 @@ void set_nr_sockets(void);
>  /* Representing HT and core siblings in each socket. */
>  extern cpumask_t **socket_cpumask;
>
> +#define get_cpu_current(cpu) \
> +    (get_cpu_info_from_stack((unsigned long)stack_base[cpu])->current_vcpu)

I don't think this can go without a comment clarifying under what (pretty
narrow, I think) conditions this is legitimate to use.
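For get_cpu_current() the kind of comment I'm after would be along these
lines; the conditions stated are only my guess and would need to be
confirmed (and worded) by you:

/*
 * Peek at another CPU's "current" vcpu via the cpu_info block at the
 * bottom of that CPU's stack.  (Example wording only.)  Presumably this
 * is legitimate only while the remote CPU cannot update "current"
 * underneath us, e.g. while both CPUs take part in the same
 * scheduling-unit switch under the schedule lock.
 */
#define get_cpu_current(cpu) \
    (get_cpu_info_from_stack((unsigned long)stack_base[cpu])->current_vcpu)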
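And coming back to vcpu_context_saved() further up, purely as an
illustration of the const / "false" remark (my wording, not a demand):

static void vcpu_context_saved(struct vcpu *vprev, const struct vcpu *vnext)
{
    /* Clear running flag /after/ writing context to memory. */
    smp_wmb();

    if ( vprev != vnext )
        vprev->is_running = false;
}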
Jan