[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [Xen-devel] [PATCH v3 34/47] xen/sched: add fall back to idle vcpu when scheduling unit
On 24.09.19 12:53, Jan Beulich wrote: On 14.09.2019 10:52, Juergen Gross wrote:When scheduling an unit with multiple vcpus there is no guarantee all vcpus are available (e.g. above maxvcpus or vcpu offline). Fall back to idle vcpu of the current cpu in that case. This requires to store the correct schedule_unit pointer in the idle vcpu as long as it used as fallback vcpu. In order to modify the runstates of the correct vcpus when switching schedule units merge sched_unit_runstate_change() into sched_switch_units() and loop over the affected physical cpus instead of the unit's vcpus. This in turn requires an access function to the current variable of other cpus. Today context_saved() is called in case previous and next vcpus differ when doing a context switch. With an idle vcpu being capable to be a substitute for an offline vcpu this is problematic when switching to an idle scheduling unit. An idle previous vcpu leaves us in doubt which schedule unit was active previously, so save the previous unit pointer in the per-schedule resource area. If it is NULL the unit has not changed and we don't have to set the previous unit to be not running. When running an idle vcpu in a non-idle scheduling unit use a specific guest idle loop not performing any tasklets and livepatching in order to avoid populating the cpu caches with memory used by other domains (as far as possible). Softirqs are considered to be save.Aiui "tasklets" here means only ones run in (idle) vCPU context, whereas "softirqs" includes tasklets run in softirq context. I think it would help if the description made this distinction. With this I then wonder whether the cache related argumentation above still holds: VT-d's IOMMU fault handling tasklet runs in softirq context, for example, and hvm_assert_evtchn_irq(), being handed a struct vcpu *, does too. Of course this could be considered covered by "(as far as possible)" ... I'll write "... not performing any non-softirq tasklets ...". @@ -172,6 +191,10 @@ void startup_cpu_idle_loop(void)static void noreturn continue_idle_domain(struct vcpu *v){ + /* Idle vcpus might be attached to non-idle units! */ + if ( !is_idle_domain(v->sched_unit->domain) ) + reset_stack_and_jump_nolp(guest_idle_loop);The construct and comment make me wonder - did you audit all uses of is_idle_{domain,vcpu}() for being in line with this new usage mode? Yes. @@ -1767,33 +1774,66 @@ static void sched_switch_units(struct sched_resource *sd, struct sched_unit *next, struct sched_unit *prev, s_time_t now) { - sd->curr = next; - - TRACE_3D(TRC_SCHED_SWITCH_INFPREV, prev->domain->domain_id, prev->unit_id, - now - prev->state_entry_time); - TRACE_4D(TRC_SCHED_SWITCH_INFNEXT, next->domain->domain_id, next->unit_id, - (next->vcpu_list->runstate.state == RUNSTATE_runnable) ? - (now - next->state_entry_time) : 0, prev->next_time); + int cpu;unsigned int? Okay. ASSERT(unit_running(prev));- TRACE_4D(TRC_SCHED_SWITCH, prev->domain->domain_id, prev->unit_id,- next->domain->domain_id, next->unit_id); + if ( prev != next ) + { + sd->curr = next; + sd->prev = prev;- sched_unit_runstate_change(prev, false, now);+ TRACE_3D(TRC_SCHED_SWITCH_INFPREV, prev->domain->domain_id, + prev->unit_id, now - prev->state_entry_time); + TRACE_4D(TRC_SCHED_SWITCH_INFNEXT, next->domain->domain_id, + next->unit_id, + (next->vcpu_list->runstate.state == RUNSTATE_runnable) ? + (now - next->state_entry_time) : 0, prev->next_time); + TRACE_4D(TRC_SCHED_SWITCH, prev->domain->domain_id, prev->unit_id, + next->domain->domain_id, next->unit_id);- ASSERT(!unit_running(next));- sched_unit_runstate_change(next, true, now); + ASSERT(!unit_running(next));- /*- * NB. Don't add any trace records from here until the actual context - * switch, else lost_records resume will not work properly. - */ + /* + * NB. Don't add any trace records from here until the actual context + * switch, else lost_records resume will not work properly. + */ + + ASSERT(!next->is_running); + next->is_running = 1; + next->state_entry_time = now; + + if ( is_idle_unit(prev) ) + { + prev->runstate_cnt[RUNSTATE_running] = 0; + prev->runstate_cnt[RUNSTATE_runnable] = sched_granularity; + } + if ( is_idle_unit(next) ) + { + next->runstate_cnt[RUNSTATE_running] = sched_granularity; + next->runstate_cnt[RUNSTATE_runnable] = 0; + }Is this correct when some of the vCPU-s a substitute idle ones? Yes, as this affects idle units only. An idle vcpu running as a substitute will _not_ result in the related idle unit runstate_cnt to be updated. @@ -1849,29 +1889,39 @@ static struct sched_unit *do_schedule(struct sched_unit *prev, s_time_t now, if ( prev->next_time >= 0 ) /* -ve means no limit */ set_timer(&sd->s_timer, now + prev->next_time);- if ( likely(prev != next) )- sched_switch_units(sd, next, prev, now); + sched_switch_units(sd, next, prev, now);return next;}-static void context_saved(struct vcpu *prev)+static void vcpu_context_saved(struct vcpu *vprev, struct vcpu *vnext) { - struct sched_unit *unit = prev->sched_unit; - /* Clear running flag /after/ writing context to memory. */ smp_wmb();- prev->is_running = 0;+ if ( vprev != vnext ) + vprev->is_running = 0; +}With this "vnext" could be const qualified as it seems. And "false" instead of "0" perhaps, as you touch this anyway? Okay. @@ -2015,7 +2079,8 @@ static void sched_slave(void)pcpu_schedule_unlock_irq(lock, cpu); - sched_context_switch(vprev, sched_unit2vcpu_cpu(next, cpu), now);+ sched_context_switch(vprev, sched_unit2vcpu_cpu(next, cpu), + is_idle_unit(next) && !is_idle_unit(prev), now); }/*@@ -2075,7 +2140,8 @@ static void schedule(void) pcpu_schedule_unlock_irq(lock, cpu);vnext = sched_unit2vcpu_cpu(next, cpu);- sched_context_switch(vprev, vnext, now); + sched_context_switch(vprev, vnext, + !is_idle_unit(prev) && is_idle_unit(next), now); }As a minor remark, I think between such constructs it would be good if there was no difference, unless there's a reason to have one. Yet if there was a reason, it surely would want to be spelled out. I guess you mean changing the parameters of sched_context_switch()? I can do that. @@ -124,16 +129,22 @@ unsigned long get_stack_dump_bottom (unsigned long sp); # define CHECK_FOR_LIVEPATCH_WORK "" #endif-#define reset_stack_and_jump(__fn) \+#define switch_stack_and_jump(fn, instr) \Is there any dependency on "instr" to just be a single instruction? I'm inclined to ask for it to be named "extra" or some such. Fine with me. --- a/xen/include/asm-x86/smp.h +++ b/xen/include/asm-x86/smp.h @@ -76,6 +76,9 @@ void set_nr_sockets(void); /* Representing HT and core siblings in each socket. */ extern cpumask_t **socket_cpumask;+#define get_cpu_current(cpu) \+ (get_cpu_info_from_stack((unsigned long)stack_base[cpu])->current_vcpu)I don't think this can go without a comment clarifying under what (pretty narrow I think) conditions this is legitimate to use. Okay. I'll add a comment like: "to be used only while no context switch can occur on the cpu". Juergen _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxxxxxxxxx https://lists.xenproject.org/mailman/listinfo/xen-devel
|
Lists.xenproject.org is hosted with RackSpace, monitoring our |