|
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [Xen-devel] [PATCH v3 34/47] xen/sched: add fall back to idle vcpu when scheduling unit
On 14.09.2019 10:52, Juergen Gross wrote:
> When scheduling an unit with multiple vcpus there is no guarantee all
> vcpus are available (e.g. above maxvcpus or vcpu offline). Fall back to
> idle vcpu of the current cpu in that case. This requires to store the
> correct schedule_unit pointer in the idle vcpu as long as it used as
> fallback vcpu.
>
> In order to modify the runstates of the correct vcpus when switching
> schedule units merge sched_unit_runstate_change() into
> sched_switch_units() and loop over the affected physical cpus instead
> of the unit's vcpus. This in turn requires an access function to the
> current variable of other cpus.
>
> Today context_saved() is called in case previous and next vcpus differ
> when doing a context switch. With an idle vcpu being capable to be a
> substitute for an offline vcpu this is problematic when switching to
> an idle scheduling unit. An idle previous vcpu leaves us in doubt which
> schedule unit was active previously, so save the previous unit pointer
> in the per-schedule resource area. If it is NULL the unit has not
> changed and we don't have to set the previous unit to be not running.
>
> When running an idle vcpu in a non-idle scheduling unit use a specific
> guest idle loop not performing any tasklets and livepatching in order
> to avoid populating the cpu caches with memory used by other domains
> (as far as possible). Softirqs are considered to be save.
Aiui "tasklets" here means only ones run in (idle) vCPU context, whereas
"softirqs" includes tasklets run in softirq context. I think it would
help if the description made this distinction. With this I then wonder
whether the cache related argumentation above still holds: VT-d's IOMMU
fault handling tasklet runs in softirq context, for example, and
hvm_assert_evtchn_irq(), being handed a struct vcpu *, does too. Of
course this could be considered covered by "(as far as possible)" ...
> @@ -172,6 +191,10 @@ void startup_cpu_idle_loop(void)
>
> static void noreturn continue_idle_domain(struct vcpu *v)
> {
> + /* Idle vcpus might be attached to non-idle units! */
> + if ( !is_idle_domain(v->sched_unit->domain) )
> + reset_stack_and_jump_nolp(guest_idle_loop);
The construct and comment make me wonder - did you audit all uses
of is_idle_{domain,vcpu}() for being in line with this new usage
mode?
> @@ -1767,33 +1774,66 @@ static void sched_switch_units(struct sched_resource
> *sd,
> struct sched_unit *next, struct sched_unit
> *prev,
> s_time_t now)
> {
> - sd->curr = next;
> -
> - TRACE_3D(TRC_SCHED_SWITCH_INFPREV, prev->domain->domain_id,
> prev->unit_id,
> - now - prev->state_entry_time);
> - TRACE_4D(TRC_SCHED_SWITCH_INFNEXT, next->domain->domain_id,
> next->unit_id,
> - (next->vcpu_list->runstate.state == RUNSTATE_runnable) ?
> - (now - next->state_entry_time) : 0, prev->next_time);
> + int cpu;
unsigned int?
> ASSERT(unit_running(prev));
>
> - TRACE_4D(TRC_SCHED_SWITCH, prev->domain->domain_id, prev->unit_id,
> - next->domain->domain_id, next->unit_id);
> + if ( prev != next )
> + {
> + sd->curr = next;
> + sd->prev = prev;
>
> - sched_unit_runstate_change(prev, false, now);
> + TRACE_3D(TRC_SCHED_SWITCH_INFPREV, prev->domain->domain_id,
> + prev->unit_id, now - prev->state_entry_time);
> + TRACE_4D(TRC_SCHED_SWITCH_INFNEXT, next->domain->domain_id,
> + next->unit_id,
> + (next->vcpu_list->runstate.state == RUNSTATE_runnable) ?
> + (now - next->state_entry_time) : 0, prev->next_time);
> + TRACE_4D(TRC_SCHED_SWITCH, prev->domain->domain_id, prev->unit_id,
> + next->domain->domain_id, next->unit_id);
>
> - ASSERT(!unit_running(next));
> - sched_unit_runstate_change(next, true, now);
> + ASSERT(!unit_running(next));
>
> - /*
> - * NB. Don't add any trace records from here until the actual context
> - * switch, else lost_records resume will not work properly.
> - */
> + /*
> + * NB. Don't add any trace records from here until the actual context
> + * switch, else lost_records resume will not work properly.
> + */
> +
> + ASSERT(!next->is_running);
> + next->is_running = 1;
> + next->state_entry_time = now;
> +
> + if ( is_idle_unit(prev) )
> + {
> + prev->runstate_cnt[RUNSTATE_running] = 0;
> + prev->runstate_cnt[RUNSTATE_runnable] = sched_granularity;
> + }
> + if ( is_idle_unit(next) )
> + {
> + next->runstate_cnt[RUNSTATE_running] = sched_granularity;
> + next->runstate_cnt[RUNSTATE_runnable] = 0;
> + }
Is this correct when some of the vCPU-s a substitute idle ones?
> @@ -1849,29 +1889,39 @@ static struct sched_unit *do_schedule(struct
> sched_unit *prev, s_time_t now,
> if ( prev->next_time >= 0 ) /* -ve means no limit */
> set_timer(&sd->s_timer, now + prev->next_time);
>
> - if ( likely(prev != next) )
> - sched_switch_units(sd, next, prev, now);
> + sched_switch_units(sd, next, prev, now);
>
> return next;
> }
>
> -static void context_saved(struct vcpu *prev)
> +static void vcpu_context_saved(struct vcpu *vprev, struct vcpu *vnext)
> {
> - struct sched_unit *unit = prev->sched_unit;
> -
> /* Clear running flag /after/ writing context to memory. */
> smp_wmb();
>
> - prev->is_running = 0;
> + if ( vprev != vnext )
> + vprev->is_running = 0;
> +}
With this "vnext" could be const qualified as it seems. And "false"
instead of "0" perhaps, as you touch this anyway?
> @@ -2015,7 +2079,8 @@ static void sched_slave(void)
>
> pcpu_schedule_unlock_irq(lock, cpu);
>
> - sched_context_switch(vprev, sched_unit2vcpu_cpu(next, cpu), now);
> + sched_context_switch(vprev, sched_unit2vcpu_cpu(next, cpu),
> + is_idle_unit(next) && !is_idle_unit(prev), now);
> }
>
> /*
> @@ -2075,7 +2140,8 @@ static void schedule(void)
> pcpu_schedule_unlock_irq(lock, cpu);
>
> vnext = sched_unit2vcpu_cpu(next, cpu);
> - sched_context_switch(vprev, vnext, now);
> + sched_context_switch(vprev, vnext,
> + !is_idle_unit(prev) && is_idle_unit(next), now);
> }
As a minor remark, I think between such constructs it would be good
if there was no difference, unless there's a reason to have one. Yet
if there was a reason, it surely would want to be spelled out.
> @@ -124,16 +129,22 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
> # define CHECK_FOR_LIVEPATCH_WORK ""
> #endif
>
> -#define reset_stack_and_jump(__fn) \
> +#define switch_stack_and_jump(fn, instr) \
Is there any dependency on "instr" to just be a single instruction?
I'm inclined to ask for it to be named "extra" or some such.
> --- a/xen/include/asm-x86/smp.h
> +++ b/xen/include/asm-x86/smp.h
> @@ -76,6 +76,9 @@ void set_nr_sockets(void);
> /* Representing HT and core siblings in each socket. */
> extern cpumask_t **socket_cpumask;
>
> +#define get_cpu_current(cpu) \
> + (get_cpu_info_from_stack((unsigned long)stack_base[cpu])->current_vcpu)
I don't think this can go without a comment clarifying under what
(pretty narrow I think) conditions this is legitimate to use.
Jan
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel
|
![]() |
Lists.xenproject.org is hosted with RackSpace, monitoring our |