Xen project Mailing List

Re: [Xen-devel] [PATCH V2 1/1] Improved RTDS scheduler

To: Tianyang Chen <tiche@xxxxxxxxxxxxxx>

Date: Tue, 19 Jan 2016 11:53:03 -0500

Cc: Dario Faggioli <dario.faggioli@xxxxxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxx>, Meng Xu <mengxu@xxxxxxxxxxxxx>, Dagaen Golomb <dgolomb@xxxxxxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxx>

Delivery-date: Tue, 19 Jan 2016 16:53:15 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

On Tue, Jan 19, 2016 at 11:35 AM, Tianyang Chen <tiche@xxxxxxxxxxxxxx> wrote: > > > > On 1/15/2016 10:33 AM, Meng Xu wrote: >> >> On Wed, Jan 6, 2016 at 2:57 AM, Tianyang Chen <tiche@xxxxxxxxxxxxxx> wrote: >>> >>> >>> >>> >>> On 12/31/2015 8:44 AM, Meng Xu wrote: >>>> >>>> >>>> On Thu, Dec 31, 2015 at 6:20 PM, Tianyang Chen <tiche@xxxxxxxxxxxxxx> >>>> wrote: >>>>> >>>>> >>>>> >>>>> Budget replenishment is now handled by a dedicated timer which is >>>>> triggered at the most imminent release time of all runnable vcpus. >>>>> >>>>> Changes since V1: >>>>> None >>>>> >>>>> Signed-off-by: Tianyang Chen <tiche@xxxxxxxxxxxxxx> >>>>> Signed-off-by: Meng Xu <mengxu@xxxxxxxxxxxxx> >>>>> Signed-off-by: Dagaen Golomb <dgolomb@xxxxxxxxxxxxxx> >>>>> --- >>>>> xen/common/sched_rt.c | 159 >>>>> +++++++++++++++++++++++++++++-------------------- >>>>> 1 file changed, 95 insertions(+), 64 deletions(-) >>>>> >>>>> diff --git a/xen/common/sched_rt.c b/xen/common/sched_rt.c >>>>> index 4372486..d522272 100644 >>>>> --- a/xen/common/sched_rt.c >>>>> +++ b/xen/common/sched_rt.c >>>>> @@ -16,6 +16,7 @@ >>>>> #include <xen/delay.h> >>>>> #include <xen/event.h> >>>>> #include <xen/time.h> >>>>> +#include <xen/timer.h> >>>>> #include <xen/perfc.h> >>>>> #include <xen/sched-if.h> >>>>> #include <xen/softirq.h> >>>>> @@ -147,6 +148,16 @@ static unsigned int nr_rt_ops; >>>>> * Global lock is referenced by schedule_data.schedule_lock from all >>>>> * physical cpus. It can be grabbed via vcpu_schedule_lock_irq() >>>>> */ >>>>> + >>>>> +/* dedicated timer for replenishment */ >>>>> +static struct timer repl_timer; >>>>> + >>>>> +/* controls when to first start the timer*/ >>>> >>>> >>>> >>>> >>>> missing a space between * and / at the end of this line. >>>> >>>> >>>>> >>>>> +static int timer_started; >>>>> + >>>>> +/* handler for the replenishment timer */ >>>>> +static void repl_handler(void *data); >>>>> + >>>>> struct rt_private { >>>>> spinlock_t lock; /* the global coarse grand lock */ >>>>> struct list_head sdom; /* list of availalbe domains, used for >>>>> dump */ >>>>> @@ -426,6 +437,7 @@ __runq_insert(const struct scheduler *ops, struct >>>>> rt_vcpu *svc) >>>>> static int >>>>> rt_init(struct scheduler *ops) >>>>> { >>>>> + const int cpu = smp_processor_id(); >>>> >>>> >>>> >>>> >>>> Did the "cpu" be used? If not, please remove it here. >>>> >>>> >>>>> >>>>> struct rt_private *prv = xzalloc(struct rt_private); >>>>> >>>>> printk("Initializing RTDS scheduler\n" >>>>> @@ -454,6 +466,8 @@ rt_init(struct scheduler *ops) >>>>> >>>>> ops->sched_data = prv; >>>>> >>>>> + init_timer(&repl_timer, repl_handler, ops, 0); >>>>> + >>>>> return 0; >>>>> >>>>> no_mem: >>>>> @@ -473,6 +487,9 @@ rt_deinit(const struct scheduler *ops) >>>>> xfree(_cpumask_scratch); >>>>> _cpumask_scratch = NULL; >>>>> } >>>>> + >>>>> + kill_timer(&repl_timer); >>>>> + >>>>> xfree(prv); >>>>> } >>>>> >>>>> @@ -635,6 +652,13 @@ rt_vcpu_insert(const struct scheduler *ops, struct >>>>> vcpu *vc) >>>>> >>>>> /* add rt_vcpu svc to scheduler-specific vcpu list of the dom */ >>>>> list_add_tail(&svc->sdom_elem, &svc->sdom->vcpu); >>>>> + >>>>> + if(!timer_started) >>>> >>>> >>>> >>>> >>>> space should be added.. >>> >>> >>> >>> Thanks, Meng. I will fix that. >>> >>> >>>> >>>> >>>>> >>>>> + { >>>>> + /* the first vcpu starts the timer for the first time*/ >>>>> + timer_started = 1; >>>>> + set_timer(&repl_timer,svc->cur_deadline); >>>>> + } >>>>> } >>>>> >>>>> /* >>>>> @@ -792,44 +816,6 @@ __runq_pick(const struct scheduler *ops, const >>>>> cpumask_t *mask) >>>>> } >>>>> >>>>> /* >>>>> - * Update vcpu's budget and >>>>> - * sort runq by insert the modifed vcpu back to runq >>>>> - * lock is grabbed before calling this function >>>>> - */ >>>>> -static void >>>>> -__repl_update(const struct scheduler *ops, s_time_t now) >>>>> -{ >>>>> - struct list_head *runq = rt_runq(ops); >>>>> - struct list_head *depletedq = rt_depletedq(ops); >>>>> - struct list_head *iter; >>>>> - struct list_head *tmp; >>>>> - struct rt_vcpu *svc = NULL; >>>>> - >>>>> - list_for_each_safe(iter, tmp, runq) >>>>> - { >>>>> - svc = __q_elem(iter); >>>>> - if ( now < svc->cur_deadline ) >>>>> - break; >>>>> - >>>>> - rt_update_deadline(now, svc); >>>>> - /* reinsert the vcpu if its deadline is updated */ >>>>> - __q_remove(svc); >>>>> - __runq_insert(ops, svc); >>>>> - } >>>>> - >>>>> - list_for_each_safe(iter, tmp, depletedq) >>>>> - { >>>>> - svc = __q_elem(iter); >>>>> - if ( now >= svc->cur_deadline ) >>>>> - { >>>>> - rt_update_deadline(now, svc); >>>>> - __q_remove(svc); /* remove from depleted queue */ >>>>> - __runq_insert(ops, svc); /* add to runq */ >>>>> - } >>>>> - } >>>>> -} >>>>> - >>>>> -/* >>>>> * schedule function for rt scheduler. >>>>> * The lock is already grabbed in schedule.c, no need to lock here >>>>> */ >>>>> @@ -848,7 +834,6 @@ rt_schedule(const struct scheduler *ops, s_time_t >>>>> now, bool_t tasklet_work_sched >>>>> /* burn_budget would return for IDLE VCPU */ >>>>> burn_budget(ops, scurr, now); >>>>> >>>>> - __repl_update(ops, now); >>>>> >>>>> if ( tasklet_work_scheduled ) >>>>> { >>>>> @@ -889,7 +874,7 @@ rt_schedule(const struct scheduler *ops, s_time_t >>>>> now, bool_t tasklet_work_sched >>>>> } >>>>> } >>>>> >>>>> - ret.time = MIN(snext->budget, MAX_SCHEDULE); /* sched quantum */ >>>>> + ret.time = snext->budget; /* invoke the scheduler next time */ >>>>> ret.task = snext->vcpu; >>>>> >>>>> /* TRACE */ >>>>> @@ -1033,10 +1018,6 @@ rt_vcpu_wake(const struct scheduler *ops, struct >>>>> vcpu *vc) >>>>> { >>>>> struct rt_vcpu * const svc = rt_vcpu(vc); >>>>> s_time_t now = NOW(); >>>>> - struct rt_private *prv = rt_priv(ops); >>>>> - struct rt_vcpu *snext = NULL; /* highest priority on RunQ */ >>>>> - struct rt_dom *sdom = NULL; >>>>> - cpumask_t *online; >>>>> >>>>> BUG_ON( is_idle_vcpu(vc) ); >>>>> >>>>> @@ -1074,14 +1055,7 @@ rt_vcpu_wake(const struct scheduler *ops, struct >>>>> vcpu *vc) >>>>> /* insert svc to runq/depletedq because svc is not in queue now */ >>>>> __runq_insert(ops, svc); >>>>> >>>>> - __repl_update(ops, now); >>>>> - >>>>> - ASSERT(!list_empty(&prv->sdom)); >>>>> - sdom = list_entry(prv->sdom.next, struct rt_dom, sdom_elem); >>>>> - online = cpupool_scheduler_cpumask(sdom->dom->cpupool); >>>>> - snext = __runq_pick(ops, online); /* pick snext from ALL valid cpus >>>>> */ >>>>> - >>>>> - runq_tickle(ops, snext); >>>>> + runq_tickle(ops, svc); >>>>> >>>>> return; >>>>> } >>>>> @@ -1094,10 +1068,6 @@ static void >>>>> rt_context_saved(const struct scheduler *ops, struct vcpu *vc) >>>>> { >>>>> struct rt_vcpu *svc = rt_vcpu(vc); >>>>> - struct rt_vcpu *snext = NULL; >>>>> - struct rt_dom *sdom = NULL; >>>>> - struct rt_private *prv = rt_priv(ops); >>>>> - cpumask_t *online; >>>>> spinlock_t *lock = vcpu_schedule_lock_irq(vc); >>>>> >>>>> clear_bit(__RTDS_scheduled, &svc->flags); >>>>> @@ -1108,15 +1078,8 @@ rt_context_saved(const struct scheduler *ops, >>>>> struct vcpu *vc) >>>>> if ( test_and_clear_bit(__RTDS_delayed_runq_add, &svc->flags) && >>>>> likely(vcpu_runnable(vc)) ) >>>>> { >>>>> + /* only insert the pre-empted vcpu back*/ >>>>> __runq_insert(ops, svc); >>>>> - __repl_update(ops, NOW()); >>>>> - >>>>> - ASSERT(!list_empty(&prv->sdom)); >>>>> - sdom = list_entry(prv->sdom.next, struct rt_dom, sdom_elem); >>>>> - online = cpupool_scheduler_cpumask(sdom->dom->cpupool); >>>>> - snext = __runq_pick(ops, online); /* pick snext from ALL cpus */ >>>>> - >>>>> - runq_tickle(ops, snext); >>>>> } >>>>> out: >>>>> vcpu_schedule_unlock_irq(lock, vc); >>>>> @@ -1167,6 +1130,74 @@ rt_dom_cntl( >>>>> return rc; >>>>> } >>>>> >>>>> +static void repl_handler(void *data){ >>>>> + unsigned long flags; >>>>> + s_time_t now = NOW(); >>>>> + s_time_t min_repl = LONG_MAX; /* max time used in comparison*/ >>>>> + struct scheduler *ops = data; >>>>> + struct rt_private *prv = rt_priv(ops); >>>>> + struct list_head *runq = rt_runq(ops); >>>>> + struct list_head *depletedq = rt_depletedq(ops); >>>>> + struct list_head *iter; >>>>> + struct list_head *tmp; >>>>> + struct rt_vcpu *svc = NULL; >>>>> + >>>>> + spin_lock_irqsave(&prv->lock,flags); >>>>> + >>>>> + stop_timer(&repl_timer); >>>> >>>> >>>> >>>> >>>> Move the stop_timer before spin_lock_irqsave() since no more than one >>>> core will run this handler in the current design. >>> >>> >>> >>> Right. We want the critical section as short as possible. >>> >>>> >>>>> >>>>> >>>>> >>>>> + >>>>> + list_for_each_safe(iter, tmp, runq) >>>>> + { >>>>> + svc = __q_elem(iter); >>>>> + if ( now < svc->cur_deadline ) >>>>> + break; >>>>> + rt_update_deadline(now, svc); >>>>> + /* scan the runq to find the min release time >>>>> + * this happens when vcpus on runq miss deadline >>>>> + */ >>>>> + if( min_repl> svc->cur_deadline ) >>>>> + { >>>>> + min_repl = svc->cur_deadline; >>>>> + } >>>> >>>> >>>> >>>> The logic here is incorrect. The comment you put is correct, but the >>>> code didn't reflect the comment. >>>> >>>> When the timer's handler is called, now should be equal to min_repl, >>>> shouldn't it? >>>> If now = min_repl, this if statement will never be called because the >>>> loop stops if ( now < svc->cur_deadline ). >>>> >>> >>> Yes the handler will be invoked when it was programmed to but the value of >>> min_repl is set to be LONG_MAX initially. It doesn't preserve values. >>> Therefore, the IF statement is feasible. >> >> >> >> Hmm, let's have a look at the logic... >> >> the min_repl. will only be possibly updated to svc->cur_deadline when >> now >= svc->cur_deadline, right? >> >> In other words, min_repl will only be possibly updated to a time >> point before now. Shouldn't the min_repl be the earliest time *in the >> future* that the handler should be triggered? If not, what is the >> definition/meaning of min_repl here? >> >> At the end of the function, you did reprogram the timer to min_repl. >> /* reprogram the timer using the most imminent release time*/ >> set_timer(&repl_timer, min_repl); >> >> So I think the logic is incorrect here... >> >> You should first update the budget and deadline of each VCPUs in runq, >> depletedq and running VCPUs, and (then) find the earliest time in the >> future when the handler should be triggered. >> > > Yeah. Although the next release time was actually found after a vcpu was > replenished. If the runq still has vcpus pending but not missing a deadline, > the break statement on the first iteration will not check the release time. > > Since the runq is sorted, it's easier to just pick the next one after both > runq and depletedq are updated. > > >>> >>> >>> >>>> Only after you have replenished the vcpus on runq and depletedq, >>>> should you pick the earliest deadline of the front vcpus at the runq >>>> and depletedq as the min_repl. >>>> >>>> Please correct me if I missed anything. :-) >>>> >>>> >>>>> >>>>> + /* reinsert the vcpu if its deadline is updated */ >>>>> + __q_remove(svc); >>>>> + __runq_insert(ops, svc); >>>>> + } >>>>> + >>>>> + list_for_each_safe(iter, tmp, depletedq) >>>>> + { >>>>> + svc = __q_elem(iter); >>>>> + if ( now >= svc->cur_deadline ) >>>>> + { >>>>> + rt_update_deadline(now, svc); >>>>> + >>>>> + /* scan the delp_q to find the minimal release time */ >>>>> + if(min_repl > svc->cur_deadline) >>>>> + { >>>>> + min_repl = svc->cur_deadline; >>>>> + } >>>> >>>> >>>> >>>> >>>> Please see the above comment. It may have the same issue here. >>>> >>>> In addition, the current running VCPUs on cores also need to replenish >>>> their budgets at the beginning of their next periods. Those running >>>> VCPUs are not in runq or depletedq in the current RTDS scheduler. So >>>> this patch won't replenish the current running VCPUs' budget until >>>> they are put back to runq or depletedq, which is incorrect. Think >>>> about the following scenario: a VCPU with very small period has a >>>> task always running on it to keep this vcpu always runnable; This VCPU >>>> has its period equal to its budget. Suppose this VCPU is the only VCPU >>>> on a 4 core machine. This VCPU should keep running on one core and >>>> never be put back to runq. In the current code, this VCPU won't have >>>> its budget replenished. Right? :-) >>>> >>>> You can choose one of the following three approaches to fix it: >>>> a) add the running VCPUs into the runq >>>> b) use another queue to keep track of the current running VCPUs >>>> c) just scan each core (in the cpupool) to get the current running VCPU on >>>> it. >>>> I personally prefer c) but I'm looking forward to others' opinion. :-) >>>> > > I experimented a little bit with adding a runningq to keep track of all > running vcpus. A vcpu will be inserted when it's picked as snext in > rt_schedule(). It will be removed from runningq after the context switch is > finished in context_saved(). Also, a currently running vcpu could be marked > as RUNSTATE_offline in sleep() and added back to runq in wake(). Does this > sound reasonable if we were to use a q to keep track of running vcpus? I think so. The name of that queue can be runningq, IMHO. > > > >>>>> + __q_remove(svc); >>>>> + __runq_insert(ops, svc); >>>>> + runq_tickle(ops, svc); >>>>> + } >>>>> + } >>>>> + >>>>> + /* if timer was triggered but none of the vcpus >>>>> + * need to be refilled, set the timer to be the >>>>> + * default period + now >>>>> + */ >>>> >>>> >>>> >>>> >>>> Hmm, when could this happen? If it could happen, please specify the >>>> situation. >>>> Otherwise, it may be better to print a warning (or cause crash?) when >>>> you set the timer to be default period + now? >>>> >>>> I'm not quite sure which is better, warning or crash? >>>> >>>> >>>>> >>>>> + if(min_repl == LONG_MAX) >>>>> + { >>>>> >>>>> + set_timer(&repl_timer, now + RTDS_DEFAULT_PERIOD); >>>>> + } >>>> >>>> >>>> >>>> >>>> Hmm, >>>> >>>> when could this happen? The only situation in my mind is that there is >>>> no vcpus in the cpupool. But if that is the case, shouldn't the timer >>>> never be triggered? Then this handler will never be called. >>> >>> >>> >>> This is actually still a question for me. When I was testing the code in a >>> virtual machine, it happens when the system boots up. I am guessing it's >>> related to the same problem in nested-virtualization (just guessing). When >>> I tested it on a real machine, it locks up and won't even respond to any >>> key-board events. >> >> >> >> >> First, we care about the real machine scenario. It should always run >> well on real machine. The scheduler design didn't consider the >> nest-virt. situation, so it is ok for now to have performance >> degradation or issues in nest-virt. situation, IMHO. >> >> Second, the reason why it locks up is probably because the timer of >> the replenish handler is not correctly set up as I pointed out above. >> So try to fix the above issue first and see if this still occurs. >> > Right, sometimes there is no vcpus on q at all because when the timer is > programmed to trigger, a running vcpu hasn't been added back to the q yet. In > this situation, correct me if I'm wrong, the repl_handler is called before > the context switch is finished. However, scanning the currently running vcpus > can fix this. Yes. Meng _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx http://lists.xen.org/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.