
Re: [Xen-devel] [PATCH v3 1/4] xen: add real time scheduler rtds



On 09/14/2014 10:37 PM, Meng Xu wrote:
This scheduler follows the Preemptive Global Earliest Deadline First
(EDF) theory from the real-time field.
At any scheduling point, the VCPU with earlier deadline has higher
priority. The scheduler always picks the highest priority VCPU to run on a
feasible PCPU.
A PCPU is feasible if the VCPU can run on it and the PCPU is either
idle or running a lower-priority VCPU.

Each VCPU has a dedicated period and budget.
The deadline of a VCPU is at the end of each period;
A VCPU has its budget replenished at the beginning of each period;
While scheduled, a VCPU burns its budget.
The VCPU needs to finish its budget before its deadline in each period;
The VCPU discards its unused budget at the end of each period.
If a VCPU runs out of budget in a period, it has to wait until next period.

Each VCPU is implemented as a deferrable server.
When a VCPU has a task running on it, its budget is continuously burned;
When a VCPU has no task to run but still has budget left, its budget is preserved.

Queue scheme:
A global runqueue and a global depletedq for each CPU pool.
The runqueue holds all runnable VCPUs with remaining budget, sorted by deadline;
The depletedq holds all VCPUs with no budget left, unsorted.

Note: cpumask and cpupool are supported.

This is an experimental scheduler.

Signed-off-by: Meng Xu <mengxu@xxxxxxxxxxxxx>
Signed-off-by: Sisu Xi <xisisu@xxxxxxxxx>

Getting there, but unfortunately I've got a number of further comments.

Konrad, I think this is very close to being ready -- when is the deadline again, and how hard is it? Would it be better to check it in before the deadline, and then address the things I'm bringing up here? Or would it be better to wait until all the issues are sorted and then check it in (even if it's after the deadline)?

+/*
+ * Debug related code, dump vcpu/cpu information
+ */
+static void
+rt_dump_vcpu(const struct scheduler *ops, const struct rt_vcpu *svc)
+{
+    char cpustr[1024];
+    cpumask_t *cpupool_mask;
+
+    ASSERT(svc != NULL);
+    /* flag vcpu */
+    if( svc->sdom == NULL )
+        return;
+
+    cpumask_scnprintf(cpustr, sizeof(cpustr), svc->vcpu->cpu_hard_affinity);
+    printk("[%5d.%-2u] cpu %u, (%"PRI_stime", %"PRI_stime"),"
+           " cur_b=%"PRI_stime" cur_d=%"PRI_stime" last_start=%"PRI_stime"\n"
+           " \t\t onQ=%d runnable=%d cpu_hard_affinity=%s ",
+            svc->vcpu->domain->domain_id,
+            svc->vcpu->vcpu_id,
+            svc->vcpu->processor,
+            svc->period,
+            svc->budget,
+            svc->cur_budget,
+            svc->cur_deadline,
+            svc->last_start,
+            __vcpu_on_q(svc),
+            vcpu_runnable(svc->vcpu),
+            cpustr);
+    memset(cpustr, 0, sizeof(cpustr));
+    cpupool_mask = cpupool_scheduler_cpumask(svc->vcpu->domain->cpupool);
+    cpumask_scnprintf(cpustr, sizeof(cpustr), cpupool_mask);
+    printk("cpupool=%s\n", cpustr);
+}
+
+static void
+rt_dump_pcpu(const struct scheduler *ops, int cpu)
+{
+    struct rt_vcpu *svc = rt_vcpu(curr_on_cpu(cpu));
+
+    rt_dump_vcpu(ops, svc);
+}


These svc structures are allocated dynamically and may disappear at any time... I think you need the lock to cover this as well.

And since this is called both externally (via ops->dump_cpu_state) and internally (below), you probably need a locking version and a non-locking version (normally you'd make rt_dump_pcpu() a locking "wrapper" around __rt_dump_pcpu()).
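
Something along these lines, for instance (just a sketch, following the pattern the other schedulers use; the __rt_dump_pcpu() name and the exact locking are my suggestion, not in the patch):

static void
__rt_dump_pcpu(const struct scheduler *ops, int cpu)
{
    struct rt_vcpu *svc = rt_vcpu(curr_on_cpu(cpu));

    rt_dump_vcpu(ops, svc);
}

static void
rt_dump_pcpu(const struct scheduler *ops, int cpu)
{
    struct rt_private *prv = rt_priv(ops);
    unsigned long flags;

    spin_lock_irqsave(&prv->lock, flags);
    __rt_dump_pcpu(ops, cpu);
    spin_unlock_irqrestore(&prv->lock, flags);
}

rt_dump() below, which would already hold prv->lock, would then call __rt_dump_pcpu() directly.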

+
+static void
+rt_dump(const struct scheduler *ops)
+{
+    struct list_head *iter_sdom, *iter_svc, *runq, *depletedq, *iter;
+    struct rt_private *prv = rt_priv(ops);
+    struct rt_vcpu *svc;
+    unsigned int cpu = 0;
+    cpumask_t *online;
+    struct rt_dom *sdom;
+    unsigned long flags;
+
+    ASSERT(!list_empty(&prv->sdom));
+    sdom = list_entry(prv->sdom.next, struct rt_dom, sdom_elem);
+    online = cpupool_scheduler_cpumask(sdom->dom->cpupool);
+    runq = rt_runq(ops);
+    depletedq = rt_depletedq(ops);

Same thing with all these -- other CPUs may be modifying prv->sdom.
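
Just as a sketch of what I mean (same names as in the patch), the lock would simply be taken before prv->sdom is first dereferenced, and the per-pcpu dump would use the non-locking variant from above:

    spin_lock_irqsave(&prv->lock, flags);

    ASSERT(!list_empty(&prv->sdom));
    sdom = list_entry(prv->sdom.next, struct rt_dom, sdom_elem);
    online = cpupool_scheduler_cpumask(sdom->dom->cpupool);
    runq = rt_runq(ops);
    depletedq = rt_depletedq(ops);

    printk("PCPU info:\n");
    for_each_cpu(cpu, online)
        __rt_dump_pcpu(ops, cpu);

    /* ... rest of the function unchanged; unlock at the end as now ... */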

+
+    printk("PCPU info:\n");
+    for_each_cpu(cpu, online)
+        rt_dump_pcpu(ops, cpu);
+
+    printk("Global RunQueue info:\n");
+    spin_lock_irqsave(&prv->lock, flags);
+    list_for_each( iter, runq )
+    {
+        svc = __q_elem(iter);
+        rt_dump_vcpu(ops, svc);
+    }
+
+    printk("Global DepletedQueue info:\n");
+    list_for_each( iter, depletedq )
+    {
+        svc = __q_elem(iter);
+        rt_dump_vcpu(ops, svc);
+    }
+
+    printk("Domain info:\n");
+    list_for_each( iter_sdom, &prv->sdom )
+    {
+        sdom = list_entry(iter_sdom, struct rt_dom, sdom_elem);
+        printk("\tdomain: %d\n", sdom->dom->domain_id);
+
+        list_for_each( iter_svc, &sdom->vcpu )
+        {
+            svc = list_entry(iter_svc, struct rt_vcpu, sdom_elem);
+            rt_dump_vcpu(ops, svc);
+        }
+    }
+
+    spin_unlock_irqrestore(&prv->lock, flags);
+}
+
+/*
+ * update deadline and budget when now >= cur_deadline
+ * it need to be updated to the deadline of the current period
+ */
+static void
+rt_update_deadline(s_time_t now, struct rt_vcpu *svc)
+{
+    ASSERT(now >= svc->cur_deadline);
+    ASSERT(svc->period != 0);
+
+    do
+        svc->cur_deadline += svc->period;
+    while ( svc->cur_deadline <= now );

Is there any possibility that this loop could run on for an unusually long time? Suppose a vcpu ran, then was put to sleep for a few months, then woke up again? Or suppose there was some other condition that could be triggered that would make this take a large number of iterations -- it might end up with a DoS at some point down the line.

Would it make sense to add some kind of a failsafe here to fall back to the slow method (using division and multiplication) if the difference is too large? We should be able to make the check really simple, like this:

if ( svc->cur_deadline + (svc->period << UPDATE_LIMIT_SHIFT) > now )
 [while loop];
else
 [multiplication and division];

Where UPDATE_LIMIT_SHIFT could be something that would limit the loop to a reasonable number of iterations; say, 10 (which would give you 1024 iterations).
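
Spelled out, that would be something like the following (only a sketch; UPDATE_LIMIT_SHIFT and the division fallback are my suggestion, not part of the patch):

#define UPDATE_LIMIT_SHIFT 10    /* fall back to division after ~1024 periods */

static void
rt_update_deadline(s_time_t now, struct rt_vcpu *svc)
{
    ASSERT(now >= svc->cur_deadline);
    ASSERT(svc->period != 0);

    if ( svc->cur_deadline + (svc->period << UPDATE_LIMIT_SHIFT) > now )
    {
        /* Common case: only a handful of periods have passed. */
        do
            svc->cur_deadline += svc->period;
        while ( svc->cur_deadline <= now );
    }
    else
    {
        /* The vcpu was asleep for a long time; count the missed periods
         * with a division rather than looping over all of them. */
        s_time_t missed = ( now - svc->cur_deadline ) / svc->period;

        svc->cur_deadline += ( missed + 1 ) * svc->period;
    }

    svc->cur_budget = svc->budget;

    /* TRACE block as in the patch */
}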

Another option would be to scrap the multiplication and addition altogether, and just set svc->cur_deadline = now + svc->period. If so much time has passed, it's unlikely that being aligned will be critical.

Another advantage of this is that we could use this update function both here and below when creating the vcpu (and setting the initial deadline).

Thoughts?

This isn't critical, so if we can't come to a consensus quickly it's probably fine to go in as it is; but if nobody has any objections, I think we should put in some kind of a check like this.

+
+    svc->cur_budget = svc->budget;
+
+    /* TRACE */
+    {
+        struct {
+            unsigned dom:16,vcpu:16;
+            unsigned cur_deadline_lo, cur_deadline_hi;
+            unsigned cur_budget_lo, cur_budget_hi;
+        } d;
+        d.dom = svc->vcpu->domain->domain_id;
+        d.vcpu = svc->vcpu->vcpu_id;
+        d.cur_deadline_lo = (unsigned) svc->cur_deadline;
+        d.cur_deadline_hi = (unsigned) (svc->cur_deadline >> 32);
+        d.cur_budget_lo = (unsigned) svc->cur_budget;
+        d.cur_budget_hi = (unsigned) (svc->cur_budget >> 32);
+        trace_var(TRC_RTDS_BUDGET_REPLENISH, 1,
+                  sizeof(d),
+                  (unsigned char *) &d);
+    }
+
+    return;
+}
+
[snip]

+/*
+ * This function is called in sched_move_domain() in schedule.c
+ * When move a domain to a new cpupool.
+ * It inserts vcpus of moving domain to the scheduler's RunQ in
+ * dest. cpupool; and insert rt_vcpu svc to scheduler-specific
+ * vcpu list of the dom
+ */
+static void
+rt_vcpu_insert(const struct scheduler *ops, struct vcpu *vc)
+{
+    struct rt_private *prv = rt_priv(ops);
+    struct rt_vcpu *svc = rt_vcpu(vc);
+
+    /* not addlocate idle vcpu to dom vcpu list */
+    if ( is_idle_vcpu(vc) )
+        return;
+
+    if ( !__vcpu_on_q(svc) && vcpu_runnable(vc) && !vc->is_running )
+    {
+        if ( svc->cur_budget > 0 )
+            __runq_insert(ops, svc);
+        else
+            list_add(&svc->q_elem, &prv->depletedq);

This check is already done in __runq_insert(); you don't need to duplicate it here.

[snip]
+/*
+ * schedule function for rt scheduler.
+ * The lock is already grabbed in schedule.c, no need to lock here
+ */
+static struct task_slice
+rt_schedule(const struct scheduler *ops, s_time_t now, bool_t tasklet_work_scheduled)
+{
+    const int cpu = smp_processor_id();
+    struct rt_private *prv = rt_priv(ops);
+    struct rt_vcpu *const scurr = rt_vcpu(current);
+    struct rt_vcpu *snext = NULL;
+    struct task_slice ret = { .migrated = 0 };
+
+    /* clear ticked bit now that we've been scheduled */
+    cpumask_clear_cpu(cpu, &prv->tickled);
+
+    /* burn_budget would return for IDLE VCPU */
+    burn_budget(ops, scurr, now);
+
+    __repl_update(ops, now);
+
+    if ( tasklet_work_scheduled )
+    {
+        snext = rt_vcpu(idle_vcpu[cpu]);
+    }
+    else
+    {
+        cpumask_t cur_cpu;
+        cpumask_clear(&cur_cpu);
+        cpumask_set_cpu(cpu, &cur_cpu);
+        snext = __runq_pick(ops, &cur_cpu);
+        if ( snext == NULL )
+            snext = rt_vcpu(idle_vcpu[cpu]);
+
+        /* if scurr has higher priority and budget, still pick scurr */
+        if ( !is_idle_vcpu(current) &&
+             vcpu_runnable(current) &&
+             scurr->cur_budget > 0 &&
+             ( is_idle_vcpu(snext->vcpu) ||
+               scurr->cur_deadline <= snext->cur_deadline ) )
+            snext = scurr;
+    }
+
+    if ( snext != scurr &&
+         !is_idle_vcpu(current) &&
+         vcpu_runnable(current) )
+        set_bit(__RTDS_delayed_runq_add, &scurr->flags);
+
+    snext->last_start = now;
+    if ( !is_idle_vcpu(snext->vcpu) )
+    {
+        if ( snext != scurr )
+        {
+            __q_remove(snext);
+            set_bit(__RTDS_scheduled, &snext->flags);
+        }
+        if ( snext->vcpu->processor != cpu )
+        {
+            snext->vcpu->processor = cpu;
+            ret.migrated = 1;
+        }
+    }
+
+    ret.time = MILLISECS(1); /* sched quantum */
+    ret.task = snext->vcpu;

Isn't having a fixed 1ms quantum going to break with microsecond-granularity periods and budgets? If someone sets their period sub-millisecond, won't it be allowed to run for a full ms (even though its budget is smaller), and then, because the period has passed, be given a new budget again? This will allow vcpus with sub-millisecond periods to starve out those with longer periods, won't it? I'm not sure how two sub-millisecond-period vms will interact, but I doubt it would be what anybody's expecting.

What about #define MAX_SCHEDULE MILLISECS(1), then set ret.time = MIN(snext->budget, MAX_SCHEDULE)?
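
I.e. something like this at the tail of rt_schedule() (sketch only; MAX_SCHEDULE is a name I'm making up, and the idle case may need different handling):

#define MAX_SCHEDULE MILLISECS(1)    /* upper bound on a single time slice */

    /* Never hand out more time than the vcpu has budget left, so
     * sub-millisecond budgets are actually honoured. */
    if ( is_idle_vcpu(snext->vcpu) )
        ret.time = MAX_SCHEDULE;
    else
        ret.time = snext->cur_budget < MAX_SCHEDULE ?
                       snext->cur_budget : MAX_SCHEDULE;
    ret.task = snext->vcpu;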

It might also make sense to run some experiments to determine a minimum reliable period / budget, and not allow people to set values below that.

(We could consider allowing MAX_SCHEDULE to be configurable at some point as well.)

+
+    /* TRACE */
+    {
+        struct {
+            unsigned dom:16,vcpu:16;
+            unsigned cur_deadline_lo, cur_deadline_hi;
+            unsigned cur_budget_lo, cur_budget_hi;
+        } d;
+        d.dom = snext->vcpu->domain->domain_id;
+        d.vcpu = snext->vcpu->vcpu_id;
+        d.cur_deadline_lo = (unsigned) snext->cur_deadline;
+        d.cur_deadline_hi = (unsigned) (snext->cur_deadline >> 32);
+        d.cur_budget_lo = (unsigned) snext->cur_budget;
+        d.cur_budget_hi = (unsigned) (snext->cur_budget >> 32);
+        trace_var(TRC_RTDS_SCHED_TASKLET, 1,
+                  sizeof(d),
+                  (unsigned char *)&d);
+    }
+
+    return ret;
+}
+
+/*
+ * Remove VCPU from RunQ
+ * The lock is already grabbed in schedule.c, no need to lock here
+ */
+static void
+rt_vcpu_sleep(const struct scheduler *ops, struct vcpu *vc)
+{
+    struct rt_vcpu * const svc = rt_vcpu(vc);
+
+    BUG_ON( is_idle_vcpu(vc) );
+
+    if ( curr_on_cpu(vc->processor) == vc )
+        cpu_raise_softirq(vc->processor, SCHEDULE_SOFTIRQ);
+    else if ( __vcpu_on_q(svc) )
+        __q_remove(svc);
+    else if ( test_bit(__RTDS_delayed_runq_add, &svc->flags) )
+        clear_bit(__RTDS_delayed_runq_add, &svc->flags);
+}
+
+/*
+ * Pick a cpu where to run a vcpu, possibly kicking out the vcpu running there
+ * Called by wake() and context_saved()
+ * We have a running candidate here, the kick logic is:
+ * Among all the cpus that are within the cpu affinity
+ * 1) if the new->cpu is idle, kick it. This could benefit cache hit
+ * 2) if there are any idle vcpu, kick it.
+ * 3) now all pcpus are busy, among all the running vcpus, pick lowest priority one
+ *    if snext has higher priority, kick it.
+ *
+ * TODO:
+ * 1) what if these two vcpus belongs to the same domain?
+ *    replace a vcpu belonging to the same domain introduces more overhead
+ *
+ * lock is grabbed before calling this function
+ */
+static void
+runq_tickle(const struct scheduler *ops, struct rt_vcpu *new)
+{
+    struct rt_private *prv = rt_priv(ops);
+    struct rt_vcpu *latest_deadline_vcpu = NULL;    /* lowest priority scheduled */
+    struct rt_vcpu *iter_svc;
+    struct vcpu *iter_vc;
+    int cpu = 0, cpu_to_tickle = 0;
+    cpumask_t not_tickled;
+    cpumask_t *online;
+
+    if ( new == NULL || is_idle_vcpu(new->vcpu) )
+        return;
+
+    online = cpupool_scheduler_cpumask(new->vcpu->domain->cpupool);
+    cpumask_and(&not_tickled, online, new->vcpu->cpu_hard_affinity);
+    cpumask_andnot(&not_tickled, &not_tickled, &prv->tickled);
+
+    /* 1) if new's previous cpu is idle, kick it for cache benefit */
+    if ( is_idle_vcpu(curr_on_cpu(new->vcpu->processor)) )
+    {
+        cpu_to_tickle = new->vcpu->processor;
+        goto out;
+    }
+
+    /* 2) if there are any idle pcpu, kick it */
+    /* The same loop also find the one with lowest priority */
+    for_each_cpu(cpu, &not_tickled)
+    {
+        iter_vc = curr_on_cpu(cpu);
+        if ( is_idle_vcpu(iter_vc) )
+        {
+            cpu_to_tickle = cpu;
+            goto out;
+        }
+        iter_svc = rt_vcpu(iter_vc);
+        if ( latest_deadline_vcpu == NULL ||
+             iter_svc->cur_deadline > latest_deadline_vcpu->cur_deadline )
+            latest_deadline_vcpu = iter_svc;
+    }
+
+    /* 3) candidate has higher priority, kick out the lowest priority vcpu */
+    if ( latest_deadline_vcpu != NULL && new->cur_deadline < latest_deadline_vcpu->cur_deadline )
+    {
+        cpu_to_tickle = latest_deadline_vcpu->vcpu->processor;
+        goto out;
+    }
+
+out:
+    /* TRACE */
+    {
+        struct {
+            unsigned cpu:8, pad:24;
+        } d;
+        d.cpu = cpu_to_tickle;
+        d.pad = 0;
+        trace_var(TRC_RTDS_TICKLE, 0,
+                  sizeof(d),
+                  (unsigned char *)&d);
+    }
+
+    cpumask_set_cpu(cpu_to_tickle, &prv->tickled);
+    cpu_raise_softirq(cpu_to_tickle, SCHEDULE_SOFTIRQ);

Won't the "fall-through" here mean that if it doesn't find a suitable cpu on which to place the new vcpu, it will *always* pointlessly tickle cpu 0 (the initial value of cpu_to_tickle)?

It seems like you should either put a return before the out: label (possibly with another trace to say the tickle didn't do anything), or set cpu_to_tickle to -1 and then only set the tickled bit and raise the softirq if cpu_to_tickle >= 0.

Also, I think at the moment Xen supports up to 4096 CPUs on some systems, so you should probably make cpu at least 16 bits (if not just a full word).
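
The second variant might look something like this (sketch only; it also widens the traced cpu field as just suggested):

    int cpu = 0, cpu_to_tickle = -1;    /* -1 == nobody worth tickling */

    /* ... steps 1)-3) as in the patch, each setting cpu_to_tickle and
     * jumping to out: when a suitable pcpu has been found ... */

out:
    if ( cpu_to_tickle >= 0 )
    {
        /* TRACE */
        {
            struct {
                unsigned cpu:16, pad:16;   /* 16 bits: Xen can have 4096 pcpus */
            } d;
            d.cpu = cpu_to_tickle;
            d.pad = 0;
            trace_var(TRC_RTDS_TICKLE, 0, sizeof(d), (unsigned char *)&d);
        }
        cpumask_set_cpu(cpu_to_tickle, &prv->tickled);
        cpu_raise_softirq(cpu_to_tickle, SCHEDULE_SOFTIRQ);
    }

    return;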

+    return;
+}
+
+/*
+ * Should always wake up runnable vcpu, put it back to RunQ.
+ * Check priority to raise interrupt
+ * The lock is already grabbed in schedule.c, no need to lock here
+ * TODO: what if these two vcpus belongs to the same domain?
+ */
+static void
+rt_vcpu_wake(const struct scheduler *ops, struct vcpu *vc)
+{
+    struct rt_vcpu * const svc = rt_vcpu(vc);
+    s_time_t now = NOW();
+    struct rt_private *prv = rt_priv(ops);
+    struct rt_vcpu *snext = NULL;        /* highest priority on RunQ */
+    struct rt_dom *sdom = NULL;
+    cpumask_t *online;
+
+    BUG_ON( is_idle_vcpu(vc) );
+
+    if ( unlikely(curr_on_cpu(vc->processor) == vc) )
+        return;
+
+    /* on RunQ/DepletedQ, just update info is ok */
+    if ( unlikely(__vcpu_on_q(svc)) )
+        return;
+
+    /* If context hasn't been saved for this vcpu yet, we can't put it on
+     * the Runqueue/DepletedQ. Instead, we set a flag so that it will be
+     * put on the Runqueue/DepletedQ after the context has been saved.
+     */
+    if ( unlikely(test_bit(__RTDS_scheduled, &svc->flags)) )
+    {
+        set_bit(__RTDS_delayed_runq_add, &svc->flags);
+        return;
+    }
+
+    if ( now >= svc->cur_deadline)
+        rt_update_deadline(now, svc);
+
+    /* insert svc to runq/depletedq because svc is not in queue now */
+    if ( svc->cur_budget > 0 )
+        __runq_insert(ops, svc);
+    else
+        list_add(&svc->q_elem, &prv->depletedq);

This check is already done inside of __runq_insert() -- you don't need to duplicate it here.

+
+    __repl_update(ops, now);
+
+    ASSERT(!list_empty(&prv->sdom));
+    sdom = list_entry(prv->sdom.next, struct rt_dom, sdom_elem);
+    online = cpupool_scheduler_cpumask(sdom->dom->cpupool);
+    snext = __runq_pick(ops, online);    /* pick snext from ALL valid cpus */
+
+    runq_tickle(ops, snext);

Is there a need to check the status of the "front of the queue" on every wake? In theory the only vcpu which we want to check is the one we just woke up. Would it make sense instead to do something like if(snext == svc) runq_tickle()?
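
I.e. the end of rt_vcpu_wake() would become something like (sketch):

    __repl_update(ops, now);

    ASSERT(!list_empty(&prv->sdom));
    sdom = list_entry(prv->sdom.next, struct rt_dom, sdom_elem);
    online = cpupool_scheduler_cpumask(sdom->dom->cpupool);
    snext = __runq_pick(ops, online);

    /* Only tickle if the vcpu we just woke ended up at the front of the
     * runqueue; everything else was already queued before this wakeup. */
    if ( snext == svc )
        runq_tickle(ops, snext);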

I was about to say that since our scheduling quantum is so short (1ms), it almost seems like we could get away without tickling entirely; or, that we could just check to see if it's got a high enough priority to run on the cpu it's currently running on, and just run it there, without checking everywhere else.

But then I realized that you can specify budget and period in microseconds, which would completely not work without tickling; that's when I realized the problem with us-granularity period with a 1-ms scheduling quantum.

 -George


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

