
Re: [Xen-devel] [PATCH v1 1/4] xen: add real time scheduler rt



On Sun, 2014-08-24 at 18:58 -0400, Meng Xu wrote:
> The mechanism of how to burn a VCPU's budget depends on the server mechanism
> implemented for each VCPU.
> 
> Server mechanism: a VCPU is implemented as a deferable server.
> When a VCPU is scheduled to execute on a PCPU, its budget is continuously
> burned.
> 
Right now, we only support one 'server mechanism', i.e., the one
introduced in this very patch. I appreciate that you're trying to
highlight the fact that the budget burning and replenishment logic is
flexible, and that it is easy to change it or add another one but, as
of now, I fear this is more confusing than anything else.

So, I'd say, just kill the first two lines above, avoid mentioning
'Server mechanism:' and go ahead describing how the budget is burned. It
will be when we add another policy that we'll explain the analogies and
the differences; for now, it's rather pointless.

> Priority scheme: Preemptive Global Earliest Deadline First (gEDF).
> At any scheduling point, the VCPU with earliest deadline has highest
> priority.
> 
Same here. Put this paragraph above, when you first mention EDF, as a
quick explanation of what EDF is (e.g., below the first line "This
scheduler follows the pre-emptive Global EDF theory in real-time
field."). Again, since for now we support only that priority scheme, no
point in hinting at other ones.

Oh, and one thing that may be good to have here is a few words about
how the budget and period of a vcpu relate to its scheduling deadline.
In fact, as the changelog stands, one reads about a vcpu having a period
and a budget, and being scheduled according to its deadline, without any
indication of how it is given one! :-)

It's evident to you and me, because we know the algorithm very well, but
it may be useful to someone looking at it for the first time and trying
to understand what's happening.

> Scheduling quantum: 1 ms;
> 
What does this mean? I mean, what effect does this have on the
algorithm, as it's described above? It's the granularity of budget
accounting, right? Either state it, or avoid mentioning the information
at all.

> This is still in the development phase.
> 
Ditto (in another message) about this.

> --- /dev/null
> +++ b/xen/common/sched_rt.c
> @@ -0,0 +1,1205 @@
> +/******************************************************************************
> + * Preemptive Global Earliest Deadline First  (EDF) scheduler for Xen
> + * EDF scheduling is one of most popular real-time scheduling algorithm used 
> in
> + * embedded field.
> + *
Is it? :-)

It's true it has been around since 1973, and it's a very effective,
efficient and, despite its age, advanced one, but I'm not sure how
broadly it is adopted in practice.

Actually, that's what makes this very cool: we're going to be among the
few platforms that ship it! :-P

> +/*
> + * Design:
> + *
> + * This scheduler follows the Preemptive Global EDF theory in real-time 
> field.
> + * Each VCPU can have a dedicated period and budget. 
> + * While scheduled, a VCPU burns its budget.
> + * A VCPU has its budget replenished at the beginning of each of its periods;
> + * The VCPU discards its unused budget at the end of each of its periods.
> + * If a VCPU runs out of budget in a period, it has to wait until next 
> period.
> + * The mechanism of how to burn a VCPU's budget depends on the server 
> mechanism
> + * implemented for each VCPU.
> + *
> + * Server mechanism: a VCPU is implemented as a deferable server.
> + * When a VCPU has a task running on it, its budget is continuously burned;
> + * When a VCPU has no task but with budget left, its budget is preserved.
> + *
> + * Priority scheme: Preemptive Global Earliest Deadline First (gEDF).
> + * At any scheduling point, the VCPU with earliest deadline has highest 
> priority.
> + *
> + * Queue scheme: A global runqueue for each CPU pool. 
> + * The runqueue holds all runnable VCPUs. 
> + * VCPUs in the runqueue are divided into two parts: with and without 
> remaining budget. 
> + * At each part, VCPUs are sorted based on EDF priority scheme.
> + *
> + * Scheduling quanta: 1 ms; but accounting the budget is in microsecond.
> + *
> + * Note: cpumask and cpupool is supported.
> + */
> +
Of course, all I said about the changelog applies here too.

> +/*
> + * Locking:
> + * Just like credit2, a global system lock is used to protect the RunQ.
> + * The global lock is referenced by schedule_data.schedule_lock from all 
> physical cpus.
> + *
Well, credit2 has one RunQ per socket (or so it should, bugs [that I'm
trying to fix] apart). In any case, why reference it? Just go ahead and
describe your solution. :-)

> +#define RT_DEFAULT_PERIOD     (MICROSECS(10))
> +#define RT_DEFAULT_BUDGET     (MICROSECS(4))
> +
RT_DS? Maybe not so important for now, though, as it's entirely an
implementation detail.

> +/*
> + * Virtual CPU
> + */
> +struct rt_vcpu {
> +    struct list_head runq_elem; /* On the runqueue list */
> +    struct list_head sdom_elem; /* On the domain VCPU list */
> +
> +    /* Up-pointers */
> +    struct rt_dom *sdom;
> +    struct vcpu *vcpu;
> +
> +    /* VCPU parameters, in milliseconds */
> +    s_time_t period;
> +    s_time_t budget;
> +
> +    /* VCPU current infomation in nanosecond */
> +    long cur_budget;             /* current budget */
>
Perhaps we said this already, in which case remind me: why is cur_budget
a long and not an s_time_t, like the other params? It's a time value...

> +/*
> + * Useful inline functions
> + */
> +static inline struct rt_private *RT_PRIV(const struct scheduler *_ops)
> +{
> +    return _ops->sched_data;
> +}
> +
> +static inline struct rt_vcpu *RT_VCPU(const struct vcpu *_vcpu)
> +{
> +    return _vcpu->sched_priv;
> +}
> +
> +static inline struct rt_dom *RT_DOM(const struct domain *_dom)
> +{
> +    return _dom->sched_priv;
> +}
> +
> +static inline struct list_head *RUNQ(const struct scheduler *_ops)
> +{
> +    return &RT_PRIV(_ops)->runq;
> +}
> +
I see what's happening, and I remember the suggestion of using static
inline-s, with which I agree. At that point, however, I'd have the
function names lower-cased.

It's probably not a big deal, and I don't think we have any official
policy about this, but it looks more consistent to me.

Oh and, also, since they're functions now, there's no need for the '_'
in the arguments' names.

> +//#define RT_PRIV(_ops)     ((struct rt_private *)((_ops)->sched_data))
> +//#define RT_VCPU(_vcpu)    ((struct rt_vcpu *)(_vcpu)->sched_priv)
> +//#define RT_DOM(_dom)      ((struct rt_dom *)(_dom)->sched_priv)
> +//#define RUNQ(_ops)              (&RT_PRIV(_ops)->runq)
> +
These obviously need to go away. :-)

> +/*
> + * RunQueue helper functions
> + */
> +static int
> +__vcpu_on_runq(const struct rt_vcpu *svc)
> +{
> +   return !list_empty(&svc->runq_elem);
> +}
> +
> +static struct rt_vcpu *
> +__runq_elem(struct list_head *elem)
> +{
> +    return list_entry(elem, struct rt_vcpu, runq_elem);
> +}
> +
> +/*
> + * Debug related code, dump vcpu/cpu information
> + */
> +static void
> +rt_dump_vcpu(const struct rt_vcpu *svc)
> +{
> +    char cpustr[1024];
> +
> +    ASSERT(svc != NULL);
> +    /* flag vcpu */
> +    if( svc->sdom == NULL )
> +        return;
> +
> +    cpumask_scnprintf(cpustr, sizeof(cpustr), svc->vcpu->cpu_hard_affinity);
> +    printk("[%5d.%-2u] cpu %u, (%"PRId64", %"PRId64"), cur_b=%"PRId64" 
> cur_d=%"PRId64" last_start=%"PRId64" onR=%d runnable=%d cpu_hard_affinity=%s 
> ",
>
long line.

Also, we have PRI_stime (look in xen/include/xen/time.h).

> +            svc->vcpu->domain->domain_id,
> +            svc->vcpu->vcpu_id,
> +            svc->vcpu->processor,
> +            svc->period,
> +            svc->budget,
> +            svc->cur_budget,
> +            svc->cur_deadline,
> +            svc->last_start,
> +            __vcpu_on_runq(svc),
> +            vcpu_runnable(svc->vcpu),
> +            cpustr);
> +    memset(cpustr, 0, sizeof(cpustr));
> +    cpumask_scnprintf(cpustr, sizeof(cpustr), 
> cpupool_scheduler_cpumask(svc->vcpu->domain->cpupool));
here too.

> +    printk("cpupool=%s\n", cpustr);
> +
> +    /* TRACE */
> +    {
> +        struct {
> +            unsigned dom:16,vcpu:16;
> +            unsigned processor;
> +            unsigned cur_budget_lo, cur_budget_hi, cur_deadline_lo, 
> cur_deadline_hi;
> +            unsigned is_vcpu_on_runq:16,is_vcpu_runnable:16;
> +        } d;
> +        d.dom = svc->vcpu->domain->domain_id;
> +        d.vcpu = svc->vcpu->vcpu_id;
> +        d.processor = svc->vcpu->processor;
> +        d.cur_budget_lo = (unsigned) svc->cur_budget;
> +        d.cur_budget_hi = (unsigned) (svc->cur_budget >> 32);
> +        d.cur_deadline_lo = (unsigned) svc->cur_deadline;
> +        d.cur_deadline_hi = (unsigned) (svc->cur_deadline >> 32);
> +        d.is_vcpu_on_runq = __vcpu_on_runq(svc);
> +        d.is_vcpu_runnable = vcpu_runnable(svc->vcpu);
> +        trace_var(TRC_RT_VCPU_DUMP, 1,
> +                  sizeof(d),
> +                  (unsigned char *)&d);
> +    }
Not too big a deal, but is it really useful to trace vcpu dumps? How?

> +/*
> + * should not need lock here. only showing stuff 
> + */
> +static void
> +rt_dump(const struct scheduler *ops)
> +{
> +    struct list_head *iter_sdom, *iter_svc, *runq, *iter;
> +    struct rt_private *prv = RT_PRIV(ops);
> +    struct rt_vcpu *svc;
> +    unsigned int cpu = 0;
> +    unsigned int loop = 0;
> +
> +    printtime();
> +    printk("Priority Scheme: EDF\n");
> +
No value in printing this.

> +/*
> + * Insert a vcpu in the RunQ based on vcpus' deadline: 
> + * EDF schedule policy: vcpu with smaller deadline has higher priority;
> + * The vcpu svc to be inserted will be inserted just before the very first 
> + * vcpu iter_svc in the Runqueue whose deadline is equal or larger than 
> + * svc's deadline.
> + */
>
/*
 * Insert svc in the RunQ according to EDF: vcpus with smaller
 * deadlines go first.
 */

Or, if you want, something like (possibly turned into proper English :-P)
"a vcpu with a smaller deadline goes before one with a larger
deadline", but I'd avoid mentioning 'priority'.

Also, what follows that part is not adding much, and it's evident from
the code.

> +static void
> +__runq_insert(const struct scheduler *ops, struct rt_vcpu *svc)
> +{
> +    struct rt_private *prv = RT_PRIV(ops);
> +    struct list_head *runq = RUNQ(ops);
> +    struct list_head *iter;
> +    spinlock_t *schedule_lock;
> +    
> +    schedule_lock = per_cpu(schedule_data, 
> svc->vcpu->processor).schedule_lock;
> +    ASSERT( spin_is_locked(schedule_lock) );
> +    
> +    /* Debug only */
> +    if ( __vcpu_on_runq(svc) )
> +    {
> +        rt_dump(ops);
> +    }
> +    ASSERT( !__vcpu_on_runq(svc) );
> +
The 'Debug only' comment and the subsequent if must be killed, of
course! The ASSERT is already offering all the necessary debugging
help. :-)

> +    /* svc still has budget */
> +    if ( svc->cur_budget > 0 ) 
> +    {
> +        list_for_each(iter, runq) 
> +        {
> +            struct rt_vcpu * iter_svc = __runq_elem(iter);
> +            if ( iter_svc->cur_budget == 0 ||
> +                 svc->cur_deadline <= iter_svc->cur_deadline )
> +                    break;
> +         }
> +        list_add_tail(&svc->runq_elem, iter);
> +     }
> +    else 
> +    { /* svc has no budget */
>
I'd say put the comment on its own line, but I think you can well kill
it.
> +        list_add(&svc->runq_elem, &prv->flag_vcpu->runq_elem);
>
Mmm... flag_vcpu, eh? I missed it above, where it's declared, but I
really don't like the name. 'depleted_vcpus'? Or something else (sorry,
not very creative at this very moment :-))... but flag_vcpu, I find it
confusing.

> +/* 
> + * point per_cpu spinlock to the global system lock; all cpu have same 
> global system lock 
long line.

> + */
> +static void *
> +rt_alloc_pdata(const struct scheduler *ops, int cpu)
> +{
> +    struct rt_private *prv = RT_PRIV(ops);
> +
> +    cpumask_set_cpu(cpu, &prv->cpus);
> +
> +    per_cpu(schedule_data, cpu).schedule_lock = &prv->lock;
> +
> +    printtime();
> +    printk("%s total cpus: %d", __func__, cpumask_weight(&prv->cpus));
> +    /* same as credit2, not a bogus pointer */
> +    return (void *)1;
>
Can you put in the comment the reason why it is ok to do this, without
referencing credit2?

> +static void *
> +rt_alloc_domdata(const struct scheduler *ops, struct domain *dom)
> +{
> +    unsigned long flags;
> +    struct rt_dom *sdom;
> +    struct rt_private * prv = RT_PRIV(ops);
> +
> +    printtime();
> +    printk("dom=%d\n", dom->domain_id);
> +
I don't think we want to spam the console at each allocation,
deallocation, initialization and destruction.

Put in a tracepoint, if you want this information to be available while
debugging, but please, avoid the printk.

> +    sdom = xzalloc(struct rt_dom);
> +    if ( sdom == NULL ) 
> +    {
> +        printk("%s, xzalloc failed\n", __func__);
> +        return NULL;
> +    }
> +
> +    INIT_LIST_HEAD(&sdom->vcpu);
> +    INIT_LIST_HEAD(&sdom->sdom_elem);
> +    sdom->dom = dom;
> +
> +    /* spinlock here to insert the dom */
> +    spin_lock_irqsave(&prv->lock, flags);
> +    list_add_tail(&sdom->sdom_elem, &(prv->sdom));
> +    spin_unlock_irqrestore(&prv->lock, flags);
> +
> +    return sdom;
> +}
> +
> +static void
> +rt_free_domdata(const struct scheduler *ops, void *data)
> +{
> +    unsigned long flags;
> +    struct rt_dom *sdom = data;
> +    struct rt_private *prv = RT_PRIV(ops);
> +
> +    printtime();
> +    printk("dom=%d\n", sdom->dom->domain_id);
> +
Ditto.

> +    spin_lock_irqsave(&prv->lock, flags);
> +    list_del_init(&sdom->sdom_elem);
> +    spin_unlock_irqrestore(&prv->lock, flags);
> +    xfree(data);
> +}
> +
> +static int
> +rt_dom_init(const struct scheduler *ops, struct domain *dom)
> +{
> +    struct rt_dom *sdom;
> +
> +    printtime();
> +    printk("dom=%d\n", dom->domain_id);
> +
And here.

> +    /* IDLE Domain does not link on rt_private */
> +    if ( is_idle_domain(dom) ) 
> +        return 0;
> +
> +    sdom = rt_alloc_domdata(ops, dom);
> +    if ( sdom == NULL ) 
> +    {
> +        printk("%s, failed\n", __func__);
> +        return -ENOMEM;
> +    }
> +    dom->sched_priv = sdom;
> +
> +    return 0;
> +}
> +
> +static void
> +rt_dom_destroy(const struct scheduler *ops, struct domain *dom)
> +{
> +    printtime();
> +    printk("dom=%d\n", dom->domain_id);
> +
And here too.

> +    rt_free_domdata(ops, RT_DOM(dom));
> +}
> +
> +static void *
> +rt_alloc_vdata(const struct scheduler *ops, struct vcpu *vc, void *dd)
> +{
> +    struct rt_vcpu *svc;
> +    s_time_t now = NOW();
> +    long count;
> +
> +    /* Allocate per-VCPU info */
> +    svc = xzalloc(struct rt_vcpu);
> +    if ( svc == NULL ) 
> +    {
> +        printk("%s, xzalloc failed\n", __func__);
> +        return NULL;
> +    }
> +
> +    INIT_LIST_HEAD(&svc->runq_elem);
> +    INIT_LIST_HEAD(&svc->sdom_elem);
> +    svc->flags = 0U;
> +    svc->sdom = dd;
> +    svc->vcpu = vc;
> +    svc->last_start = 0;            /* init last_start is 0 */
> +
As it is, the comment is not very useful... we can see you're
initializing it to zero. If there's something about why you do it that
you think is worth mentioning, go ahead; otherwise, kill it.

> +    svc->period = RT_DEFAULT_PERIOD;
> +    if ( !is_idle_vcpu(vc) )
> +        svc->budget = RT_DEFAULT_BUDGET;
> +
> +    count = (now/MICROSECS(svc->period)) + 1;
> +    /* sync all VCPU's start time to 0 */
> +    svc->cur_deadline += count * MICROSECS(svc->period);
> +
why "+="? What value do you expect cur_deadline to hold at this time?

Also, unless I'm missing something, or doing the math wrong, what you
want to do here is place the deadline one period ahead of NOW().

In my head, this is just:

    svc->cur_deadline = NOW() + svc->period;

What is it that fiddling with count is buying you that I don't see?

About time units, can we do the conversions, once and for all, when the
parameters are assigned to the domain/vcpu and, here inside the core of
the scheduler, only deal with quantities homogeneous with NOW() (since
this scheduler needs to compare and add to what NOW() returns a lot)?

However fast multiplication is these days, or will become in future
hardware, I don't see the point of playing all this *1000 and/or
MICROSECS() game down here (and even less in hotter paths).

> +    svc->cur_budget = svc->budget*1000; /* counting in microseconds level */
>
Another perfect example. :-)

> +    /* Debug only: dump new vcpu's info */
> +    printtime();
> +    rt_dump_vcpu(svc);
> +
As said before, this has to go, quite possibly by replacing it with
some tracing.

> +    return svc;
> +}
> +
> +static void
> +rt_free_vdata(const struct scheduler *ops, void *priv)
> +{
> +    struct rt_vcpu *svc = priv;
> +
> +    /* Debug only: dump freed vcpu's info */
> +    printtime();
> +    rt_dump_vcpu(svc);
>
To be ditched.

> +    xfree(svc);
> +}
> +
> +/*
> + * TODO: Do we need to add vc to the new Runqueue?
> + * This function is called in sched_move_domain() in schedule.c
> + * When move a domain to a new cpupool, 
> + * may have to add vc to the Runqueue of the new cpupool
> + */
> +static void
> +rt_vcpu_insert(const struct scheduler *ops, struct vcpu *vc)
> +{
> +    struct rt_vcpu *svc = RT_VCPU(vc);
> +
> +    /* Debug only: dump info of vcpu to insert */
> +    printtime();
> +    rt_dump_vcpu(svc);
> +
Ditch.

> +    /* not addlocate idle vcpu to dom vcpu list */
> +    if ( is_idle_vcpu(vc) )
> +        return;
> +
> +    list_add_tail(&svc->sdom_elem, &svc->sdom->vcpu);   /* add to dom vcpu 
> list */
>
Comment above, to prevent the line from becoming too long.

> +}
> +
> +/*
> + * TODO: same as rt_vcpu_insert()
> + */
> +static void
> +rt_vcpu_remove(const struct scheduler *ops, struct vcpu *vc)
> +{
> +    struct rt_vcpu * const svc = RT_VCPU(vc);
> +    struct rt_dom * const sdom = svc->sdom;
> +
> +    printtime();
> +    rt_dump_vcpu(svc);
> +
> +    BUG_ON( sdom == NULL );
> +    BUG_ON( __vcpu_on_runq(svc) );
> +
> +    if ( !is_idle_vcpu(vc) ) 
> +        list_del_init(&svc->sdom_elem);
> +}
> +
I remember commenting about these two functions already, during the RFC
phase. It does not look like they've changed much in response to those
comments.

Can you please re-look at
http://lists.xen.org/archives/html/xen-devel/2014-07/msg01624.html , and
either reply/explain, or cope with that? In summary, what I was asking
was whether adding/removing the vcpu to the list of vcpus of a domain
could be done somewhere or somehow else, as doing it like this means,
for some code-path, removing and then re-adding it right away.

Looking more carefully, I do see that credit2 does the same thing but,
for one, that doesn't make it perfect (I still don't like it :-P) and,
for two, credit2 does a bunch of other stuff there that you don't.

Looking even more carefully, this seems to be related to the TODO you
put (which may be a new addition wrt RFCv2, I can't remember). So, yes,
I seriously think you should take care of the fact that, when this
function is called for moving a domain from a cpupool to another, you
need to move the vcpus from the old cpupool's runqueue to the new one's
runqueue.

Have you tested doing such operation (moving domains between cpupools)?
Because, with the code looking like this, it seems a bit unlikely...

Finally, the insert_vcpu hook is called from sched_init_vcpu() too,
still in schedule.c. And in fact, all the other schedulers use this
function to put a vcpu on its runqueue for the first time. Once you do
this, you get fine behaviour, cpupool-wise, almost for free. You
apparently don't. Why is that? If you'd go for a similar approach, you
would get the same benefit. Where is it that you insert a vcpu in the
RunQ for the first time? Again, why there and not here?

> +/* 
> + * Pick a valid CPU for the vcpu vc
> + * Valid CPU of a vcpu is intesection of vcpu's affinity and available cpus
>
long line again. Please, fix all of these, even the ones I may be
missing in this review.

> + */
> +static int
> +rt_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
> +{
> +    cpumask_t cpus;
> +    cpumask_t *online;
> +    int cpu;
> +    struct rt_private * prv = RT_PRIV(ops);
> +
> +    online = cpupool_scheduler_cpumask(vc->domain->cpupool);
> +    cpumask_and(&cpus, &prv->cpus, online);
> +    cpumask_and(&cpus, &cpus, vc->cpu_hard_affinity);
> +
> +    cpu = cpumask_test_cpu(vc->processor, &cpus)
> +            ? vc->processor 
> +            : cpumask_cycle(vc->processor, &cpus);
> +    ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) );
> +
> +    return cpu;
> +}
> +
> +/*
> + * Burn budget at microsecond level. 
> + */
> +static void
> +burn_budgets(const struct scheduler *ops, struct rt_vcpu *svc, s_time_t now) 
> +{
> +    s_time_t delta;
> +    long count = 0;
> +
> +    /* don't burn budget for idle VCPU */
> +    if ( is_idle_vcpu(svc->vcpu) ) 
> +    {
> +        return;
> +    }
> +
> +    /* first time called for this svc, update last_start */
> +    if ( svc->last_start == 0 ) 
> +    {
> +        svc->last_start = now;
> +        return;
> +    }
> +
Should not last_start be set to NOW() the first time that the vcpu is
selected for execution, rather than the first time this function is
called on the vcpu itself?

> +    /*
> +     * update deadline info: When deadline is in the past,
> +     * it need to be updated to the deadline of the current period,
> +     * and replenish the budget 
> +     */
> +    delta = now - svc->cur_deadline;
> +    if ( delta >= 0 ) 
> +    {
> +        count = ( delta/MICROSECS(svc->period) ) + 1;
> +        svc->cur_deadline += count * MICROSECS(svc->period);
> +        svc->cur_budget = svc->budget * 1000;
> +
Ditto, about time units and conversions.

I also am sure I said before that I prefer an approach like:

    while ( svc->cur_deadline < now )
        svc->cur_deadline += svc->period;
    svc->cur_budget = svc->budget;

Not necessarily for speed reasons, but because it's a lot easier to read
and understand.

> +        /* TRACE */
> +        {
> +            struct {
> +                unsigned dom:16,vcpu:16;
> +                unsigned cur_budget_lo, cur_budget_hi;
> +            } d;
> +            d.dom = svc->vcpu->domain->domain_id;
> +            d.vcpu = svc->vcpu->vcpu_id;
> +            d.cur_budget_lo = (unsigned) svc->cur_budget;
> +            d.cur_budget_hi = (unsigned) (svc->cur_budget >> 32);
> +            trace_var(TRC_RT_BUDGET_REPLENISH, 1,
> +                      sizeof(d),
> +                      (unsigned char *) &d);
> +        }
> +
> +        return;
> +    }
> +
> +    /* burn at nanoseconds level */
>
The function's doc comments says 'microseconds'. Please, agree with
yourself. :-D :-D

> +    delta = now - svc->last_start;
> +    /* 
> +     * delta < 0 only happens in nested virtualization;
> +     * TODO: how should we handle delta < 0 in a better way? */
>
Ah, nice to hear you tested this in nested virt! What config is that,
Xen on top of VMWare? (I'm assuming this from what I've seen on the
rt-xen mailing list.)

BTW, comment style: '*/' goes to newline.

> +    if ( delta < 0 ) 
> +    {
> +        printk("%s, ATTENTION: now is behind last_start! delta = %ld for ",
> +                __func__, delta);
> +        rt_dump_vcpu(svc);
> +        svc->last_start = now;  /* update last_start */
> +        svc->cur_budget = 0;   /* FIXME: should we recover like this? */
> +        return;
> +    }
> +
Bah, this just should not happen and, if it does, it's either a bug or
a hardware problem, isn't it? I don't think you should do much more
than this. I'd remove the rt_dump_vcpu() but, in this case, I'm fine
with the printk(), as it's something that really should not happen, and
we want to inform the sysadmin that something has gone very bad. :-)

Long lines, BTW.

> +    if ( svc->cur_budget == 0 ) 
> +        return;
> +
> +    svc->cur_budget -= delta;
> +    if ( svc->cur_budget < 0 ) 
> +        svc->cur_budget = 0;
> +
I see you're not dealing with overruns (where by 'dealing' I mean trying
to compensate for them). That's fine for now; let's leave this as a
future development.

> +/* 
> + * RunQ is sorted. Pick first one within cpumask. If no one, return NULL
> + * lock is grabbed before calling this function 
> + */
> +static struct rt_vcpu *
> +__runq_pick(const struct scheduler *ops, cpumask_t mask)
> +{
> +    struct list_head *runq = RUNQ(ops);
> +    struct list_head *iter;
> +    struct rt_vcpu *svc = NULL;
> +    struct rt_vcpu *iter_svc = NULL;
> +    cpumask_t cpu_common;
> +    cpumask_t *online;
> +    struct rt_private * prv = RT_PRIV(ops);
> +
> +    list_for_each(iter, runq) 
> +    {
> +        iter_svc = __runq_elem(iter);
> +
> +        /* flag vcpu */
> +        if(iter_svc->sdom == NULL)
> +            break;
> +
> +        /* mask is intersection of cpu_hard_affinity and cpupool and 
> priv->cpus */
> +        online = cpupool_scheduler_cpumask(iter_svc->vcpu->domain->cpupool);
> +        cpumask_and(&cpu_common, online, &prv->cpus);
> +        cpumask_and(&cpu_common, &cpu_common, 
> iter_svc->vcpu->cpu_hard_affinity);
> +        cpumask_and(&cpu_common, &mask, &cpu_common);
> +        if ( cpumask_empty(&cpu_common) )
> +            continue;
> +
I remember asking this in
http://lists.xen.org/archives/html/xen-devel/2014-07/msg01624.html:

"What's in priv->cpus, BTW? How is it different form the cpupool online
mask (returned in 'online' by cpupool_scheduler_cpumask() )?"

while I don't remember having received an answer, and I see it's still
here in the code. Am I missing something? If not, can you explain?

> +/*
> + * Update vcpu's budget and sort runq by insert the modifed vcpu back to runq
> + * lock is grabbed before calling this function 
> + */
> +static void
> +__repl_update(const struct scheduler *ops, s_time_t now)
> +{
> +    struct list_head *runq = RUNQ(ops);
> +    struct list_head *iter;
> +    struct list_head *tmp;
> +    struct rt_vcpu *svc = NULL;
> +
> +    s_time_t diff;
> +    long count;
> +
> +    list_for_each_safe(iter, tmp, runq) 
> +    {
> +        svc = __runq_elem(iter);
> +
> +        /* not update flag_vcpu's budget */
> +        if(svc->sdom == NULL)
> +            continue;
> +
> +        diff = now - svc->cur_deadline;
> +        if ( diff > 0 ) 
> +        {
> +            count = (diff/MICROSECS(svc->period)) + 1;
> +            svc->cur_deadline += count * MICROSECS(svc->period);
> +            svc->cur_budget = svc->budget * 1000;
> +            __runq_remove(svc);
> +            __runq_insert(ops, svc);
> +        }
> +    }
> +}
> +
Ok. As said when reviewing the RFC for this, I see a lot of room for
optimization here (event based replenishments, using timers). However,
I'd be fine with such optimizations happening later on, e.g., if this
makes it into 4.5, during the 4.6 cycle.

All I said before about time units and conversions applies here too.
Actually, as I also already said in
http://lists.xen.org/archives/html/xen-devel/2014-07/msg01624.html ,
I think you can wrap this functionality in a helper function, and call
that from here and from all the other places that right now do the same
things.
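Something like the following is what I have in mind (a rough sketch:
rt_update_deadline is a name I'm making up, the struct is simplified, and
I'm assuming a single time unit, nanoseconds, is used throughout, so the
MICROSECS()/1000 conversions disappear):

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t s_time_t;

/* Illustrative, simplified vcpu state; the real struct rt_vcpu has more */
struct rt_vcpu_sketch {
    s_time_t cur_deadline; /* ns */
    s_time_t cur_budget;   /* ns */
    s_time_t period;       /* ns */
    s_time_t budget;       /* ns */
};

/*
 * Move cur_deadline forward past 'now' by whole periods and refill the
 * budget. Callable from __repl_update(), rt_vcpu_wake(), etc., instead
 * of open-coding the same computation in each of those places.
 */
static void rt_update_deadline(struct rt_vcpu_sketch *svc, s_time_t now)
{
    s_time_t count;

    if ( now < svc->cur_deadline )
        return;

    count = (now - svc->cur_deadline) / svc->period + 1;
    svc->cur_deadline += count * svc->period;
    svc->cur_budget = svc->budget;
}
```

Then __repl_update() and rt_vcpu_wake() shrink to a guard plus one call,
and there's only one place to get the unit conversions right.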

> +/* 
> + * schedule function for rt scheduler.
> + * The lock is already grabbed in schedule.c, no need to lock here 
> + */
> +static struct task_slice
> +rt_schedule(const struct scheduler *ops, s_time_t now, bool_t 
> tasklet_work_scheduled)
> +{
> +    const int cpu = smp_processor_id();
> +    struct rt_private * prv = RT_PRIV(ops);
> +    struct rt_vcpu * const scurr = RT_VCPU(current);
> +    struct rt_vcpu * snext = NULL;
> +    struct task_slice ret = { .migrated = 0 };
> +
> +    /* clear ticked bit now that we've been scheduled */
                tickled

> +/*
> + * Pick a vcpu on a cpu to kick out to place the running candidate
> + * Called by wake() and context_saved()
> + * We have a running candidate here, the kick logic is:
> + * Among all the cpus that are within the cpu affinity
> + * 1) if the new->cpu is idle, kick it. This could benefit cache hit
> + * 2) if there are any idle vcpu, kick it.
> + * 3) now all pcpus are busy, among all the running vcpus, pick lowest 
> priority one
> + *    if snext has higher priority, kick it.
> + *
> + * TODO:
> + * 1) what if these two vcpus belongs to the same domain?
> + *    replace a vcpu belonging to the same domain introduces more overhead
> + *
> + * lock is grabbed before calling this function 
> + */
> +static void
> +runq_tickle(const struct scheduler *ops, struct rt_vcpu *new)
> +{
> +    struct rt_private * prv = RT_PRIV(ops);
> +    struct rt_vcpu * latest_deadline_vcpu = NULL;    /* lowest priority 
> scheduled */
> +    struct rt_vcpu * iter_svc;
> +    struct vcpu * iter_vc;
> +    int cpu = 0, cpu_to_tickle = 0;
> +    cpumask_t not_tickled;
> +    cpumask_t *online;
> +
> +    if ( new == NULL || is_idle_vcpu(new->vcpu) ) 
> +        return;
> +
> +    online = cpupool_scheduler_cpumask(new->vcpu->domain->cpupool);
> +    cpumask_and(&not_tickled, online, &prv->cpus);
> +    cpumask_and(&not_tickled, &not_tickled, new->vcpu->cpu_hard_affinity);
> +    cpumask_andnot(&not_tickled, &not_tickled, &prv->tickled);
> +
What I said about prv->cpus applies here too...

> +/* 
> + * Should always wake up runnable vcpu, put it back to RunQ. 
> + * Check priority to raise interrupt 
> + * The lock is already grabbed in schedule.c, no need to lock here 
> + * TODO: what if these two vcpus belongs to the same domain? 
> + */
> +static void
> +rt_vcpu_wake(const struct scheduler *ops, struct vcpu *vc)
> +{
> +    struct rt_vcpu * const svc = RT_VCPU(vc);
> +    s_time_t diff;
> +    s_time_t now = NOW();
> +    long count = 0;
> +    struct rt_private * prv = RT_PRIV(ops);
> +    struct rt_vcpu * snext = NULL;        /* highest priority on RunQ */
> +
> +    BUG_ON( is_idle_vcpu(vc) );
> +
> +    if ( unlikely(curr_on_cpu(vc->processor) == vc) ) 
> +        return;
> +
> +    /* on RunQ, just update info is ok */
> +    if ( unlikely(__vcpu_on_runq(svc)) ) 
> +        return;
> +
> +    /* If context hasn't been saved for this vcpu yet, we can't put it on
> +     * the Runqueue. Instead, we set a flag so that it will be put on the 
> Runqueue
> +     * After the context has been saved. */
> +    if ( unlikely(test_bit(__RT_scheduled, &svc->flags)) ) 
> +    {
> +        set_bit(__RT_delayed_runq_add, &svc->flags);
> +        return;
> +    }
> +
> +    /* update deadline info */
> +    diff = now - svc->cur_deadline;
> +    if ( diff >= 0 ) 
> +    {
> +        count = ( diff/MICROSECS(svc->period) ) + 1;
> +        svc->cur_deadline += count * MICROSECS(svc->period);
> +        svc->cur_budget = svc->budget * 1000;
> +    }
> +
Time units and helper function.

> +    __runq_insert(ops, svc);
> +    __repl_update(ops, now);
> +    snext = __runq_pick(ops, prv->cpus);    /* pick snext from ALL valid 
> cpus */
> +    runq_tickle(ops, snext);
> +
> +    return;
> +}

> --- a/xen/common/schedule.c
> +++ b/xen/common/schedule.c
> @@ -69,6 +69,7 @@ static const struct scheduler *schedulers[] = {
>      &sched_credit_def,
>      &sched_credit2_def,
>      &sched_arinc653_def,
> +    &sched_rt_def,
>
rt_ds?

So, overall, I think this is in a reasonably good state. There are some
issues, but mostly not critical stuff, easy to solve in another round or
two.

One thing I feel like asking, this time, is if you can make sure to
address all the points raised during the review... I know it's easy to
miss/forget things, especially in long and complex pieces of code, like
in this patch, but please, try not to. :-)

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

Attachment: signature.asc
Description: This is a digitally signed message part

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 

