
Re: [Xen-devel] [PATCH v4] xen: sched: convert RTDS from time to event driven model





On 2/2/2016 10:08 AM, Dario Faggioli wrote:
On Sun, 2016-01-31 at 23:32 -0500, Tianyang Chen wrote:
v4 is meant for discussion on the addition of replq.

In which case, you should always put something like RFC in the
subject, so people know what the intent is even by just skimming their
inboxes.

Got it.
Changes since v3:
        removed running queue.
        added repl queue to keep track of repl events.
        timer is now per scheduler.
        timer is init on a valid cpu in a cpupool.

Bugs to be fixed: Cpupool and locks. When a pcpu is removed from a
pool and added to another, the lock equality assert in free_pdata()
fails when Pool-0 is using rtds.

Known issue. I will fix it for both Credit2 and RTDS, so just ignore
it.


Yep. I was wondering if I should fix this before sending out v4.

diff --git a/xen/common/sched_rt.c b/xen/common/sched_rt.c
index 2e5430f..c36e5de 100644
--- a/xen/common/sched_rt.c
+++ b/xen/common/sched_rt.c
@@ -16,6 +16,7 @@
  #include <xen/delay.h>
  #include <xen/event.h>
  #include <xen/time.h>
+#include <xen/timer.h>
  #include <xen/perfc.h>
  #include <xen/sched-if.h>
  #include <xen/softirq.h>
@@ -87,7 +88,7 @@
  #define RTDS_DEFAULT_BUDGET     (MICROSECS(4000))

  #define UPDATE_LIMIT_SHIFT      10
-#define MAX_SCHEDULE            (MILLISECS(1))
+

Unrelated to the topic. Can we have a dedicated patch that adds a
comment above UPDATE_LIMIT_SHIFT, explaining what it is and how it's
used?

This is something low priority, of course.


Sure.

@@ -160,6 +166,7 @@ struct rt_private {
   */
  struct rt_vcpu {
      struct list_head q_elem;    /* on the runq/depletedq list */
+    struct list_head p_elem;    /* on the repl event list */

Mmm... 'p_elem' because the previous one was 'q_elem', and 'p' follows
'q', I guess. It feels a bit obscure, though, and I'd prefer a more
'talking' name.

For now, and at this level, I guess I can live with it. But in other
places, things need to be improved (see below).

@@ -213,8 +220,14 @@ static inline struct list_head *rt_depletedq(const struct scheduler *ops)
      return &rt_priv(ops)->depletedq;
  }

+static inline struct list_head *rt_replq(const struct scheduler *ops)
+{
+    return &rt_priv(ops)->replq;
+}
+
  /*
- * Queue helper functions for runq and depletedq
+ * Queue helper functions for runq, depletedq
+ * and repl event q
   */

So, in comments, can we use full words as much as possible. It makes
them a lot easier to read (so, e.g., "replenishment events queue" or
"queue of the replenishment events").

I know this very file's style is not like that, from this point of
view... and, in fact, this is something I would like to change, when we
get the chance (i.e., when touching the code for other reasons, like in
this case).

@@ -228,6 +241,18 @@ __q_elem(struct list_head *elem)
      return list_entry(elem, struct rt_vcpu, q_elem);
  }

+static struct rt_vcpu *
+__p_elem(struct list_head *elem)
+{
+    return list_entry(elem, struct rt_vcpu, p_elem);
+}
+
So, while this may well still be fine...

+static int
+__vcpu_on_p(const struct rt_vcpu *svc)
+{
+   return !list_empty(&svc->p_elem);
+}
+

...here I really would like to go for something that will make it much
more obvious, just by looking at the code, which queue we are actually
checking the vcpu to be on, without having to remember that p is the
replenishment queue.

So, I don't know, maybe something like vcpu_on_replq(), or
vcpu_needs_replenish() ?

@@ -387,6 +412,13 @@ __q_remove(struct rt_vcpu *svc)
          list_del_init(&svc->q_elem);
  }

+static inline void
+__p_remove(struct rt_vcpu *svc)
+{
+    if ( __vcpu_on_p(svc) )
+        list_del_init(&svc->p_elem);
+}
+
replq_remove(), or vcpu_repl_cancel() (or something else) ?

@@ -421,6 +453,32 @@ __runq_insert(const struct scheduler *ops, struct rt_vcpu *svc)
  }

  /*
+ * Insert svc into the repl event list:
+ * vcpus that need to be replenished earlier go first.
+ */
+static void
+__replq_insert(const struct scheduler *ops, struct rt_vcpu *svc)
+{

This is exactly what I've been trying to say above: __replq_insert() is
ok, much better than __q_insert()! Do the same everywhere else. :-)


Indeed, some of the names are confusing the first time I looked at RTDS.

+    struct rt_private *prv = rt_priv(ops);
+    struct list_head *replq = rt_replq(ops);
+    struct list_head *iter;
+
+    ASSERT( spin_is_locked(&prv->lock) );
+
Is this really useful? Considering where this is called, I don't think
it is.


This basically comes from __q_insert(). I have been searching for the definitions of vcpu_schedule_lock_irq() and pcpu_schedule_lock_irq(). Do you know where they are defined? I did a name search in the entire xen directory but still found nothing.

+    ASSERT( !__vcpu_on_p(svc) );
+
+    list_for_each(iter, replq)
+    {
+        struct rt_vcpu * iter_svc = __p_elem(iter);
+        if ( svc->cur_deadline <= iter_svc->cur_deadline )
+                break;
+    }
+
+    list_add_tail(&svc->p_elem, iter);
+}
This looks ok. The list_for_each()+list_add_tail() part would be
basically identical to what we have inside __runq_insert(), though,
wouldn't it?

It may make sense to define an _ordered_queue_insert() (or
_deadline_queue_insert(), since it's always the deadline that is
compared) utility function that we can then call in both cases.
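
Something along these lines, for instance (just a sketch; the exact
signature, and whether to pass an accessor function, is of course up
for discussion):

    /* Sketch of a common deadline-ordered insert helper. The caller
     * passes an accessor that turns a list_head back into its rt_vcpu,
     * so the same code works for both q_elem and p_elem based queues.
     */
    static void
    deadline_queue_insert(struct rt_vcpu * (*qelem)(struct list_head *),
                          struct rt_vcpu *svc, struct list_head *elem,
                          struct list_head *queue)
    {
        struct list_head *iter;

        list_for_each ( iter, queue )
        {
            struct rt_vcpu *iter_svc = qelem(iter);

            if ( svc->cur_deadline <= iter_svc->cur_deadline )
                break;
        }
        list_add_tail(elem, iter);
    }

    /* __replq_insert() would then reduce to something like:
     *     deadline_queue_insert(__p_elem, svc, &svc->p_elem, replq);
     * and __runq_insert() could use it for its runq part as well.
     */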

@@ -449,6 +507,7 @@ rt_init(struct scheduler *ops)
      INIT_LIST_HEAD(&prv->sdom);
      INIT_LIST_HEAD(&prv->runq);
      INIT_LIST_HEAD(&prv->depletedq);
+    INIT_LIST_HEAD(&prv->replq);

Is there a reason why you don't do at least the allocation of the timer
here? Because, to me, this looks like the best place for that.

      cpumask_clear(&prv->tickled);

@@ -473,6 +532,9 @@ rt_deinit(const struct scheduler *ops)
          xfree(_cpumask_scratch);
          _cpumask_scratch = NULL;
      }
+
+    kill_timer(prv->repl_timer);
+
      xfree(prv);

And here, you're freeing prv without having freed prv->repl_timer, which
is therefore leaked.
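
E.g., something like this in rt_deinit() (a sketch only, assuming the
timer may not have been allocated at all, as in the current version):

    if ( prv->repl_timer != NULL )
    {
        kill_timer(prv->repl_timer);
        xfree(prv->repl_timer);
    }

    xfree(prv);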

@@ -632,6 +699,17 @@ rt_vcpu_insert(const struct scheduler *ops, struct vcpu *vc)
      vcpu_schedule_unlock_irq(lock, vc);

      SCHED_STAT_CRANK(vcpu_insert);
+
+    if( prv->repl_timer == NULL )
+    {
+        /* vc->processor has been set in schedule.c */
+        cpu = vc->processor;
+
+        repl_timer = xzalloc(struct timer);
+
+        prv->repl_timer = repl_timer;
+        init_timer(repl_timer, repl_handler, (void *)ops, cpu);
+    }
  }

This belongs in rt_init(), IMO.


When a new pool is created, the corresponding scheduler is also initialized, but there are no pcpus assigned to that pool yet. If init_timer() happens in rt_init(), how does it know which cpu to pick? By the time move_domain() is called, there must be some pcpus in the destination pool, and that is when vcpu_insert happens.
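
One possible middle ground (just a sketch of the idea, assuming using
migrate_timer() later is acceptable; not necessarily what the next
version will do) is to allocate and initialise the timer in rt_init()
on a placeholder cpu, and only arm/move it once the pool actually has
pcpus and vcpus:

    /* In rt_init() -- the cpu passed to init_timer() is only a
     * placeholder here; the timer is not armed until a vcpu is
     * inserted. repl_handler would need a forward declaration (or to
     * be defined before rt_init()).
     */
    prv->repl_timer = xzalloc(struct timer);
    if ( prv->repl_timer == NULL )
    {
        xfree(prv);
        return -ENOMEM;
    }
    init_timer(prv->repl_timer, repl_handler, (void *)ops, 0);

    /* Later, e.g. when the first vcpu is inserted, the timer can be
     * moved to a cpu that actually belongs to the pool:
     *     migrate_timer(prv->repl_timer, vc->processor);
     */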

@@ -841,7 +881,6 @@ rt_schedule(const struct scheduler *ops, s_time_t now, bool_t tasklet_work_sched
      /* burn_budget would return for IDLE VCPU */
      burn_budget(ops, scurr, now);

-    __repl_update(ops, now);

__repl_update() going away is of course fine. But we need to enact the
budget enforcement logic somewhere in here, don't we?


Sorry about that, I meant to add it in this version.
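
For reference, the kind of thing meant by "budget enforcement in
rt_schedule()" is roughly the following (a sketch, reusing the existing
snext/ret names; with MAX_SCHEDULE gone, rt_schedule() itself has to
make sure it is invoked again when the picked vcpu exhausts its budget):

    /* At the tail of rt_schedule(), instead of ret.time = MAX_SCHEDULE */
    if ( !is_idle_vcpu(snext->vcpu) )
        ret.time = snext->cur_budget; /* deschedule when the budget is gone */
    else
        ret.time = -1;                /* negative means no limit */
    ret.task = snext->vcpu;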

@@ -924,6 +967,10 @@ rt_vcpu_sleep(const struct scheduler *ops, struct vcpu *vc)
          __q_remove(svc);
      else if ( svc->flags & RTDS_delayed_runq_add )
          clear_bit(__RTDS_delayed_runq_add, &svc->flags);
+
+    /* stop tracking the repl time of this vcpu */
+   /* if( __vcpu_on_p(svc) )
+       __p_remove(svc); */
  }

Is it ok to kill the replenishment in this case?

This is a genuine question. What does the dynamic DS algorithm that
you're trying to implement here says about this? Meng, maybe you can
help here?

Is it ok to do this _because_ you then handle, in rt_vcpu_wake(), the
situation of a replenishment that should have happened while the vcpu
was asleep (the '(now >= svc->cur_deadline)' thing)? Yes... it probably
is ok for that reason...


I had some discussions with Meng earlier and we haven't settled on this yet. But yes, sleeping vcpus will be replenished in wake().

@@ -1027,23 +1074,21 @@ rt_vcpu_wake(const struct scheduler *ops, struct vcpu *vc)
      struct rt_vcpu * const svc = rt_vcpu(vc);
      s_time_t now = NOW();
      struct rt_private *prv = rt_priv(ops);
-    struct rt_vcpu *snext = NULL; /* highest priority on RunQ */
-    struct rt_dom *sdom = NULL;
-    cpumask_t *online;
+    struct timer *repl_timer = prv->repl_timer;

      BUG_ON( is_idle_vcpu(vc) );

      if ( unlikely(curr_on_cpu(vc->processor) == vc) )
      {
          SCHED_STAT_CRANK(vcpu_wake_running);
-        return;
+        goto out;
      }

      /* on RunQ/DepletedQ, just update info is ok */
      if ( unlikely(__vcpu_on_q(svc)) )
      {
          SCHED_STAT_CRANK(vcpu_wake_onrunq);
-        return;
+        goto out;
      }

      if ( likely(vcpu_runnable(vc)) )

Mmm.. no. At least the first two cases, they should really just be
'return'-s.

As you say yourself in the comment further down, right below the 'out:'
label "a newly waken-up vcpu could xxx". If this is running on a pCPU,
or it is in a runqueue already, it is not a newly woken-up vcpu.

In fact, we need to check for these cases, but let's at least make them
act like NOPs, as all the other schedulers do, and as is correct.

@@ -1058,25 +1103,39 @@ rt_vcpu_wake(const struct scheduler *ops, struct vcpu *vc)
      if ( unlikely(svc->flags & RTDS_scheduled) )
      {
          set_bit(__RTDS_delayed_runq_add, &svc->flags);
-        return;
+        goto out;
      }

In this case, I'm less sure. I think this should just return as well,
but it may depend on how other things are done (like budget
enforcement), so it's hard to reason about it right now; consider this
when working on the next version.


OK. For the second case, I guess there is nothing preventing a vcpu that sleeps for a long time on the runq from being picked before other vcpus. Also, repl_handler() removes vcpus that are not runnable, and if they are not added back here, where should we start tracking them? This goes back to whether it's necessary to remove a non-runnable vcpu at all.

+    /* budget repl here is needed before inserting back to runq. If so,
+     * it should be taken out of replq and put back. This can potentially
+     * cause the timer handler to replenish no vcpu when it triggers if
+     * the replenishment is done here already.
+     */

Coding style for long comments wants them to have both "wings" (you're
missing the opening one, '/*').

      if ( now >= svc->cur_deadline)
+    {
          rt_update_deadline(now, svc);
+        if( __vcpu_on_p(svc) )
+            __p_remove(svc);
+    }

Well, then maybe the __p_remove() function can be made smart enough to:
  - check whether the one being removed is the first element of the
    queue;
  - if it is, disarm the timer;
  - if there still are elements in the queue, rearm the timer to fire at
    the new first's deadline.
This will be useful at least in another case (rt_vcpu_sleep()).
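
Something like this, for instance (sketch only; the names just follow
the discussion above, and the caller is assumed to hold prv->lock):

    static inline void
    __replq_remove(const struct scheduler *ops, struct rt_vcpu *svc)
    {
        struct rt_private *prv = rt_priv(ops);
        struct list_head *replq = rt_replq(ops);
        struct timer *repl_timer = prv->repl_timer;

        if ( !__vcpu_on_p(svc) )
            return;

        if ( replq->next == &svc->p_elem )  /* svc is the current head */
        {
            list_del_init(&svc->p_elem);

            if ( !list_empty(replq) )       /* rearm on the new head */
                set_timer(repl_timer, __p_elem(replq->next)->cur_deadline);
            else                            /* nothing left: disarm */
                stop_timer(repl_timer);
        }
        else
            list_del_init(&svc->p_elem);
    }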

      /* insert svc to runq/depletedq because svc is not in queue now */
      __runq_insert(ops, svc);

-    __repl_update(ops, now);
-
-    ASSERT(!list_empty(&prv->sdom));
-    sdom = list_entry(prv->sdom.next, struct rt_dom, sdom_elem);
-    online = cpupool_domain_cpumask(sdom->dom);
-    snext = __runq_pick(ops, online); /* pick snext from ALL valid cpus */
-
-    runq_tickle(ops, snext);
+    runq_tickle(ops, svc);

-    return;
+out:
+    /* a newly waken-up vcpu could have an earlier release time
+     * or it could be the first to program the timer
+     */
+    if( repl_timer->expires == 0 || repl_timer->expires > svc->cur_deadline )
+        set_timer(repl_timer,svc->cur_deadline);
+
And again, can't this logic be part of __replq_insert()? I.e.:
  - do the insertion
  - if the inserted element is the new head of the queue, reprogram the
    timer


Yeah, I agree that insert() and remove() can handle some of that.
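
For instance, the insert side could look roughly like this (a sketch
built on the __replq_insert() above, not the final code):

    static void
    __replq_insert(const struct scheduler *ops, struct rt_vcpu *svc)
    {
        struct rt_private *prv = rt_priv(ops);
        struct list_head *replq = rt_replq(ops);
        struct list_head *iter;

        ASSERT( !__vcpu_on_p(svc) );

        list_for_each ( iter, replq )
        {
            struct rt_vcpu *iter_svc = __p_elem(iter);

            if ( svc->cur_deadline <= iter_svc->cur_deadline )
                break;
        }
        list_add_tail(&svc->p_elem, iter);

        /* svc became the new earliest replenishment: move the timer up */
        if ( replq->next == &svc->p_elem )
            set_timer(prv->repl_timer, svc->cur_deadline);
    }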

+    /* start tracking the repl time of this vcpu here
+    * it stays on the replq unless it goes to sleep
+    * or marked as not-runnable
+    */
+    if( !__vcpu_on_p(svc) )
+       __replq_insert(ops, svc);
  }

@@ -1168,6 +1216,80 @@ rt_dom_cntl(
      return rc;
  }

+/* The replenishment timer handler picks vcpus
+ * from the replq and does the actual replenishment
+ * the replq only keeps track of runnable vcpus
+ */
+static void repl_handler(void *data){
+    unsigned long flags;
+    s_time_t now = NOW();
+    s_time_t t_next = LONG_MAX; /* the next time timer should be triggered */
+    struct scheduler *ops = data;
+    struct rt_private *prv = rt_priv(ops);
+    struct list_head *replq = rt_replq(ops);
+    struct timer *repl_timer = prv->repl_timer;
+    struct list_head *iter, *tmp;
+    struct rt_vcpu *svc = NULL;
+
+    stop_timer(repl_timer);
+
+    spin_lock_irqsave(&prv->lock, flags);
+
+    list_for_each_safe(iter, tmp, replq)
+    {
+        svc = __p_elem(iter);
+
+        if ( now >= svc->cur_deadline )
+        {
+            rt_update_deadline(now, svc);
+
+            if( t_next > svc->cur_deadline )
+                t_next = svc->cur_deadline;
+
Why do you need to do this? The next time at which the timer needs to
be reprogrammed is, after all the replenishments due in this loop have
been performed, simply the deadline of the new head of the
replenishment queue (if there still are elements in there).


Agree.
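
I.e., the end of the handler could be as simple as this (sketch):

    /* After all the due replenishments have been performed and the
     * serviced vcpus re-inserted in deadline order, re-arm the timer
     * on the new queue head, or leave it off if nothing is queued.
     */
    if ( !list_empty(replq) )
        set_timer(repl_timer, __p_elem(replq->next)->cur_deadline);
    /* else: keep the timer off until a wake-up queues something again */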

+            /* when the replenishment happens
+             * svc is either on a pcpu or on
+             * runq/depletedq
+             */
+            if( __vcpu_on_q(svc) )
+            {
+                /* put back to runq */
+                __q_remove(svc);
+                __runq_insert(ops, svc);
+                runq_tickle(ops, svc);
+            }
+
Aha! So, it looks like the budget enforcement logic ended up here? Mmm...
no, I think that doing it in rt_schedule(), as we agreed, would make
things simpler and better looking, both here and there.

This is the replenishment timer handler, and it really should do one
thing: handle replenishment events. That's it! ;-P


Oh, I might be misunderstanding what should be put in rt_schedule(). Was it specifically for handling a vcpu that shouldn't be taken off a core?
Why are you referring to this piece of code as budget enforcement?

Isn't it necessary to modify the runq here after budget replenishment has happened? If the runq goes out of order here, will rt_schedule() be sorting it?

+            /* resort replq, keep track of this
+             * vcpu if it's still runnable. If
+             * at this point no vcpu are, then
+             * the timer waits for wake() to
+             * program it.
+             */
+            __p_remove(svc);
+
+            if( vcpu_runnable(svc->vcpu) )
+                __replq_insert(ops, svc);
+        }
+
+        else
+        {
+            /* Break out of the loop if a vcpu's
+             * cur_deadline is bigger than now.
+             * Also when some replenishment is done
+             * in wake(), the code could end up here
+             * if the time fires earlier than it should
+            */
+            if( t_next > svc->cur_deadline )
+                t_next = svc->cur_deadline;
+            break;
+        }
+    }
+
+    set_timer(repl_timer, t_next);
+
Well, what if there are no more replenishments to be performed? You're
still reprogramming the timer, either at t_next or at cur_deadline,
which looks arbitrary.

If there are no replenishments, let's just keep the timer off.


The timer fires earlier than it should when some replenishment is done in wake(). In that situation, no replenishment will be done here and the timer will not be programmed again until a vcpu wakes up. But what if all the vcpus wake up and get replenished before the timer fires? If the timer were only programmed here, it would rely solely on wake() to re-arm it. If none of the vcpus goes to sleep afterwards, but they have all replenished their budget, nobody would ever do the replenishment again. That's why I set the default value of t_next to the deadline of the first svc on the replq.

Thanks,

Tianyang

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

