
Re: [Xen-devel] [PATCH] Modified RTDS scheduler to use an event-driven model instead of polling.



I'm not replying inline because this is a more general response that
is not limited
to any particular comment.

Separating the replenishment from the scheduler may be problematic. The
nature of the gEDF scheduler is that a priority change should be reflected
instantly. If replenishments were done by themselves, the scheduler might
decide to wait for some period of time, and during that same period a
replenishment could change what should be executing. Either the gEDF
policy gets violated, or the replenishment routine has to notify the
scheduler after every replenishment, requiring some form of interaction
and possibly more scheduler invocations than the current structure. The
other option is for the scheduler to check for replenishments, as it does
now; without any interaction, a violation of the policy could ensue. gEDF
is not like the credit scheduler, which has an accounting period during
which the policy may be violated temporarily. That model is much easier in
the credit scheduler because there is a fixed slice. Imagine we wanted an
accounting period of 0: the only option would be to recompute the
replenishments before any scheduling decision and to run the scheduler
exactly when those replenishments occur. In other words, replenishments
and scheduling would have to be coalesced, which is essentially what the
current RTDS patch is doing.
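
To make the point concrete, here is a toy, self-contained C model of the
coalesced approach (this is not Xen code; all names here are made up for
illustration): a single pass applies whatever is already due and returns
the delay until the earliest future event, so one wake-up drives both the
replenishment and the scheduling decision. This is essentially what
__timerq_next() in the patch computes over the timerq.

    /* Toy model of coalesced replenishment + scheduling (not Xen code).
     * One pass applies any event that is already due, then returns how
     * far away the earliest *future* event is, so one timer serves both. */
    #include <stdio.h>

    typedef long long s_time_t;

    #define PERIOD        100   /* pretend replenishment period           */
    #define MAX_SCHEDULE 10000  /* fallback wake-up if nothing is pending */

    struct toy_vcpu {
        s_time_t budget_out;    /* when the current budget runs dry       */
        s_time_t replenish_at;  /* when the next replenishment is due     */
    };

    static s_time_t next_wakeup(struct toy_vcpu *v, s_time_t now)
    {
        s_time_t next = MAX_SCHEDULE;

        /* Service a replenishment that is already due (this is what can
         * change which vcpu has the highest priority)... */
        while ( v->replenish_at <= now )
            v->replenish_at += PERIOD;

        /* ...then wake up again at the earliest future event. */
        if ( v->budget_out > now && v->budget_out - now < next )
            next = v->budget_out - now;
        if ( v->replenish_at - now < next )
            next = v->replenish_at - now;

        return next;
    }

    int main(void)
    {
        struct toy_vcpu v = { .budget_out = 40, .replenish_at = 100 };

        /* Budget exhaustion (t=40) comes before the replenishment (t=100),
         * so the next scheduling point is 40 time units away. */
        printf("wake again in %lld\n", next_wakeup(&v, 0));
        return 0;
    }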

Further, with two timer routines we would need to deal with ordering and
preemption between them (or add logic to prevent/order such), which I
would argue is more complexity than handling both at once, as is done now.
Being able to reason about the execution with a single thread in mind is
easier for most people than reasoning about several. If both timers are
armed for the same time, which should run first? The scheduler, before a
possibly applicable replenishment? Or the replenishment always first?
Either way, extra logic is needed to enforce the ordering and prevent
preemption.
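
For comparison, my understanding of the separate-timer alternative is
roughly the sketch below. Only init_timer(), set_timer(), NOW(),
smp_processor_id() and cpu_raise_softirq() are existing Xen interfaces;
the handler and the replenishment helper are invented names, and this is
only to frame where the ordering/preemption questions between this handler
and rt_schedule() come in, not a proposal.

    /* Rough sketch of a dedicated replenishment timer (hypothetical names). */
    static struct timer repl_timer;

    static void repl_timer_handler(void *data)
    {
        const struct scheduler *ops = data;

        /* Apply all replenishments that are due; hypothetical helper that
         * would also need the global scheduler lock, and that returns the
         * absolute time of the next earliest replenishment. */
        s_time_t next = replenish_due_vcpus(ops, NOW());

        /* A replenishment may have changed the highest-priority vcpu, so
         * the scheduler must be invoked right away or gEDF can be violated. */
        cpu_raise_softirq(smp_processor_id(), SCHEDULE_SOFTIRQ);

        /* Re-arm for the next earliest replenishment. */
        set_timer(&repl_timer, next);
    }

    /* Somewhere during initialization, e.g. in rt_init():
     *     init_timer(&repl_timer, repl_timer_handler, (void *)ops, some_cpu);
     */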

Because of this, I believe that, given the nature of the gEDF policy, the
current solution is better, mostly because it appears to be the least
complex approach that is functionally correct. The gEDF policy just isn't
well suited to decoupling events, as the events are highly dependent on
one another, and particularly dependent at an instantaneous timescale. A
gEDF scheduler with a similar "accounting period", during which the policy
could be violated, would be a hard pitch; that is blasphemy in the
real-time world. Any other way of implementing the policy correctly would
bring more headaches, because of the multiple-execution mindset and the
enforcement and interaction logic it would require.

It's also worth noting that the stereotypical textbook event-driven model
is what we have now: a single event loop that handles events. In this case
the scheduler is, conceptually, the main loop, and this makes perfect
sense: it's deciding what to run (think of the running VCPUs as callback
functions and the abstraction falls into place). The event loop needs some
information (about replenishments) before deciding, and collecting it is
simply part of decoding the event and making the decision.

Not only is complexity a concern; as mentioned before, this may also be
the best-performing option, since it makes the most information available
and allows the best possible timing decision. Adding a few extra cycles to
a hot path, even with a lock held, is worth it if scheduling and context
switching happen less often. I'm sure the pre-change scheduler path takes
far longer than the time we've added, and the proposed model with more
timers may require just as much added time, plus the aforementioned
complexity, or it would suffer from less intelligent decisions (hence more
scheduler invocations and/or policy violations) as dependence and
cooperation between the timers decreased.

For the above reasons I think you should reconsider the current
implementation, as it appears to be the least complex solution and the
easiest to reason about. This is not to say that the current patch doesn't
have issues; we would still aim to fix the concerns about multiple
softirqs and CPU tickles.

Let me know if I'm missing some key insight into how this behavior could
be implemented correctly and cleanly using the multiple-timer approach. I
simply don't see how it can be done without heavy interaction and
information sharing between the timers, which really defeats the purpose
of separating them.

Regards,
~Dagaen

On Sat, Jun 13, 2015 at 4:33 PM, Dagaen Golomb <dgolomb@xxxxxxxxxxxxxx> wrote:
>> No HTML, please.
>
> Got it, sorry.
>
>>>         And note that, when I say "timer", I mean an actual Xen timer,
>>>         i.e.,
>>>         those things that are started, stopped, and with a timer
>>>         handling
>>>         routine being called when they expire. For an example, you can
>>>         have a
>>>         look, in sched_credit.c, at 'ticker' or 'master_ticker'.
>>>
>>>
>>> I will look at this. However, the current solution is likely
>>> functionally equivalent and, with only one timer and a single list
>>> used to arm it, I'm not sure if using a Xen timer is necessary.
>>>
>> It is functionally equivalent, by chance (see the issue about calling
>> runq_tickle() on current vcpus in my reply to Meng). The reason why a
>> different approach is required is making the code look better, reducing
>> (instead of increasing) the complexity of sched_rt.c, lowering the
>> overhead caused by long running operations performed while holding the
>> global scheduler spin lock, and improving scalability.
>>
>>> Do they incur any extra overhead or coarsen granularity?
>>>
>> "extra overhead or coarsen granularity" as compared to what? s_timer in
>> schedule.c (which is what you're using) is one of them already!
>>
>> What I meant by "an actual timer" is that you should add a new one, and
>> move some of the stuff currently being done in rt_schedule() in its
>> handling routine, rather than just adding a new queue of events to be
>> serviced by abusing s_timer.
>
> Okay, I was wondering what you meant by "an actual timer."
>
>>
>>>         Deciding between a) or b) isn't easy, without actually trying
>>>         to
>>>         implement them. I personally prefer b), and I think it would
>>>         be a
>>>         superior long term solution (see right below), but given the
>>>         current
>>>         architecture of the RTDS scheduler (e.g., the fact that it has
>>>         a global
>>>         runq, etc), I won't nack a), which would most likely be
>>>         simpler.
>>>
>>>
>>> I agree b) may well be better in the long run, but with the current
>>> architecture a) is simpler. b) can be future work as the scheduler
>>> evolves.
>>>
>> Sure. And in fact, I'm fine with a), if done properly.
>>>
>>>         With this patch, you're still using rt_schedule() to do both
>>>         scheduling
>>>         and replenishments, although you're now doing it at actual
>>>         replenishment
>>>         times, instead of checking for them every millisecond.
>>
>>>
>>> I think once you understand that the timerq is not only
>>> replenishments, but any event that could cause a branch in the
>>> scheduler behavior, it becomes more palatable to have them
>>> intertwined.
>>>
>> I got that, and I'm asking you to do it differently.
>>
>>> Really, the timerq and scheduling aren't as intertwined as they
>>> appear; they are practically isolated functionally. It's just that the
>>> timerq is most naturally serviced during scheduling events, when data
>>> that affects scheduling decisions is changing. It's straightforward and
>>> efficient.
>>>
>> Yeah, replenishments are 'naturally serviced' in a 'straightforward and
>> efficient' way by looping on all runnable vcpus, in more than one place,
>> from within super-hot paths, with the global scheduler spin lock held.
>> Neat! :-/
>>
>> Oh, and that's before your introduction of another list to be taken care
>> of, still under the same conditions. :-O
>
> The paths are already hot, so adding a small amount of extra work is a small
> percentage increase. I'm usually against this kind of thing, too, but
> separating it out to another timer, while runnable independently, may
> actually be more work if it needs to use some of the same maintenance
> behavior. I guess the main argument against is maintainability.
>
>>
>>>         Also, the bits and pieces that you need to add in order to
>>>         deal with
>>>         this new queue is, really, making things even more complex
>>>         than they are
>>>         now, which is undesirable.
>>
>>> While it does add some complexity, I don't feel a single queue and
>>> managing it is that much extra complexity.
>>>
>> But in fact the point of making the scheduler 'event driven' is to take
>> advantage of the chances that this offers for getting stuff *out* of
>> rt_schedule(), and the purpose is not "not adding that much extra
>> complexity", is making it _simpler_.
>
> I understand where you are coming from now. I was looking at it mostly
> from the perspective of performance. This explains our differences.
>
>>
>>
>>>         > diff --git a/xen/common/sched_rt.c b/xen/common/sched_rt.c
>>>         > index 7c39a9e..25f0458 100644
>>>         > --- a/xen/common/sched_rt.c
>>>         > +++ b/xen/common/sched_rt.c
>>>
>>>         > @@ -156,6 +160,7 @@ struct rt_vcpu {
>>>         >      s_time_t cur_budget;        /* current budget */
>>>         >      s_time_t last_start;        /* last start time */
>>>         >      s_time_t cur_deadline;      /* current deadline for EDF
>>>         */
>>>         > +    s_time_t next_sched_needed; /* next time to make
>>>         scheduling decision */
>>>         >
>>>         As an example of why I said that things should become simpler,
>>>         and are
>>>         instead becoming more complex: with my solution, you don't
>>>         need anything
>>>         like this. In fact, budget enforcement is already being done
>>>         ok in
>>>         rt_schedule(), so no need to do anything more/different about
>>>         it.
>>>         Replenishments should be programmed to fire at cur_deadline,
>>>         so again,
>>>         no need to add this new field, and multiplex its actual value
>>>         between
>>>         budget and deadline (which, yes, counts as being something
>>>         rather
>>>         complex for me! :-D).
>>>
>>>
>>> As mentioned, the timerq is handling all events that could change the
>>> scheduling decision, not just replenishments.
>>>
>> Yes, exactly, it's handling *too much* events. :-)
>>
>> For example, have a look at 'struct vcpu', in
>> xen/include/xen/sched.h. It's got 3 timers in it:
>>
>>     struct timer     periodic_timer;
>>     struct timer     singleshot_timer;
>>     struct timer     poll_timer;    /* timeout for SCHEDOP_poll */
>>
>> It probably would have been possible to just use one, with a single and
>> mixed event queue, as you're doing, a 1k-line handling routine, etc.
>>
>> Do you think it would have been easier or harder to implement, understand,
>> debug and maintain? I bet on harder.
>
> Point taken.
>
>>
>>> This tracks the earliest time anything could cause this vcpu to be
>>> scheduled differently, and could be extended if any more insights are
>>> found. Budget enforcement is being done in rt_schedule, but it's being
>>> done by this very list: a budget expiration is one possible event that
>>> would require a vcpu reschedule.
>>>
>> OMG, 'extended'... You're thinking of actually putting more stuff in
>> there?!? :-O
>>
>> It's a rather key and already too long and complex critical section, so
>> please, let's aim at shortening and making it simpler, i.e., at
>> improving things, not making them worse.
>
> This can again be explained by our goal mismatch. I see where you are
> coming from now.
>
>>>
>>>         > +/*
>>>         > + * Return how far the lowest time on the timerq (that is
>>>         after NOW) is in the
>>>         > + * future.
>>>         > + * Lock must be grabbed before calling this.
>>>         > + */
>>>         > +static s_time_t __timerq_next(const struct scheduler *ops,
>>>         s_time_t now)
>>>         > +{
>>>         > +    struct list_head *timerq = rt_timerq(ops);
>>>         > +    struct list_head *iter, *tmp;
>>>         > +
>>>         > +    list_for_each_safe(iter, tmp, timerq)
>>>         > +    {
>>>         > +        struct rt_vcpu * iter_svc = __t_elem(iter);
>>>         > +
>>>         > +        if ( iter_svc->next_sched_needed > now )
>>>         > +            return (iter_svc->next_sched_needed - now);
>>>         > +        else
>>>         > +            __t_remove(iter_svc);
>>>         > +    }
>>>         > +
>>>         > +    return MAX_SCHEDULE;
>>>         > +}
>>>         >
>>>         If using a timer, you can just get rid of items while, in the
>>>         timer
>>>         routine, you handle the event associated to them.
>>>
>>>         Also, why is MAX_SCHEDULE still there?
>>>
>>>
>>> Honestly, events do not have to be removed here, but it's done to
>>> prevent walking a list of stale values to get to the newest one on the
>>> next call. This is done more for performance. Their non-removal would
>>> be functionally equivalent.
>>>
>> Of course you should remove the stale entries, I certainly was not
>> arguing that the list should just grow indefinitely!
>>
>> Point is, again, that this is another list walk occurring in a hot
>> path, with a big spin lock held. We want to get rid of such thing, not
>> adding more of it.
>>
>>> With the timer alternative you mention, where would the timer events
>>> and their data be held? I think that could be extra complexity, unless
>>> I fail to understand the idea completely.
>>>
>> In an event queue like yours, of course. But you won't go through it
>> and/or bookkeep it while in hot paths, with the scheduler lock held.
>>
>> See my email to Meng to have more details on what I have in mind, or
>> feel free to ask for more details.
>>
>>> MAX_SCHEDULE may not be required, but I have it there as an "in case."
>>> For example, it will cause the scheduler to be called after
>>> MAX_SCHEDULE even when no events are occurring (such as no vcpus). It's
>>> there to ensure progress in case of any bugs or unexpected behavior.
>>>
>> Wait, so, all this work, and then you still want to call rt_schedule()
>> every millisecond, when there's nothing to do?!?! :-O
>>
>>>         That being said, in this specific case, you're moving up
>>>         runq_tickle(),
>>>         the purpose of which is to trigger a call to rt_schedule()
>>>         (after going
>>>         through the softirq machinery)... in order to call it in
>>>         rt_schedule().
>>>         That's pretty tricky, and is not how tickling should work.
>>
>>> It's set up to only tickle if needed. I'm not sure if this is an issue,
>>>
>> It's wrong, AFAICT. See my reply to Meng and, please, comment by
>> replying to it, if you've got anything to say about this, to make the
>> discussion easier to follow.
>>
>> Regards,
>> Dario
>
> I understand what caused our mismatch now. I think option a) you mentioned
> makes sense given your goals. I am going to look into the details of this
> approach and get back with any questions or concerns.
>
> Regards,
> ~Dagaen
