
Re: [Xen-devel] schedulers and topology exposing questions



On Tue, Feb 02, 2016 at 12:45:00PM +0100, Dario Faggioli wrote:
> On Thu, 2016-01-28 at 22:27 -0500, Konrad Rzeszutek Wilk wrote:
> > On Thu, Jan 28, 2016 at 03:10:57PM +0000, Dario Faggioli wrote:
> > > 
> > > So, may I ask what piece of (Linux) code are we actually talking
> > > about?
> > > Because I had a quick look, and could not find where what you
> > > describe
> > > happens....
> > 
> > udp_recvmsg->__skb_recv_datagram->sock_rcvtimeo->schedule_timeout
> > The sk_rcvtimeo is MAX_SCHEDULE_TIMEOUT by default, but you can
> > alter it for the UDP socket by setting a different timeout.
> > 
> Ha, recvmsg! At some point you mentioned sendmsg, and I was looking
> there and seeing nothing! But yes, it indeed makes sense to consider
> the receiving side... let me have a look...
> 
> So, it looks to me that this is what happens:
> 
>  udp_recvmsg(noblock=0)
>    |
>    ---> __skb_recv_datagram(flags=0) {
>                 timeo = sock_rcvtimeo(flags=0) /* returns sk->sk_rcvtimeo */
>                 do { ... } while (!wait_for_more_packets(timeo));
>                            |
>                            ---> schedule_timeout(timeo)
> 
> So, at least in Linux 4.4, the timeout used is the one defined in
> sk->sk_rcvtimeo, which looks to me to be set via this socket option
> (unless I've followed some link wrong, which can well be the case):
> 
> http://lxr.free-electrons.com/source/include/uapi/asm-generic/socket.h#L31
> #define SO_RCVTIMEO     20
> 
> So there looks to be a timeout. But anyways, let's check
> schedule_timeout().
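
For reference, this is how that timeout ends up in sk->sk_rcvtimeo from
userspace. A minimal sketch (the port number and buffer size here are
arbitrary, just for illustration):

    /* Set a 500ms receive timeout on a UDP socket via SO_RCVTIMEO.
     * Without this, sk->sk_rcvtimeo stays at MAX_SCHEDULE_TIMEOUT
     * and recv() blocks until a packet actually arrives. */
    #include <stdio.h>
    #include <sys/socket.h>
    #include <sys/time.h>
    #include <netinet/in.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        struct timeval tv = { .tv_sec = 0, .tv_usec = 500000 };
        struct sockaddr_in addr = {
            .sin_family = AF_INET,
            .sin_port = htons(5001),            /* arbitrary test port */
            .sin_addr.s_addr = htonl(INADDR_ANY),
        };
        char buf[2048];

        if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0)
            perror("setsockopt");
        bind(fd, (struct sockaddr *)&addr, sizeof(addr));

        /* With the timeout set, this returns -1/EAGAIN after ~500ms
         * instead of sleeping indefinitely. */
        if (recv(fd, buf, sizeof(buf), 0) < 0)
            perror("recv");
        close(fd);
        return 0;
    }
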
> 
> > And with MAX_SCHEDULE_TIMEOUT, when it eventually calls 'schedule()'
> > it just goes to sleep (HLT) and eventually gets woken up by
> > VIRQ_TIMER.
> > 
> So, if the timeout is MAX_SCHEDULE_TIMEOUT, the function does:
> 
> schedule_timeout(MAX_SCHEDULE_TIMEOUT) {
>     schedule();
>     return;
> }
> 
> If the timeout is anything else than MAX_SCHEDULE_TIMEOUT (but still a
> valid value), the function does:
> 
> schedule_timeout(timeout) {
>     struct timer_list timer;
>     unsigned long expire = timeout + jiffies;
> 
>     /* arm a timer whose callback will wake us ('current') back up */
>     setup_timer_on_stack(&timer, process_timeout, (unsigned long)current);
>     __mod_timer(&timer, expire, false, TIMER_NOT_PINNED);
>     schedule();
>     del_singleshot_timer_sync(&timer);
>     destroy_timer_on_stack(&timer);
>     return expire - jiffies; /* time left, if woken up early */
> }
> 
> So, in both cases, it pretty much calls schedule() just about
> immediately. And when schedule() is called, the calling process --
> which would be our UDP receiver-- goes to sleep.
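
(The usual caller-side idiom around this -- sketched here for
illustration, the variable names are mine -- is to mark the task as
sleeping *before* calling schedule_timeout(), so that a wakeup racing
with us is not lost:

    /* Sketch of the canonical kernel sleep pattern: */
    set_current_state(TASK_INTERRUPTIBLE);   /* signals may wake us */
    remaining = schedule_timeout(timeout);   /* sleep <= timeout jiffies */
    /* remaining is 0 if the timeout expired, or the jiffies left
     * if something woke us up earlier */

and wait_for_more_packets() does essentially this, via
prepare_to_wait_exclusive() on the socket's wait queue.)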
> 
> The difference is that, in case of MAX_SCHEDULE_TIMEOUT, it does not
> arrange for anyone to wake up the thread that is going to sleep. In
> theory, it could even be stuck forever... Of course, this depends on
> whether the receiver thread is on a runqueue or not, and (in case it's
> not) on whether its status is TASK_INTERRUPTIBLE or
> TASK_UNINTERRUPTIBLE, etc., and, in practice, it never happens! :-D
> 
> In this case, I think we take the other branch (the one 'with
> timeout'). But even if we took this one, I would expect the receiver
> thread not to be on any runqueue, but rather to be (in either an
> interruptible or uninterruptible state) on a blocking list, from where
> it is taken out when a packet arrives.
> 
> In case of anything different from MAX_SCHEDULE_TIMEOUT, all the above
> is still true, but a timer is set before calling schedule() and putting
> the thread to sleep. This means that, in case nothing that would wake
> up such thread happens, or in case it hasn't happened yet when the
> timeout expires, the thread is woken up by the timer.

Right.
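
(For completeness: the timer armed by schedule_timeout() wakes the
sleeper through a one-line callback -- this is roughly what 4.4's
kernel/time/timer.c does:

    static void process_timeout(unsigned long __data)
    {
            wake_up_process((struct task_struct *)__data);
    }

so "woken up by the timer" literally means wake_up_process() being
called on the receiver task from the timer softirq.)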
> 
> And in fact, schedule_timeout() is not a different way, with respect to
> just calling schedule(), of going to sleep. It is the way you go to
> sleep for at most some amount of time... But in all cases, you
> immediately go to sleep!
> 
> And I also am not sure I see where all that discussion you've had with
> George about IPIs fits into this all... The IPI that will trigger the
> call to schedule() that actually puts the thread we're sending to sleep
> here (i.e., the receiver) back into execution happens when the sender
> manages to send a packet (actually, when the packet arrives, I think)
> _or_ when the timer expires.

The IPIs were observed when SMT was exposed to the guest. That is because
the Linux scheduler put both applications - udp_sender and udp_receiver -
on the same CPU. Which meant that the 'schedule' call would immediately
pick the next application (udp_sender) and schedule it (sending an IPI
to itself to do that).

> 
> The two possible calls to schedule() in schedule_timeout() behave
> exactly in the same way, and I don't think having a timeout or not is
> responsible for any particular behavior.

Correct. The quirk was that if the applications were on separate
CPUs - the "thread [would be] woken up by the timer". While if they
were on the same CPU - the scheduler would pick the next application
on the run-queue (which coincidentally was the UDP sender - or receiver).

> 
> What I think is happening is this: when such a call to schedule()
> (from inside schedule_timeout(), I mean) is made what happens is that
> the receiver task just goes to sleep, and another one, perhaps the
> sender, is executed. The sender sends the packet, which arrives before
> the timeout, and the receiver is woken up.

Yes!
> 
> *Here* is where an IPI should or should not happen, depending on where
> our receiver task is going to be executed! And where would that be?
> Well, that depends on the Linux scheduler's load balancer, the behavior
> of which is controlled by scheduling domain flags like SD_BALANCE_FORK,
> SD_BALANCE_EXEC, SD_BALANCE_WAKE, SD_WAKE_AFFINE and SD_PREFER_SIBLING
> (and others, but I think these are the most likely ones to be involved
> here).

Probably.
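
In case anyone wants to eyeball these on a running guest: with
CONFIG_SCHED_DEBUG the per-domain flags show up under
/proc/sys/kernel/sched_domain/. Here is a hedged little decoder -- the
bit values below are from 4.4's include/linux/sched.h and do change
between kernel versions, so treat it as illustrative only:

    #include <stdio.h>

    /* SD_* bit values as in Linux 4.4; illustrative only. */
    static const struct { unsigned long bit; const char *name; } sd_flags[] = {
        { 0x0008, "SD_BALANCE_FORK"   },
        { 0x0004, "SD_BALANCE_EXEC"   },
        { 0x0010, "SD_BALANCE_WAKE"   },
        { 0x0020, "SD_WAKE_AFFINE"    },
        { 0x1000, "SD_PREFER_SIBLING" },
    };

    int main(void)
    {
        unsigned long flags;
        FILE *f = fopen("/proc/sys/kernel/sched_domain/cpu0/domain0/flags", "r");

        if (!f || fscanf(f, "%lu", &flags) != 1) {
            perror("sched_domain flags");
            return 1;
        }
        fclose(f);

        for (unsigned int i = 0; i < sizeof(sd_flags)/sizeof(sd_flags[0]); i++)
            printf("%-18s %s\n", sd_flags[i].name,
                   (flags & sd_flags[i].bit) ? "set" : "clear");
        return 0;
    }

Echoing a modified value back into that file is exactly the work-around
described below.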
> 
> So, in summary, where the receiver executes when it wakes up depends on
> the configuration of such flags in the (various) scheduling domain(s).
> Check, for instance, this path:
> 
>   try_to_wake_up() --> select_task_rq() --> select_task_rq_fair()
> 
> The reason why the test 'reacts' to topology changes is that which set
> of flags is used for the various scheduling domains --decided at the
> time the scheduling domains themselves are created and configured--
> depends on topology... So it's quite possible that exposing the SMT
> topology, with respect to not doing so, makes one of the flags flip in
> a way that makes the benchmark work better.

/me nods.
> 
> If you play with the flags above (or whatever their equivalents were in
> 2.6.39) directly, even without exposing the SMT-topology, I'm quite
> sure you would be able to trigger the same behavior.

I did. And that was the work-around - echo the 4xyz flag value into the
sched_domain entry and suddenly things go much faster.
> 
> Regards,
> Dario
> -- 
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
> 



_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

