
Re: [Xen-devel] schedulers and topology exposing questions



On Tue, Feb 02, 2016 at 12:45:00PM +0100, Dario Faggioli wrote:
> On Thu, 2016-01-28 at 22:27 -0500, Konrad Rzeszutek Wilk wrote:
> > On Thu, Jan 28, 2016 at 03:10:57PM +0000, Dario Faggioli wrote:
> > > 
> > > So, may I ask what piece of (Linux) code are we actually talking
> > > about?
> > > Because I had a quick look, and could not find where what you
> > > describe
> > > happens....
> > 
> > udp_recvmsg->__skb_recv_datagram->sock_rcvtimeo->schedule_timeout
> > The sk_rcvtimeo is MAX_SCHEDULE_TIMEOUT by default, but you can
> > alter it for the UDP socket by setting a different timeout.
> > 
> Ha, recvmsg! At some point you mentioned sendmsg, and I was looking
> there and seeing nothing! But yes, it indeed makes sense to consider
> the receiving side... let me have a look...
> 
> So, it looks to me that this is what happens:
> 
>  udp_recvmsg(noblock=0)
>    |
>    ---> __skb_recv_datagram(flags=0) {
>                 timeo = sock_rcvtimeo(flags=0) /* returns sk->sk_rcvtimeo */
>                 do { ... } while (!wait_for_more_packets(timeo));
>                            |
>                            ---> schedule_timeout(timeo)
> 
> So, at least in Linux 4.4, the timeout used is the one defined in
> sk->sk_rcvtimeo, which looks to me to be set via this socket option
> (unless I've followed some link wrong, which can well be the case):
> 
> http://lxr.free-electrons.com/source/include/uapi/asm-generic/socket.h#L31
> #define SO_RCVTIMEO     20
> 
> So there looks to be a timeout. But anyways, let's check
> schedule_timeout().
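
For reference, this is how that timeout ends up in sk->sk_rcvtimeo from
userspace. A minimal sketch (the port number and buffer size here are
arbitrary, just for illustration):

    /* Set a 500ms receive timeout on a UDP socket via SO_RCVTIMEO.
     * Without this, sk->sk_rcvtimeo stays at MAX_SCHEDULE_TIMEOUT
     * and recv() blocks until a packet actually arrives. */
    #include <stdio.h>
    #include <sys/socket.h>
    #include <sys/time.h>
    #include <netinet/in.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        struct timeval tv = { .tv_sec = 0, .tv_usec = 500000 };
        struct sockaddr_in addr = {
            .sin_family = AF_INET,
            .sin_port = htons(5001),            /* arbitrary test port */
            .sin_addr.s_addr = htonl(INADDR_ANY),
        };
        char buf[2048];

        if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0)
            perror("setsockopt");
        bind(fd, (struct sockaddr *)&addr, sizeof(addr));

        /* With the timeout set, this returns -1/EAGAIN after ~500ms
         * instead of sleeping indefinitely. */
        if (recv(fd, buf, sizeof(buf), 0) < 0)
            perror("recv");
        close(fd);
        return 0;
    }
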
> 
> > And with MAX_SCHEDULE_TIMEOUT, when it eventually calls 'schedule()'
> > it just goes to sleep (HLT) and eventually gets woken up by
> > VIRQ_TIMER.
> > 
> So, if the timeout is MAX_SCHEDULE_TIMEOUT, the function does:
> 
> schedule_timeout(MAX_SCHEDULE_TIMEOUT) {
>     schedule();
>     return;
> }
> 
> If the timeout is anything else than MAX_SCHEDULE_TIMEOUT (but still a
> valid value), the function does:
> 
> schedule_timeout(timeout) {
>     struct timer_list timer;
>     unsigned long expire = timeout + jiffies;
> 
>     /* arm a timer whose callback will wake us ('current') back up */
>     setup_timer_on_stack(&timer, process_timeout, (unsigned long)current);
>     __mod_timer(&timer, expire, false, TIMER_NOT_PINNED);
>     schedule();
>     del_singleshot_timer_sync(&timer);
>     destroy_timer_on_stack(&timer);
>     return expire - jiffies; /* time left, if woken up early */
> }
> 
> So, in both cases, it pretty much calls schedule() just about
> immediately. And when schedule() is called, the calling process --
> which would be our UDP receiver-- goes to sleep.
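
(The usual caller-side idiom around this -- sketched here for
illustration, the variable names are mine -- is to mark the task as
sleeping *before* calling schedule_timeout(), so that a wakeup racing
with us is not lost:

    /* Sketch of the canonical kernel sleep pattern: */
    set_current_state(TASK_INTERRUPTIBLE);   /* signals may wake us */
    remaining = schedule_timeout(timeout);   /* sleep <= timeout jiffies */
    /* remaining is 0 if the timeout expired, or the jiffies left
     * if something woke us up earlier */

and wait_for_more_packets() does essentially this, via
prepare_to_wait_exclusive() on the socket's wait queue.)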
> 
> The difference is that, in case of MAX_SCHEDULE_TIMEOUT, it does not
> arrange for anyone to wake up the thread that is going to sleep. In
> theory, it could even be stuck forever... Of course, this depends on
> whether the receiver thread is on a runqueue or not, and (in case it's
> not) on whether its status is TASK_INTERRUPTIBLE or
> TASK_UNINTERRUPTIBLE, etc., and, in practice, it never happens! :-D
> 
> In this case, I think we take the other branch (the one 'with
> timeout'). But even if we took this one, I would expect the receiver
> thread not to be on any runqueue, but rather to be (in either an
> interruptible or uninterruptible state) on a blocking list, from where
> it is taken out when a packet arrives.
> 
> In case of anything different from MAX_SCHEDULE_TIMEOUT, all the above
> is still true, but a timer is set before calling schedule() and putting
> the thread to sleep. This means that, in case nothing that would wake
> up such thread happens, or in case it hasn't happened yet when the
> timeout expires, the thread is woken up by the timer.

Right.
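
(For completeness: the timer armed by schedule_timeout() wakes the
sleeper through a one-line callback -- this is roughly what 4.4's
kernel/time/timer.c does:

    static void process_timeout(unsigned long __data)
    {
            wake_up_process((struct task_struct *)__data);
    }

so "woken up by the timer" literally means wake_up_process() being
called on the receiver task from the timer softirq.)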
> 
> And in fact, schedule_timeout() is not a different way, with respect to
> just calling schedule(), of going to sleep. It is the way you go to
> sleep for at most some amount of time... But in all cases, you
> immediately go to sleep!
> 
> And I also am not sure I see where all that discussion you've had with
> George about IPIs fits into this all... The IPI that will trigger the
> call to schedule() that actually puts the thread we're sending to sleep
> here (i.e., the receiver) back into execution happens when the sender
> manages to send a packet (actually, when the packet arrives, I think)
> _or_ when the timer expires.

The IPIs were observed when SMT was exposed to the guest. That is because
the Linux scheduler put both applications - udp_sender and udp_receiver -
on the same CPU. Which meant that the 'schedule' call would immediately
pick the next application (udp_sender) and schedule it (sending an IPI
to itself to do that).

> 
> The two possible calls to schedule() in schedule_timeout() behave
> exactly in the same way, and I don't think having a timeout or not is
> responsible for any particular behavior.

Correct. The quirk was that if the applications were on separate
CPUs - the "thread [would be] woken up by the timer". While if they
were on the same CPU - the scheduler would pick the next application
on the run-queue (which coincidentally was the UDP sender - or receiver).

> 
> What I think is happening is this: when such a call to schedule()
> (from inside schedule_timeout(), I mean) is made what happens is that
> the receiver task just goes to sleep, and another one, perhaps the
> sender, is executed. The sender sends the packet, which arrives before
> the timeout, and the receiver is woken up.

Yes!
> 
> *Here* is where an IPI should or should not happen, depending on where
> our receiver task is going to be executed! And where would that be?
> Well, that depends on the Linux scheduler's load balancer, the behavior
> of which is controlled by scheduling domain flags like SD_BALANCE_FORK,
> SD_BALANCE_EXEC, SD_BALANCE_WAKE, SD_WAKE_AFFINE and SD_PREFER_SIBLING
> (and others, but I think these are the most likely ones to be involved
> here).

Probably.
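
In case anyone wants to eyeball these on a running guest: with
CONFIG_SCHED_DEBUG the per-domain flags show up under
/proc/sys/kernel/sched_domain/. Here is a hedged little decoder -- the
bit values below are from 4.4's include/linux/sched.h and do change
between kernel versions, so treat it as illustrative only:

    #include <stdio.h>

    /* SD_* bit values as in Linux 4.4; illustrative only. */
    static const struct { unsigned long bit; const char *name; } sd_flags[] = {
        { 0x0008, "SD_BALANCE_FORK"   },
        { 0x0004, "SD_BALANCE_EXEC"   },
        { 0x0010, "SD_BALANCE_WAKE"   },
        { 0x0020, "SD_WAKE_AFFINE"    },
        { 0x1000, "SD_PREFER_SIBLING" },
    };

    int main(void)
    {
        unsigned long flags;
        FILE *f = fopen("/proc/sys/kernel/sched_domain/cpu0/domain0/flags", "r");

        if (!f || fscanf(f, "%lu", &flags) != 1) {
            perror("sched_domain flags");
            return 1;
        }
        fclose(f);

        for (unsigned int i = 0; i < sizeof(sd_flags)/sizeof(sd_flags[0]); i++)
            printf("%-18s %s\n", sd_flags[i].name,
                   (flags & sd_flags[i].bit) ? "set" : "clear");
        return 0;
    }

Echoing a modified value back into that file is exactly the work-around
described below.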
> 
> So, in summary, where the receiver executes when it wakes up depends on
> the configuration of such flags in the (various) scheduling domain(s).
> Check, for instance, this path:
> 
>   try_to_wake_up() --> select_task_rq() --> select_task_rq_fair()
> 
> The reason why the test 'reacts' to topology changes is that which set
> of flags is used for the various scheduling domains --decided at the
> time the scheduling domains themselves are created and configured--
> depends on topology... So it's quite possible that exposing the SMT
> topology, with respect to not doing so, makes one of the flags flip in
> a way that makes the benchmark work better.

/me nods.
> 
> If you play with the flags above (or whatever their equivalents were in
> 2.6.39) directly, even without exposing the SMT-topology, I'm quite
> sure you would be able to trigger the same behavior.

I did. And that was the work-around - echo the 4xyz flag value into the
sched_domain entry and suddenly things go much faster.
> 
> Regards,
> Dario
> -- 
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
> 



_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

