Re: [Xen-devel] About vcpu wakeup and runq tickling in credit

On 15/11/12 12:10, Dario Faggioli wrote:
On Tue, 2012-10-23 at 16:16 +0100, George Dunlap wrote:
As for a possible solution, I think that, for instance, tickling
all the CPUs in both v_W's and v_C's affinity masks could solve this,
but that would also potentially increase the overhead (by asking _a_lot_
of CPUs to reschedule), and again, it's hard to say if/when it's
Well in my code, opt_tickle_idle_one is on by default, which means only
one other cpu will be woken up.  If there were an easy way to make it
wake up a CPU in v_C's affinity as well (supposing that there was no
overlap), that would probably be a win.

Of course, that's only necessary if:
* v_C is lower priority than v_W
* There are no idlers that intersect both v_C and v_W's affinity mask.
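
The two conditions above can be written as a small predicate. This is
only a sketch: it models cpumasks as plain 64-bit bitmaps and assumes a
larger number means higher priority. needs_extra_tickle() and its
parameters are hypothetical names, not anything from the credit
scheduler.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy cpumask: bit n set means pcpu n is in the mask (hypothetical type). */
typedef uint64_t mask_t;

/*
 * Returns true when tickling a single idler may leave v_C stranded:
 * v_C loses the cpu to v_W, and no idle pcpu lies in both affinities.
 * Assumption: a larger prio value means higher priority.
 */
static bool needs_extra_tickle(int prio_c, int prio_w,
                               mask_t idlers, mask_t aff_c, mask_t aff_w)
{
    bool c_preempted = prio_c < prio_w;
    bool common_idler = (idlers & aff_c & aff_w) != 0;

    return c_preempted && !common_idler;
}
```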

It's probably a good idea though to try to set up a scenario where this
might be an issue and see how often it actually happens.

Ok, I think I managed to reproduce this. Look at the following trace,
considering that d51 has vcpu-affinity with pcpus 8-15, while d0 has no
affinity at all (its vcpus can run everywhere):

  166.853945095 ---|-|-------x-| d51v1 runstate_change d0v7 blocked->runnable
]166.853945884 ---|-|-------x-| d51v1   28004(2:8:4) 2 [ 0 7 ]
]166.853986385 ---|-|-------x-| d51v1   2800e(2:8:e) 2 [ 33 4bf97be ]
]166.853986522 ---|-|-------x-| d51v1   2800f(2:8:f) 3 [ 0 a050 1c9c380 ]
]166.853986636 ---|-|-------x-| d51v1   2800a(2:8:a) 4 [ 33 1 0 7 ]
  166.853986775 ---|-|-------x-| d51v1 runstate_change d51v1 running->runnable
  166.853986905 ---|-|-------x-| d?v? runstate_change d0v7 runnable->running
]166.854195353 ---|-|-------x-| d0v7   28006(2:8:6) 2 [ 0 7 ]
]166.854196484 ---|-|-------x-| d0v7   2800e(2:8:e) 2 [ 0 33530 ]
]166.854196584 ---|-|-------x-| d0v7   2800f(2:8:f) 3 [ 33 33530 1c9c380 ]
]166.854196691 ---|-|-------x-| d0v7   2800a(2:8:a) 4 [ 0 7 33 1 ]
  166.854196809 ---|-|-------x-| d0v7 runstate_change d0v7 running->blocked
  166.854197175 ---|-|-------x-| d?v? runstate_change d51v1 runnable->running

So, if I'm not reading the trace wrong, when d0v7 wakes up (very first
event) it preempts d51v1. Now, even though almost all of pcpus 8-15 are
idle, none of them gets tickled and comes to pick d51v1 up, so it has to
wait in the runq until d0v7 goes back to sleep.

I suspect this could be because, at d0v7 wakeup time, we try to tickle
some pcpu which is in d0v7's affinity, but not in d51v1's (since, in the
second 'if() {}' block in __runq_tickle() we only care about
new->vcpu->cpu_affinity, and in this case, new is d0v7).
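
To make the suspicion concrete, here is a toy model of the selection
step. It is not the actual __runq_tickle() code. It assumes 16 pcpus,
cpumasks as 16-bit bitmaps, and that the single idler to tickle is
simply the lowest-numbered idle pcpu in the waking vcpu's affinity;
pick_idler_for_new() and cpu_in_mask() are hypothetical names.

```c
#include <stdint.h>

typedef uint16_t mask_t;   /* toy cpumask for 16 pcpus: bit n = pcpu n */

/*
 * Pick one idle pcpu to tickle, looking only at the affinity of the
 * waking vcpu ("new"). Returns the pcpu number, or -1 if none is idle.
 * Assumption: the lowest-numbered candidate wins.
 */
static int pick_idler_for_new(mask_t idlers, mask_t new_affinity)
{
    mask_t candidates = idlers & new_affinity;

    for (int cpu = 0; cpu < 16; cpu++)
        if (candidates & (1u << cpu))
            return cpu;
    return -1;
}

/* Is the given pcpu part of this affinity mask? */
static int cpu_in_mask(int cpu, mask_t affinity)
{
    return cpu >= 0 && (affinity & (1u << cpu)) != 0;
}
```

With d0 unpinned (0xFFFF), d51 pinned to pcpus 8-15 (0xFF00) and, say,
pcpus 3 and 9 idle, this model tickles pcpu 3, which is outside d51v1's
affinity even though pcpu 9 could have run it.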

I know, looking at the timestamps it doesn't look like it is a big deal
in this case, and I'm still working on producing numbers that can better
show whether or not this is a real problem.

Anyway, and independently of the results of these tests, why do I care
so much?

Well, if you substitute the concept of "vcpu-affinity" with
"node-affinity" above (which is what I am doing in my NUMA aware
scheduling patches) you'll see why this is bothering me quite a bit. In
fact, in that case, waking up a random pcpu with which d0v7 has
node-affinity, while d51v1 does not, would cause d51v1 to be pulled
by that cpu (since node-affinity is only a preference)!

So, in the vcpu-affinity case, if pcpu 3 gets tickled, when it peeks at
pcpu 13's runq for work to steal it does not find anything suitable and
gives up, leaving d51v1 in the runq even if there are idle pcpus on which
it could run, which is already bad.
In the node-affinity case, pcpu 3 will actually manage to steal d51v1
and run it, even if there are idle pcpus with which it has
node-affinity, thus defeating most of the benefits of the whole NUMA
aware scheduling thing (at least for some workloads).
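
The difference between the two cases comes down to the check a stealing
pcpu applies. A rough sketch, with hypothetical names and toy 16-bit
masks; it assumes node-affinity is a pure preference that is never
consulted when deciding whether a steal is allowed:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint16_t mask_t;   /* toy cpumask: bit n = pcpu n */

struct toy_vcpu {
    mask_t cpu_affinity;   /* hard constraint: where the vcpu may run */
    mask_t node_affinity;  /* soft preference: where it would rather run */
};

/* A pcpu may steal a vcpu only if it is in the vcpu's hard affinity. */
static bool can_steal(const struct toy_vcpu *v, int cpu)
{
    return (v->cpu_affinity & (1u << cpu)) != 0;
}

/* True when an allowed steal lands the vcpu outside its preferred node. */
static bool steal_breaks_preference(const struct toy_vcpu *v, int cpu)
{
    return can_steal(v, cpu) && (v->node_affinity & (1u << cpu)) == 0;
}
```

For a vcpu pinned to pcpus 8-15, can_steal() refuses pcpu 3, so the vcpu
stays queued. For a vcpu that only prefers pcpus 8-15, pcpu 3 is allowed
to steal it, and the preference is lost.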

Maybe what we should do is do the wake-up based on who is likely to run
on the current cpu: i.e., if "current" is likely to be pre-empted, look
at idlers based on "current"'s mask; if "new" is likely to be put on the
queue, look at idlers based on "new"'s mask.

What do you think?
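
A minimal sketch of that idea, to make the proposal easier to discuss.
It is not a patch: masks are toy 16-bit bitmaps, a larger number is
assumed to mean higher priority, and idlers_to_consider() is a
hypothetical helper.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint16_t mask_t;   /* toy cpumask: bit n = pcpu n */

/*
 * Choose which idlers are worth tickling. If the waking vcpu ("new")
 * will pre-empt "current", current is the one that needs a new home,
 * so use its mask; otherwise new goes on the runq, so use new's mask.
 * Assumption: a larger prio value means higher priority.
 */
static mask_t idlers_to_consider(int prio_cur, mask_t aff_cur,
                                 int prio_new, mask_t aff_new,
                                 mask_t idlers)
{
    bool cur_preempted = prio_new > prio_cur;

    return idlers & (cur_preempted ? aff_cur : aff_new);
}
```

In the traced scenario (current pinned to pcpus 8-15, new unpinned and
higher priority), this only considers idlers among pcpus 8-15.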


