Re: [Xen-devel] [xen-unstable test] 6374: regressions - FAIL
>>> On 14.03.11 at 11:52, Tim Deegan <Tim.Deegan@xxxxxxxxxx> wrote:
> At 10:39 +0000 on 14 Mar (1300099174), Jan Beulich wrote:
>> > I think this hang comes because although this code:
>> >
>> >     cpu = cycle_cpu(CSCHED_PCPU(nxt)->idle_bias, nxt_idlers);
>> >     if ( commit )
>> >        CSCHED_PCPU(nxt)->idle_bias = cpu;
>> >     cpus_andnot(cpus, cpus, per_cpu(cpu_sibling_map, cpu));
>> >
>> > removes the new cpu and its siblings from cpus, cpu isn't guaranteed to
>> > have been in cpus in the first place, and none of its siblings are
>> > either since nxt might not be its sibling.
>>
>> I had originally spent quite a while to verify that the loop this is in
>> can't be infinite (i.e. there's always going to be at least one bit
>> removed from "cpus"), and did so again during the last half hour or so.
>
> I'm pretty sure there are possible passes through this loop that don't
> remove any cpus, though I haven't constructed the full history that gets
> you there.

Actually, while I don't think that this can happen, something else is
definitely broken here: the logic can select a CPU that's not in the
vCPU's affinity mask. How I managed not to notice this when I originally
put this change together, I can't tell.

I'll send a patch in a moment, and I think after that patch it's also
easier to see that each iteration will remove at least one bit.

>> > which guarantees that nxt will be removed from cpus, though I suspect
>> > this means that we might not pick the best HT pair in a particular core.
>> > Scheduler code is twisty and hurts my brain, so I'd like George's
>> > opinion before checking anything in.
>>
>> No - that was precisely done the opposite direction to get better
>> symmetry of load across all CPUs. With what you propose, idle_bias
>> would become meaningless.
>
> I don't see why it would. As I said, having picked a core we might not
> iterate to pick the best cpu within that core, but the round-robining
> effect is still there. And even if not, I figured a hypervisor crash is
> worse than a suboptimal scheduling decision. :)

Sure. Just that this code has been there for quite a long time, and it
would be really strange to only now see it start producing hangs (which
apparently aren't that difficult to reproduce - iirc a similar one was
sent around by Ian a few days earlier).

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
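[Editor's note] For readers without the scheduler source to hand, below is a small
stand-alone C model of the loop shape being argued about. It is not the Xen code:
cpumasks are plain uint64_t bitmasks, and the toy sibling_map()/cycle_cpu() helpers,
idle_bias array and popcount-based "nxt looks better" test are assumptions standing
in for the real topology maps and weight comparison in sched_credit.c. The intent is
only to make the termination question concrete: the else branch clears nxt itself
from cpus, while the branch containing the quoted cpus_andnot() only clears bits if
the newly picked cpu's sibling set actually overlaps cpus.

    /*
     * Stand-alone model of the CPU-pick loop under discussion.  This is
     * NOT the Xen source: cpumask_t is replaced by a uint64_t bitmask,
     * and sibling_map(), cycle_cpu() and the "nxt looks better" test are
     * toy stand-ins for the real topology maps and weight comparison in
     * sched_credit.c.  It only illustrates where each iteration does (or
     * does not obviously) clear a bit from "cpus".
     */
    #include <stdint.h>
    #include <stdio.h>

    #define NR_CPUS 8

    /* Toy topology: {0,1}, {2,3}, {4,5}, {6,7} are HT sibling pairs. */
    static uint64_t sibling_map(int cpu)
    {
        return 3ULL << (cpu & ~1);
    }

    /* Next set bit after "start" in "mask", wrapping around. */
    static int cycle_cpu(int start, uint64_t mask)
    {
        for ( int i = 1; i <= NR_CPUS; i++ )
        {
            int c = (start + i) % NR_CPUS;
            if ( mask & (1ULL << c) )
                return c;
        }
        return -1;
    }

    int main(void)
    {
        /* CPUs 1, 2 and 3 idle; the vCPU currently runs on CPU 0. */
        uint64_t idlers = 0x0e;
        uint64_t cpus   = idlers;  /* stands in for affinity & online & idlers */
        int cpu = 0;
        int idle_bias[NR_CPUS] = { 0 };

        while ( cpus != 0 )
        {
            int nxt = cycle_cpu(cpu, cpus);
            uint64_t nxt_idlers = idlers & sibling_map(nxt);
            uint64_t cpu_idlers = idlers & sibling_map(cpu);

            if ( __builtin_popcountll(nxt_idlers) >
                 __builtin_popcountll(cpu_idlers) )
            {
                /* The branch quoted in the thread: cpu is re-picked from
                 * nxt's idle siblings, which need not intersect "cpus",
                 * so this andnot alone does not obviously clear anything. */
                cpu = cycle_cpu(idle_bias[nxt], nxt_idlers);
                idle_bias[nxt] = cpu;
                cpus &= ~sibling_map(cpu);
            }
            else
            {
                /* This branch clearly makes progress: nxt is in "cpus"
                 * and gets cleared here. */
                cpus &= ~nxt_idlers;
                cpus &= ~(1ULL << nxt);
            }
            printf("compared against nxt=%d, cpu=%d, cpus now %#llx\n",
                   nxt, cpu, (unsigned long long)cpus);
        }
        printf("final pick: cpu %d\n", cpu);
        return 0;
    }

Compiled with gcc, this toy setup exercises both branches and terminates after two
iterations; whether the first branch can ever fail to clear a bit in the real code
is exactly what the message above debates.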