Re: [Xen-devel] [xen-unstable test] 6374: regressions - FAIL
>>> On 14.03.11 at 11:52, Tim Deegan <Tim.Deegan@xxxxxxxxxx> wrote:
> At 10:39 +0000 on 14 Mar (1300099174), Jan Beulich wrote:
>> > I think this hang comes because although this code:
>> >
>> >     cpu = cycle_cpu(CSCHED_PCPU(nxt)->idle_bias, nxt_idlers);
>> >     if ( commit )
>> >        CSCHED_PCPU(nxt)->idle_bias = cpu;
>> >     cpus_andnot(cpus, cpus, per_cpu(cpu_sibling_map, cpu));
>> >
>> > removes the new cpu and its siblings from cpus, cpu isn't guaranteed to
>> > have been in cpus in the first place, and none of its siblings are
>> > either since nxt might not be its sibling.
>>
>> I had originally spent quite a while to verify that the loop this is in
>> can't be infinite (i.e. there's always going to be at least one bit
>> removed from "cpus"), and did so again during the last half hour or so.
>
> I'm pretty sure there are possible passes through this loop that don't
> remove any cpus, though I haven't constructed the full history that gets
> you there.

Actually, while I don't think that this can happen, something else is
definitely broken here: the logic can select a CPU that's not in the
vCPU's affinity mask. How I managed not to notice this when I originally
put this change together, I can't tell.

I'll send a patch in a moment, and I think after that patch it's also
easier to see that each iteration will remove at least one bit.

>> > which guarantees that nxt will be removed from cpus, though I suspect
>> > this means that we might not pick the best HT pair in a particular core.
>> > Scheduler code is twisty and hurts my brain, so I'd like George's
>> > opinion before checking anything in.
>>
>> No - that was precisely done the opposite direction to get better
>> symmetry of load across all CPUs. With what you propose, idle_bias
>> would become meaningless.
>
> I don't see why it would. As I said, having picked a core we might not
> iterate to pick the best cpu within that core, but the round-robining
> effect is still there. And even if not, I figured a hypervisor crash is
> worse than a suboptimal scheduling decision. :)

Sure. Just that this code has been there for quite a long time, and it
would be really strange to only now see it start producing hangs (which
apparently aren't that difficult to reproduce - iirc a similar one was
sent around by Ian a few days earlier).

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
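[Editor's note] For readers without the scheduler source to hand, below is a small
stand-alone C model of the loop shape being argued about. It is not the Xen code:
cpumasks are plain uint64_t bitmasks, and the toy sibling_map()/cycle_cpu() helpers,
idle_bias array and popcount-based "nxt looks better" test are assumptions standing
in for the real topology maps and weight comparison in sched_credit.c. The intent is
only to make the termination question concrete: the else branch clears nxt itself
from cpus, while the branch containing the quoted cpus_andnot() only clears bits if
the newly picked cpu's sibling set actually overlaps cpus.

    /*
     * Stand-alone model of the CPU-pick loop under discussion.  This is
     * NOT the Xen source: cpumask_t is replaced by a uint64_t bitmask,
     * and sibling_map(), cycle_cpu() and the "nxt looks better" test are
     * toy stand-ins for the real topology maps and weight comparison in
     * sched_credit.c.  It only illustrates where each iteration does (or
     * does not obviously) clear a bit from "cpus".
     */
    #include <stdint.h>
    #include <stdio.h>

    #define NR_CPUS 8

    /* Toy topology: {0,1}, {2,3}, {4,5}, {6,7} are HT sibling pairs. */
    static uint64_t sibling_map(int cpu)
    {
        return 3ULL << (cpu & ~1);
    }

    /* Next set bit after "start" in "mask", wrapping around. */
    static int cycle_cpu(int start, uint64_t mask)
    {
        for ( int i = 1; i <= NR_CPUS; i++ )
        {
            int c = (start + i) % NR_CPUS;
            if ( mask & (1ULL << c) )
                return c;
        }
        return -1;
    }

    int main(void)
    {
        /* CPUs 1, 2 and 3 idle; the vCPU currently runs on CPU 0. */
        uint64_t idlers = 0x0e;
        uint64_t cpus   = idlers;  /* stands in for affinity & online & idlers */
        int cpu = 0;
        int idle_bias[NR_CPUS] = { 0 };

        while ( cpus != 0 )
        {
            int nxt = cycle_cpu(cpu, cpus);
            uint64_t nxt_idlers = idlers & sibling_map(nxt);
            uint64_t cpu_idlers = idlers & sibling_map(cpu);

            if ( __builtin_popcountll(nxt_idlers) >
                 __builtin_popcountll(cpu_idlers) )
            {
                /* The branch quoted in the thread: cpu is re-picked from
                 * nxt's idle siblings, which need not intersect "cpus",
                 * so this andnot alone does not obviously clear anything. */
                cpu = cycle_cpu(idle_bias[nxt], nxt_idlers);
                idle_bias[nxt] = cpu;
                cpus &= ~sibling_map(cpu);
            }
            else
            {
                /* This branch clearly makes progress: nxt is in "cpus"
                 * and gets cleared here. */
                cpus &= ~nxt_idlers;
                cpus &= ~(1ULL << nxt);
            }
            printf("compared against nxt=%d, cpu=%d, cpus now %#llx\n",
                   nxt, cpu, (unsigned long long)cpus);
        }
        printf("final pick: cpu %d\n", cpu);
        return 0;
    }

Compiled with gcc, this toy setup exercises both branches and terminates after two
iterations; whether the first branch can ever fail to clear a bit in the real code
is exactly what the message above debates.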