Re: [Xen-devel] [PATCH v2] Fix scheduler crash after s3 resume
>>> On 25.01.13 at 11:35, Juergen Gross <juergen.gross@xxxxxxxxxxxxxx> wrote:
> On 25.01.2013 11:31, Jan Beulich wrote:
>>>>> On 25.01.13 at 11:23, Juergen Gross<juergen.gross@xxxxxxxxxxxxxx> wrote:
>>> On 25.01.2013 11:15, Jan Beulich wrote:
>>>>>>> On 25.01.13 at 10:45, Tomasz Wroblewski<tomasz.wroblewski@xxxxxxxxxx> wrote:
>>>>
>>>>>> I think I had already raised the question of the placement of
>>>>>> this rcu_barrier() here, and the lack of a counterpart in the
>>>>>> suspend portion of the path. Keir? Or should
>>>>>> rcu_barrier_action() avoid calling process_pending_softirqs()
>>>>>> while still resuming, and instead call __do_softirq() with all but
>>>>>> RCU_SOFTIRQ masked (perhaps through a suitable wrapper,
>>>>>> or alternatively by open-coding its effect)?
>>>>>>
>>>>> Though I recall these vcpu_wake crashes happen also from other entry
>>>>> points in enter_state but rcu_barrier, so I don't think removing that
>>>>> helps much. Just was unable to get a proper log of them today due to
>>>>> most of them being cut in half. Will try bit more.
>>>>
>>>> In which case making __do_softirq() itself honor being in the
>>>> suspend/resume path might still be an option.
>>>>
>>>>> My belief is that as long as vcpu_migrate is not called in
>>>>> cpu_disable_scheduler, the vcpu->processor shall continue to point to
>>>>> offline cpu. Which will crash if the vcpu_wake is called for that vcpu.
>>>>> If vcpu_migrate is called, then vcpu_wake will still be called with some
>>>>> frequency but since vcpu->processor shall point to online cpu, and it
>>>>> won't crash. So likely avoiding the wakes here completely is not the
>>>>> goal, just the offline ones.
>>>>
>>>> But you neglect the fact that waking vCPU-s at this point is
>>>> unnecessary anyway (they have nowhere to run on).
>>>
>>> What about adding a global scheduler_disable() in freeze_domains() and a
>>> scheduler_enable() in thaw_domains() which will switch scheduler locking to
>>> a global lock (or disable it at all?). This should solve all problems
>>> without any complex changes of current behavior.
>>
>> I don't see how this would address the so far described
>> shortcomings.
>
> The crash happens due to an access to the scheduler percpu area which isn't
> allocated at the moment. The accessed element is the address of the
> scheduler lock for this cpu. Disabling the percpu locking scheme of the
> scheduler while the non-boot cpus are offline will avoid the crash.

Ah, okay. But that wouldn't prevent other bad effects that could
result from vCPU-s pointing to offline pCPU-s. Hence I think such
a solution, even if sufficient for now, would set us up for future
similar (and similarly hard to debug) issues.

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
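
Editorial illustration (not part of the original message): the exchange above can be pictured with a small stand-alone model. It is only a sketch under stated assumptions and is not Xen code. The names scheduler_disable() and scheduler_enable() come from the proposal in the thread; everything else (NR_CPUS, struct lock, percpu_sched_lock, sched_lock_for(), cpu_online(), cpu_offline(), vcpu_on_offline_cpu()) is hypothetical and exists only for this example. The model shows why falling back to a global lock would avoid the lookup in a missing per-CPU area, and why, as Jan points out, a vCPU whose processor field still names an offline pCPU remains a problem afterwards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS 4 /* hypothetical, kept small for the illustration */

/* Stand-in for a spinlock; this model never actually takes the lock. */
struct lock { int held; };

/* Hypothetical model of the per-CPU scheduler data: the pointer is NULL
 * while a CPU is offline and its per-CPU area is not allocated. */
static struct lock percpu_lock_storage[NR_CPUS];
static struct lock *percpu_sched_lock[NR_CPUS];
static struct lock global_sched_lock;
static bool percpu_locking_enabled = true;

/* Names taken from the proposal in the thread; bodies are assumptions. */
void scheduler_disable(void) { percpu_locking_enabled = false; }
void scheduler_enable(void)  { percpu_locking_enabled = true; }

/* Hypothetical helpers modelling CPU bring-up and tear-down. */
void cpu_online(unsigned int cpu)
{
    assert(cpu < NR_CPUS);
    percpu_sched_lock[cpu] = &percpu_lock_storage[cpu];
}

void cpu_offline(unsigned int cpu)
{
    assert(cpu < NR_CPUS);
    percpu_sched_lock[cpu] = NULL;
}

/* Select the lock a wake-up path would take for the given pCPU.
 * With per-CPU locking disabled, the global lock is returned and the
 * per-CPU area is never consulted. Otherwise the result is NULL for an
 * offline CPU, which models the access that crashes after resume. */
struct lock *sched_lock_for(unsigned int cpu)
{
    assert(cpu < NR_CPUS);
    if (!percpu_locking_enabled)
        return &global_sched_lock;
    return percpu_sched_lock[cpu];
}

/* Minimal stand-in for a vCPU: only the field discussed in the thread. */
struct vcpu { unsigned int processor; };

/* True if the vCPU still points at a pCPU without per-CPU data, which is
 * the condition that stays in place even when the global lock is used. */
bool vcpu_on_offline_cpu(const struct vcpu *v)
{
    return percpu_sched_lock[v->processor] == NULL;
}
```

In this model, calling scheduler_disable() before taking CPU 1 offline makes sched_lock_for(1) return the global lock, so the lookup itself is safe. vcpu_on_offline_cpu() still reports true for a vCPU left on CPU 1, and once scheduler_enable() is called without migrating it, the lookup yields NULL again.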