RE: handle_pio looping during domain shutdown, with qemu 4.2.0 in stubdom
> -----Original Message-----
> From: Jan Beulich <jbeulich@xxxxxxxx>
> Sent: 08 June 2020 09:14
> To: 'Marek Marczykowski-Górecki' <marmarek@xxxxxxxxxxxxxxxxxxxxxx>; paul@xxxxxxx
> Cc: 'Andrew Cooper' <andrew.cooper3@xxxxxxxxxx>; 'xen-devel' <xen-devel@xxxxxxxxxxxxxxxxxxxx>
> Subject: Re: handle_pio looping during domain shutdown, with qemu 4.2.0 in stubdom
>
> On 05.06.2020 18:18, 'Marek Marczykowski-Górecki' wrote:
> > On Fri, Jun 05, 2020 at 04:39:56PM +0100, Paul Durrant wrote:
> >>> From: Jan Beulich <jbeulich@xxxxxxxx>
> >>> Sent: 05 June 2020 14:57
> >>>
> >>> On 05.06.2020 15:37, Paul Durrant wrote:
> >>>>> From: Jan Beulich <jbeulich@xxxxxxxx>
> >>>>> Sent: 05 June 2020 14:32
> >>>>>
> >>>>> On 05.06.2020 13:05, Paul Durrant wrote:
> >>>>>> That would mean we wouldn't be seeing the "Unexpected PIO" message.
> >>>>>> From that message this is clearly X86EMUL_UNHANDLEABLE, which
> >>>>>> suggests a race with ioreq server teardown, possibly due to
> >>>>>> selecting a server but then not finding a vcpu match in
> >>>>>> ioreq_vcpu_list.
> >>>>>
> >>>>> I was suspecting such, but at least the tearing down of all servers
> >>>>> happens only from relinquish-resources, which gets started only
> >>>>> after ->is_shut_down got set (unless the tool stack invoked
> >>>>> XEN_DOMCTL_destroydomain without having observed XEN_DOMINF_shutdown
> >>>>> set for the domain).
> >>>>>
> >>>>> For individually unregistered servers - yes, if qemu did so, this
> >>>>> would be a problem. They need to remain registered until all vCPU-s
> >>>>> in the domain got paused.
> >>>>
> >>>> It shouldn't be a problem, should it? Destroying an individual server
> >>>> is only done with the domain paused, so no vcpus can be running at the
> >>>> time.
> >>>
> >>> Consider the case of one getting destroyed after it has already
> >>> returned data, but the originating vCPU didn't consume that data
> >>> yet. Once that vCPU gets unpaused, handle_hvm_io_completion()
> >>> won't find the matching server anymore, and hence the chain
> >>> hvm_wait_for_io() -> hvm_io_assist() ->
> >>> vcpu_end_shutdown_deferral() would be skipped. handle_pio()
> >>> would then still correctly consume the result.
> >>
> >> True, and skipping hvm_io_assist() means the vcpu internal ioreq state
> >> will be left set to IOREQ_READY, and *that* explains why we would then
> >> exit hvmemul_do_io() with X86EMUL_UNHANDLEABLE (from the first switch).
> >
> > I can confirm X86EMUL_UNHANDLEABLE indeed comes from the first switch in
> > hvmemul_do_io(). And it happens shortly after the ioreq server is
> > destroyed:
> >
> > (XEN) d12v0 XEN_DMOP_remote_shutdown domain 11 reason 0
> > (XEN) d12v0 domain 11 domain_shutdown vcpu_id 0 defer_shutdown 1
> > (XEN) d12v0 XEN_DMOP_remote_shutdown domain 11 done
> > (XEN) d12v0 hvm_destroy_ioreq_server called for 11, id 0
>
> Can either of you tell why this is? As said before, qemu shouldn't
> start tearing down ioreq servers until the domain has made it out
> of all shutdown deferrals, and all its vCPU-s have been paused.
> For the moment I think the proposed changes, while necessary, will
> mask another issue elsewhere. The @releaseDomain xenstore watch,
> being the trigger I would consider relevant here, will trigger
> only once XEN_DOMINF_shutdown is reported set for a domain, which
> gets derived from d->is_shut_down (i.e. not mistakenly
> d->is_shutting_down).

I can't find anything that actually calls xendevicemodel_shutdown().
It was added by:

commit 1462f9ea8f4219d520a530787b80c986e050aa98
Author: Ian Jackson <ian.jackson@xxxxxxxxxxxxx>
Date:   Fri Sep 15 17:21:14 2017 +0100

    tools: libxendevicemodel: Provide xendevicemodel_shutdown

    Signed-off-by: Ian Jackson <Ian.Jackson@xxxxxxxxxxxxx>
    Acked-by: Wei Liu <wei.liu2@xxxxxxxxxx>

Perhaps Ian can shed more light on it?

  Paul

>
> Jan
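
For readers following the code references above: the "first switch" Paul
and Marek mention is the ioreq-state check at the top of hvmemul_do_io().
Below is a minimal sketch of that logic, assuming simplified stand-in
definitions rather than the verbatim Xen source (the state and return-value
names follow the discussion; the numeric values and the helper function
itself are illustrative only). It shows why a vCPU whose ioreq state was
never cleared ends up returning X86EMUL_UNHANDLEABLE.

    /*
     * Sketch only -- NOT the verbatim Xen source.  Names follow the
     * discussion in this thread; values and the helper are stand-ins.
     */
    enum ioreq_state {
        STATE_IOREQ_NONE,       /* no request in flight                   */
        STATE_IOREQ_READY,      /* request handed to the ioreq server     */
        STATE_IOREQ_INPROCESS,  /* server is processing it                */
        STATE_IORESP_READY      /* response available, not yet consumed   */
    };

    #define X86EMUL_OKAY          0   /* illustrative values only */
    #define X86EMUL_UNHANDLEABLE  1

    /* Hypothetical helper modelling only the initial state check. */
    static int io_state_check(enum ioreq_state state)
    {
        switch ( state )
        {
        case STATE_IOREQ_NONE:
            /* Nothing pending: a fresh request can be issued. */
            return X86EMUL_OKAY;

        case STATE_IORESP_READY:
            /* A completed response is waiting and can be consumed. */
            return X86EMUL_OKAY;

        default:
            /*
             * A vCPU left in STATE_IOREQ_READY (or _INPROCESS) here means
             * the request was sent but the completion path
             * (hvm_wait_for_io() -> hvm_io_assist() ->
             * vcpu_end_shutdown_deferral()) never ran, e.g. because the
             * ioreq server was destroyed in between.  The emulator then
             * gives up, and handle_pio() reports the "Unexpected PIO"
             * seen in the logs quoted above.
             */
            return X86EMUL_UNHANDLEABLE;
        }
    }

This is consistent with Jan's point earlier in the thread: an ioreq server
needs to stay registered at least until the originating vCPU has consumed
the response it returned.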