[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH] x86/ioreq server: Fix DomU couldn't reboot when using p2m_ioreq_server p2m_type



On 09/05/17 06:21, Yu Zhang wrote:
> 
> 
> On 5/8/2017 7:12 PM, George Dunlap wrote:
>> On 08/05/17 11:52, Zhang, Xiong Y wrote:
>>>>>>> On 06.05.17 at 03:51, <xiong.y.zhang@xxxxxxxxx> wrote:
>>>>>>>>> On 05.05.17 at 05:52, <xiong.y.zhang@xxxxxxxxx> wrote:
>>>>>>> 'commit 1679e0df3df6 ("x86/ioreq server: asynchronously reset
>>>>>>> outstanding p2m_ioreq_server entries")' will call
>>>>>>> p2m_change_entry_type_global() which set entry.recalc=1. Then
>>>>>>> the following get_entry(p2m_ioreq_server) will return
>>>>>>> p2m_ram_rw type.
>>>>>>> But 'commit 6d774a951696 ("x86/ioreq server: synchronously reset
>>>>>>> outstanding p2m_ioreq_server entries when an ioreq server unmaps")'
>>>>>>> assume get_entry(p2m_ioreq_server) will return p2m_ioreq_server
>>>>>>> type, then reset p2m_ioreq_server entries. The fact is the
>>>>>>> assumption
>>>>>>> isn't true, and sysnchronously reset function couldn't work. Then
>>>>>>> ioreq.entry_count is larger than zero after an ioreq server unmaps,
>>>>>>> finally this results DomU couldn't reboot.
>>>>>> I've had trouble understanding this part already on v1 (btw, why is
>>>>>> this one not tagged v2?), and since I still can't figure it I have
>>>>>> to ask:
>>>>>> Why is it that guest reboot is being impacted here? From what I
>>>>>> recall
>>>>>> a non-zero count should only prevent migration.
>>>>> [Zhang, Xiong Y] Sorry, although they solve the same issue, the
>>>>> solution is
>>>>> totally different, so I didn't mark this as V2, I will mark the
>>>>> following
>>>>> as v2 with this solution.
>>>>> During DomU reboot, it will first unmap ioreq server in shutdown
>>>>> process,
>>>>> then it call map ioreq server in boot process. The following
>>>>> sentence in
>>>>> p2m_set_ioreq_server() result mapping ioreq server failure, then DomU
>>>>> couldn't continue booting.
>>>>> If ( read_atomic(&p->ioreq.entry_count))
>>>>>     goto out;
>>>> It is clear that it would be this statement to be the problem one,
>>>> but I continue to not see why this would affect reboot: The rebooted
>>>> guest runs in another VM with, hence, a different p2m. I cannot see
>>>> why there would be a non-zero ioreq.entry_count the first time an
>>>> ioreq server claims the p2m_ioreq_server type for this new domain.
>>>>
>>> [Zhang, Xiong Y] This is what I see from xl dmesg when a DomU reboot
>>> 1) unmap io_req_server with old domid
>>> 2) map io_req_server with old domid
>>> 3)unmap io_req_server with old domid
>>> 4) map io_req_server with new domid
>>>
>>> The 1) and 2) are triggered by our device reset handler in qemu, it will
>>> destroy old device handler, then create device handler with the old
>>> domid
>>> again. so we could see ioreq.entry_could > 0 with old domid, then reboot
>>> process terminated.
>> Oh, so it prevents reboot of XenGT, but not of normal guests?
>>
>> Why does a reboot cause the device to detach, re-attach, and then
>> re-detach again?
>>
>> Also, I'm sorry for missing the bug during review, but it's a bit
>> annoying to find out that the core functionality of patch -- detaching
>> and re-attaching -- wasn't tested at all before submission.
> 
> Thanks for your reply, George.
> This error was introduced in the last version of patch "x86/ioreq
> server: Asynchronously reset
> outstanding p2m_ioreq_server entries.", which included 2 changes from
> its previous version:
> 
> 1> Do not reset p2m_ioreq_server back to p2m_ram_rw when an ioreq server
> is mapped;
> 
> 2> Use a helper function to return the p2m type -  this new helper
> routine returns p2m_ram_rw
> instead of p2m_ioreq_server if there's no mapping ioreq server. In
> previous versions I just returned
> p2m_ioreq_server, and XenGT tests all passed successfully. But I had not
> realized this issue for this
> last version and did not have enough time to do the test for this
> version in the last moment before
> code freeze...

Indeed, we were under a lot of time pressure and you did spend a lot of
time outside of your normal office hours to try to get things fixed; and
your (plural) testing did eventually turn this up before the release.
Sorry for being a bit irritable.

 -George

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.