Re: [Xen-devel] [PATCH 5/5] mm: Don't hold heap lock in alloc_heap_pages() longer than necessary

On 08/30/2017 09:06 AM, Andrew Cooper wrote:
> On 30/08/17 13:59, Boris Ostrovsky wrote:
>>>> This patch has been applied to staging, but it's got problems.  The
>>>> following crash is rather trivial to provoke:
>>>> ~Andrew
>>>> (d19) Test result: SUCCESS
>>>> (XEN) ----[ Xen-4.10-unstable  x86_64  debug=y   Tainted:    H ]----
>>>> (XEN) CPU:    5
>>>> (XEN) RIP:    e008:[<ffff82d0802252fc>] page_alloc.c#free_heap_pages+0x786/0x7a1
>>>> ...
>>>> (XEN) Pagetable walk from ffff82ffffffffe4:
>>>> (XEN)  L4[0x105] = 00000000abe5b063 ffffffffffffffff
>>>> (XEN)  L3[0x1ff] = 0000000000000000 ffffffffffffffff
>>> Some negative offset into somewhere, it seems. Upon second
>>> look I think the patch is simply wrong in its current shape:
>>> free_heap_pages() looks for page_state_is(..., free) when
>>> trying to merge chunks, while alloc_heap_pages() now sets
>>> PGC_state_inuse outside of the locked area. I'll revert it right
>>> away.
>> Yes, so we do need to update page state under heap lock. I'll then move
>> scrubbing (and checking) only to outside the lock.
>> I am curious though, what was the test to trigger this? I ran about 100
>> parallel reboots under memory pressure and never hit this.
> # git clone git://xenbits.xen.org/xtf.git
> # cd xtf
> # make -j4 -s
> # ./xtf-runner -qa
> Purposefully, ./xtf-runner doesn't synchronously wait for VMs to be
> fully destroyed before starting the next test.  (There is an ~800ms
> added delay to synchronously destroy HVM guests, over PV, which I expect
> is down to an interaction with qemu.  I got sufficiently annoyed that I
> coded around the issue.)
> As a result, destruction of one domain will be happening while
> construction of the next one is happening.

I was also doing overlapped destruction/construction but at random (so
overlaps didn't happen all the time).

xtf-runner indeed tripped this panic fairly quickly.
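
For reference, the ordering discussed above (page state updated under the
heap lock, only scrubbing moved outside it) can be sketched roughly as
follows. This is a simplified standalone model, not Xen code: the names
alloc_page_sketch(), free_page_sketch(), scrub_page(), struct page and the
toy spinlock are all hypothetical, and the "buddy" check merely stands in
for the page_state_is(..., free) test that free_heap_pages() performs when
merging chunks.

```c
/* Simplified model of the locking order; all names here are hypothetical. */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

enum page_state { STATE_FREE, STATE_INUSE };

struct page {
    enum page_state state;
    bool needs_scrub;
};

#define NR_PAGES 8
static struct page heap[NR_PAGES];          /* zero-initialised: all free */
static atomic_flag heap_lock = ATOMIC_FLAG_INIT;

/* Toy spinlock standing in for the heap lock. */
static void lock_heap(void)
{
    while (atomic_flag_test_and_set_explicit(&heap_lock, memory_order_acquire))
        ;
}

static void unlock_heap(void)
{
    atomic_flag_clear_explicit(&heap_lock, memory_order_release);
}

/* Stand-in for scrubbing: slow, but touches only the caller's own page. */
static void scrub_page(struct page *pg)
{
    pg->needs_scrub = false;
}

struct page *alloc_page_sketch(void)
{
    struct page *pg = NULL;

    lock_heap();
    for (size_t i = 0; i < NR_PAGES; i++) {
        if (heap[i].state == STATE_FREE) {
            pg = &heap[i];
            /* Must happen before unlocking: the free path inspects it. */
            pg->state = STATE_INUSE;
            break;
        }
    }
    unlock_heap();

    /* The page is already marked in use, so nobody can merge it now. */
    if (pg && pg->needs_scrub)
        scrub_page(pg);

    return pg;
}

/* Returns true if the neighbouring page is free, i.e. a merge would be tried. */
bool free_page_sketch(struct page *pg)
{
    size_t buddy = (size_t)(pg - heap) ^ 1;
    bool buddy_free;

    lock_heap();
    pg->state = STATE_FREE;
    pg->needs_scrub = true;
    buddy_free = (heap[buddy].state == STATE_FREE);
    unlock_heap();

    return buddy_free;
}
```

If the state assignment in alloc_page_sketch() were moved below
unlock_heap(), a concurrent free_page_sketch() on the neighbouring page
could still see STATE_FREE and try to merge a page that has already been
handed out, which is the kind of race described above.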


