Re: [Xen-devel] [VMI] Possible race-condition in altp2m APIs
Hi Andrew, Tamas,
On Wednesday, May 29, 2019 2:49 AM, Andrew Cooper <andrew.cooper3@xxxxxxxxxx> wrote:
After a series of tests on 1 or 4 VCPUs, my domain ends up in one of 2 possible states:
- frozen: the mouse doesn't move, so I would guess the VCPUs are blocked.
I'm calling the xc_(un)pause_domain APIs multiple times when I write to the shadow copies,
but it's always synchronous, so I doubt that they interfered and "paused" the domain.
xc_{,un}pause_domain() calls are reference counted. Calling unpause
too many times should be safe (from a refcount point of view), and
should fail the hypercall with -EINVAL.
Also, the log output I have before I detect that Ansible failed to execute shows that the resume succeeded and
Xen is ready to process VMI events.
- BSOD: that's the second possibility; apparently I'm corrupting a critical data structure in the operating system,
and the Windbg analysis is inconclusive, so I can't tell much.
Either way, I can't execute this test sequentially 10,000 times without a crash.
Ok good - this is a far easier place to start debugging from.
-> Could you look at the implementation and tell me if I misused the APIs somewhere?
https://gist.github.com/mtarral/d99ce5524cfcfb5290eaa05702c3e8e7
Some observations.
1) In xen_pause_vm(), you do an xc_domain_getinfo(). First of all,
the API is crazy, so you also need to check "|| info.domid != domid"
in your error condition, but that shouldn't be an issue here as the
domid isn't changing.
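A pause helper with that check might look roughly like this (a minimal sketch against
libxc, not the code from the gist; error handling trimmed):

    #include <xenctrl.h>

    /* Sketch of a pause helper with the checks described above. */
    static int pause_vm(xc_interface *xch, uint32_t domid)
    {
        xc_dominfo_t info;

        /* xc_domain_getinfo() returns info for the first domain whose id is
         * >= domid, so both the return value and info.domid must be checked. */
        if (xc_domain_getinfo(xch, domid, 1, &info) != 1 || info.domid != domid)
            return -1;

        if (info.paused)        /* already paused: nothing to do */
            return 0;

        return xc_domain_pause(xch, domid);
    }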
"The API is crazy"
Could you elaborate on that?
Furthermore, the results of xc_domain_getinfo() are stale before the
hypercall returns, so it is far less useful than it appears.
I looked at libvmi's implementation, and I guess I should have implemented the checks too.
They just didn't make sense to me, as I was sure that my calls were synchronous, one after the other.
I'm afraid that the only safe way to issue pause/unpauses is to know
that you've reference counted your own correctly. All entities in
dom0 with privilege can fight over each others references, because
there is nothing Xen can use to distinguish the requests.
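One way to keep your own count straight is a thin wrapper around the hypercalls,
e.g. (a single-threaded sketch with hypothetical names):

    #include <xenctrl.h>

    /* Hypothetical wrapper that keeps a local pause count, so every pause we
     * issue is matched by exactly one unpause. */
    struct dom_handle {
        xc_interface *xch;
        uint32_t domid;
        unsigned int pause_count;
    };

    static int dom_pause(struct dom_handle *h)
    {
        if (xc_domain_pause(h->xch, h->domid))
            return -1;
        h->pause_count++;
        return 0;
    }

    static int dom_unpause(struct dom_handle *h)
    {
        if (!h->pause_count)    /* never unpause more than we paused */
            return 0;
        if (xc_domain_unpause(h->xch, h->domid))
            return -1;
        h->pause_count--;
        return 0;
    }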
-> I removed my calls to xc_pause/resume.
2) You allocate a libxl context but do nothing with it. That can
all go, along with the linkage against libxl. Also, you don't need
to create a logger like that. Despite being utterly unacceptable
behaviour for a library, it is the default by passing NULL in
xc_interface_open().
I followed drakvuf's xen init function, as I thought I was going to need this at some point.
Same for the xl_logger initialization.
3) A malloc()/memset() pair is more commonly spelt calloc()
True.
And some questions.
1) I'm guessing the drakvuf_inject_trap(drakvuf, 0x293e6a0, 0)
call is specific to the exact windows kernel in use?
Yes, I used a libvmi python script to translate the symbol -> virtual address -> physical address.
Then I replaced that value in my code and recompiled the binary before the test.
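For reference, the same translation can also be done from C with libvmi's helpers,
roughly as below (a sketch; the exact signatures depend on the libvmi version, the
out-parameter form is the newer one):

    #include <libvmi/libvmi.h>

    /* Sketch: kernel symbol -> virtual address -> physical address. */
    static int symbol_to_pa(vmi_instance_t vmi, const char *symbol, addr_t *pa)
    {
        addr_t va = 0;

        if (VMI_SUCCESS != vmi_translate_ksym2v(vmi, symbol, &va))
            return -1;
        if (VMI_SUCCESS != vmi_translate_kv2p(vmi, va, pa))
            return -1;
        return 0;
    }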
2) In vmi_init(), what is the purpose of fmask and zero_page_gfn?
You add one extra gfn to the guest, called zero_page, and fill it
with 1's from fmask.
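i.e. roughly the following pattern (a sketch of the populate-and-fill step, not the
exact code from the gist; the helper name and gfn choice are illustrative):

    #include <string.h>
    #include <sys/mman.h>
    #include <xenctrl.h>

    /* Sketch: populate one extra gfn in the guest and fill it with 1's
     * ("fmask"); gfn is assumed to be chosen past the guest's last gfn. */
    static int add_ff_page(xc_interface *xch, uint32_t domid, xen_pfn_t gfn)
    {
        if (xc_domain_populate_physmap_exact(xch, domid, 1, 0, 0, &gfn))
            return -1;

        uint8_t *page = xc_map_foreign_range(xch, domid, XC_PAGE_SIZE,
                                             PROT_READ | PROT_WRITE, gfn);
        if (!page)
            return -1;

        memset(page, 0xFF, XC_PAGE_SIZE);   /* a page full of 1's */
        munmap(page, XC_PAGE_SIZE);
        return 0;
    }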
3) You create two altp2m's, but both have the same default access.
Is this deliberate, or a bug? If deliberate, why?
Finally, and probably the source of the memory corruption...
4) When injecting a trap, you allocate a new gfn, memcpy() the
contents and insert a 0xcc (so far so good). You then remap the
executable view to point at the new gfn with a breakpoint in (fine),
and remap the readable view to point at the zero_page, which is full
of 1's (uh-oh).
What is this final step trying to achieve? It guarantees that
patch-guard will eventually notice and BSOD your VM for critical
structure corruption. The read-only view needs to point to the
original gfn with only read permissions, so when Windows reads the
gfn back, it sees what it expects. You also need to prohibit writes
to either gfn so you can spot writes (unlikely in this case but
important for general introspection) so you can propagate the change
to both copies.
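In libxc terms, the fix described above would look roughly like the sketch below;
the helper name, view ids and exact access values are illustrative, not code from
drakvuf or the gist:

    #include <xenctrl.h>

    /* Sketch of the remapping described above.  exec_view/read_view are the
     * two altp2m views, orig_gfn is the trapped page, shadow_gfn the copy
     * carrying the 0xcc. */
    static int remap_trap(xc_interface *xch, uint32_t domid,
                          uint16_t exec_view, uint16_t read_view,
                          xen_pfn_t orig_gfn, xen_pfn_t shadow_gfn)
    {
        /* Executable view: the trapped slot points at the shadow copy. */
        if (xc_altp2m_change_gfn(xch, domid, exec_view, orig_gfn, shadow_gfn))
            return -1;

        /* Execute-only there, so reads/writes fault to the monitor instead
         * of exposing the 0xcc to the guest. */
        if (xc_altp2m_set_mem_access(xch, domid, exec_view, orig_gfn,
                                     XENMEM_access_x))
            return -1;

        /* Readable view: keep the ORIGINAL gfn, read-only, so Windows reads
         * back what it expects and writes can be spotted and propagated. */
        if (xc_altp2m_set_mem_access(xch, domid, read_view, orig_gfn,
                                     XENMEM_access_r))
            return -1;

        return 0;
    }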
Yes, I missed that PatchGuard would eventually check those shadow pages anyway.
I was already happy to see that my breakpoints were working, and I proceeded to the tests,
hoping to reproduce the bug quickly.
I implemented a basic mem_access event handler, restricting the original GFN being remapped to execute-only (--X),
and switching to the hostp2m with singlestepping to escape PatchGuard, as sketched below.
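The view-switch-and-singlestep part, expressed as libvmi event callbacks, is roughly
the pattern below (a sketch from memory; exec_view is assumed to be the altp2m view
holding the breakpoints):

    #include <libvmi/libvmi.h>
    #include <libvmi/events.h>

    static uint16_t exec_view;   /* assumed: the altp2m view with the 0xcc's */

    /* mem_access violation: drop back to the host p2m (view 0) and
     * single-step over the faulting access. */
    static event_response_t mem_cb(vmi_instance_t vmi, vmi_event_t *event)
    {
        (void)vmi;
        event->slat_id = 0;
        return VMI_EVENT_RESPONSE_SLAT_ID | VMI_EVENT_RESPONSE_TOGGLE_SINGLESTEP;
    }

    /* After the single step, switch back to the breakpointed view. */
    static event_response_t ss_cb(vmi_instance_t vmi, vmi_event_t *event)
    {
        (void)vmi;
        event->slat_id = exec_view;
        return VMI_EVENT_RESPONSE_SLAT_ID | VMI_EVENT_RESPONSE_TOGGLE_SINGLESTEP;
    }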
It works, but I end up in a situation where Xen fails at some point: after ~90 tests, it cannot populate the ring anymore:
INFO:root:==== test 92 ====
INFO:root:starting drakvuf
INFO:root:starting Ansible
INIT
xen_init_interface
xc_interface_open
create logger
allocating libxc context
init ring page
xc: error: Failed to populate ring pfn
(16 = Device or resource busy): Internal error
fail to enable monitoring: Device or resource busy
fail to init xen interface
CLOSE
Fail to init vmi
What do you think happened?
I have a call to xc_domain_setmaxmem with ~0, so this shouldn't happen, should it?
I used the compat APIs, like Drakvuf does.
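For context, the call I mean just lifts the memory cap before the extra gfns are
populated, roughly:

    /* Lift the domain's memory cap so the extra shadow/ring gfns can be
     * populated (as referenced above; exact placement may differ). */
    if (xc_domain_setmaxmem(xch, domid, ~0) < 0)
        return -1;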
@Tamas, if you could check the traps implementation.
You also have stress-test.py, which is the small test suite that I used, and
the screenshot showing the stdout preceding a test failure,
when Ansible couldn't contact the WinRM service because the domain was frozen.
Note: I stole some code from libvmi to handle page reads/writes in Xen.
PS: in the case where the domain is frozen and I destroy it, a (null) entry remains
in xl list, even though my stress-test.py process is already dead.
I have 4 of these entries in my xl list right now.
That's almost certainly a reference not being dropped on a page.
Can you run `xl debug-keys q` and paste the resulting analysis which
will be visible in `xl dmesg`?
It is probably some missing cleanup in the altp2m code.
I just checked, and I had a few "xen-drakvuf" processes still running in the background.
When the main Python process raised an exception and terminated, they became attached to systemd.
Killing them removed the (null) domain entries.
Thanks,
Mathieu
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel