Xen project Mailing List

Re: [Xen-devel] kexec -e in PVHVM guests (and in PV).

To: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>

From: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>

Date: Tue, 01 Jul 2014 10:12:58 +0200

Cc: xen-devel@xxxxxxxxxxxxxxxxxxxx, daniel.kiper@xxxxxxxxxx

Delivery-date: Tue, 01 Jul 2014 08:13:06 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx> writes:

> Hey,
>
> I had on my todo list a patch from Olaf that shuffles
> the shared_page to be in the 0xFE700000 addr (in the "gap"
> with newer QEMUs) which unfortunately did not work when
> migrating on 32-bit PVHVM guests on Xen 4.1.
>
> The commit is 9d02b43dee0d7fb18dfb13a00915550b1a3daa9f
> "xen PVonHVM: use E820_Reserved area for shared_info" and it
> ended up being reverted. I dusted it off and I think I found
> the original bug (and fixed it), but the more I dug into this,
> the more issues I discovered.
>
> A bit about the use case - the 'kexec -e' allows one to
> restart the Linux kernel without a reboot. It is not a crash kernel
> so it is just meant to restart and work, and then restart, etc.
>
> The 'kdump -c' (crash) is a different use case and I had not
> thought much about it. But I think that all of the solutions
> I am thinking of will make it also work. (so you could
> do kexec-crash -> kexec-e -> kexec-e -> kexec-crash -> kexec-e, and
> so on, if you wanted to).
>
> The problem I uncovered was that the memory region where
> the new kernel would be executed had bits of memory changed - which
> meant that the purgatory code in kexec would detect the SHA1SUM
> being incorrect and not load. That led me to find out that
> VCPUOP_register_vcpu_info was the culprit (well, the xen_vcpu_info
> was being modified, and its PFN was in the 'new' kernel image area).
>
> Anyhow, the end result of that is that I think to get this
> working we would need to have:
>
> 1). A symmetrical VCPUOP_register_vcpu_info call, say
> VCPUOP_unregister_vcpu_info, which would for a provided vcpu_id
> set 'vcpu_info' to the shared_info, and 'vcpu_info_mfn' to
> INVALID_MFN. Naturally the vcpu_id has to be down (_VPF_down).
> A prototype patch along with a naive implementation in
> the Linux kernel made this work surprisingly well!
>
> The Linux kernel had to call on shutdown
> disable_nonboot_cpus(), which would bring all the AP CPUs down.
> Each AP CPU would call said hypercall. Also on each CPU
> bringup we would call this (that is the BSP would make this
> call before bringing the AP CPUs up - on bootup paths it
> would result in nothing, while for a kexec -c type kernel
> it would allow us to use the CPUs).
>
> 2). Ditto for VCPUOP_register_runtime and
> VCPUOP_register_runstate_memory_area. They would need a
> similar 'unregister' call with similar semantics as the
> one above.
>
> 3). The shared_info. Olaf's patch stuck the shared_info in the
> "gaps" of the E820 or the E820_RSRV region. But the recent patches
> for PCI passthrough are making me twitchy and I think we would
> need to parse the E820 and /proc/ioports (so the 'resource API' in
> the Linux kernel) to figure out a good place to stash this. Or on
> shutdown (kexec -e) balloon out the shared region (need to
> double check that this is possible in the first place).
>
> 4). Balloon memory. I am not really sure how to deal with that. The
> guest might have ballooned out tons of memory but the new kernel
> won't know about it until the xen/balloon driver kicks in and
> figures this out based on XenStore. Then it will try to balloon
> out.. and depending on its luck - balloon out memory that was
> already ballooned out, or not. Also during the bootup of
> the 'kexec -e' kernel it might touch pages that had been
> ballooned out - and try to use them!
>
> 5). Events. Olaf had written code a long time ago that would poke the
> events to see if they were already in use (-EEXIST) and if so
> re-use them - it works great albeit there are tons of messages
> in the Xen ring buffer. The Linux patch I wrote did a
> 'disable_nonboot_cpus' and also tore down the BSP interrupts - that
> meant that all of the events were nicely torn down. This all works
> on non-FIFO events.
> David Vrabel says that to make this work
> (re-use or teardown and bring up) would be hard.
>
> 6). QEMU PnP type devices, such as 'serial', 'i8042', and 'rtc', end up
> going through EVTCHNOP_bind_pirq. Somehow on the 'kexec -e'
> kernel we end up doing OK, but the devices don't work anymore.
> That is - the serial input does not accept any more input (but
> it can output alright).
>
> 7). Grants. Andrew Cooper hinted at this and a bit of experimentation
> shows that the Xen hypervisor will indeed smack down any guest that
> tries to re-use its "old" grants. I am not even sure if the
> GNTTAB_setup call is returning the "old" grant frames.
> His suggestion was 'GNTTAB_reset' to, well, reset everything.
>
> My thinking is that a lot of this code is shared with PV (and PVH),
> so once this is fixed we could do full scale 'kexec -e' in a PV
> (or PVH) type guest. Doing dom0 kexec -e would be an interesting
> experiment :-(
>
> I am unable to fix this for Xen 4.5 and I am not sure what other
> issues there are present. If folks have some ideas or would like to
> chime in (or even pick some of these up!) - please do respond.
>

I have one more issue related to the kexec/kdump topic that I'm
investigating right now. When kdump happens and the new kernel boots we
have the /proc/vmcore device. There is no problem in reading from this
device; however, makedumpfile reads it with mmap() by default and that
doesn't work for me. I figured out the following: there are several
pages (2 in my case) in vmcore which are not RAM. read_from_oldmem()
calls a special pfn_is_ram() check (which does HVMOP_get_mem_type, and
these pages are reported as HVMMEM_mmio_dm) and skips them.
mmap_vmcore() doesn't have this check, so we get these pages mapped.
When we do memcpy() from them we get stuck if we try reading more than
16 bytes (that's weird). I have a 'quick and dirty' patch which brings
the pfn_is_ram() check to mmap_vmcore() and replaces all HVMMEM_mmio_dm
pages with an empty page.
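
To make the idea concrete, here is a minimal userspace model of that
substitution. It is only a sketch and not the actual patch: everything
prefixed with model_ or MODEL_ is a hypothetical stand-in (including
model_pfn_is_ram(), which only imitates the kernel's pfn_is_ram()
check), the page types are hard-coded toy data, and no hypercall is
made. The "mapping" is just an array of pointers, one per page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096u
#define MODEL_NR_PAGES  6u

/* Hypothetical stand-in for the per-pfn type HVMOP_get_mem_type reports. */
enum model_mem_type { MODEL_RAM, MODEL_MMIO_DM };

/* Toy copy of the old kernel's memory, one row per page frame. */
static unsigned char model_old_mem[MODEL_NR_PAGES][MODEL_PAGE_SIZE];

/* Two non-RAM pages (pfn 2 and pfn 4), chosen arbitrarily for the model. */
static const enum model_mem_type model_types[MODEL_NR_PAGES] = {
	MODEL_RAM, MODEL_RAM, MODEL_MMIO_DM, MODEL_RAM, MODEL_MMIO_DM, MODEL_RAM,
};

/* Shared all-zero page handed out in place of every non-RAM page. */
static const unsigned char model_zero_page[MODEL_PAGE_SIZE];

/* Give every page a recognisable fill byte: 0xA0 + pfn. */
static void model_fill_old_memory(void)
{
	for (size_t pfn = 0; pfn < MODEL_NR_PAGES; pfn++)
		memset(model_old_mem[pfn], (int)(0xA0 + pfn), MODEL_PAGE_SIZE);
}

/* Imitates the pfn_is_ram() check using the hard-coded table above. */
static bool model_pfn_is_ram(size_t pfn)
{
	return model_types[pfn] == MODEL_RAM;
}

/*
 * Fill map[0..count-1] with one pointer per page starting at pfn 'first'.
 * RAM pages point at the old memory; non-RAM pages point at the zero page.
 * Returns how many pages were replaced by the zero page.
 */
static size_t model_map_vmcore(const unsigned char **map, size_t first,
			       size_t count)
{
	size_t substituted = 0;

	assert(first <= MODEL_NR_PAGES && count <= MODEL_NR_PAGES - first);
	for (size_t i = 0; i < count; i++) {
		size_t pfn = first + i;

		if (model_pfn_is_ram(pfn)) {
			map[i] = model_old_mem[pfn];
		} else {
			map[i] = model_zero_page;
			substituted++;
		}
	}
	return substituted;
}
```

A caller would run model_fill_old_memory() once and then
model_map_vmcore() over the range it wants; any later memcpy() through
the returned pointers only ever touches RAM or the zero page, which is
the property the quick and dirty patch is after.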
I'm going to investigate a bit more here.

I can try looking at something from the above as well. E.g. I was able
to solve no.6 with the following (yes, dirty again) patch:

commit 23a224c4ad664dfc6fe672f74f83549387efebda
Author: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>
Date:   Wed Jun 18 14:12:15 2014 +0200

    wip: unmap all pirqs

    Signed-off-by: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>

diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index dfa12a4..16af7e4 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -1658,6 +1719,35 @@ void xen_callback_vector(void) {}
 static bool fifo_events = true;
 module_param(fifo_events, bool, 0);
 
+static void unmap_all_pirqs(void)
+{
+	struct evtchn_status status;
+	int port, rc = -ENOENT;
+	struct physdev_unmap_pirq unmap_irq;
+	struct evtchn_close close;
+
+	memset(&status, 0, sizeof(status));
+	for (port = 0; port < xen_evtchn_max_channels(); port++) {
+		status.dom = DOMID_SELF;
+		status.port = port;
+		rc = HYPERVISOR_event_channel_op(EVTCHNOP_status, &status);
+		if (rc < 0)
+			continue;
+		pr_warn("unmap_all_pirqs: port: %d, status: %d\n", status.port, status.status);
+		if (status.status == EVTCHNSTAT_pirq) {
+			close.port = port;
+			if (HYPERVISOR_event_channel_op(EVTCHNOP_close, &close) != 0)
+				pr_warn("EVTCHNSTAT_pirq: failed to close event channel %d\n", port);
+			unmap_irq.pirq = status.u.pirq;
+			unmap_irq.domid = DOMID_SELF;
+			pr_warn("unmapping previously mapped pirq %d\n", unmap_irq.pirq);
+			if (HYPERVISOR_physdev_op(PHYSDEVOP_unmap_pirq, &unmap_irq) != 0)
+				pr_warn("failed to unmap pirq %d\n", unmap_irq.pirq);
+		}
+	}
+}
+
+
 void __init xen_init_IRQ(void)
 {
 	int ret = -EINVAL;
@@ -1686,6 +1776,8 @@ void __init xen_init_IRQ(void)
 		xen_callback_vector();
 
 	if (xen_hvm_domain()) {
+		unmap_all_pirqs();
+
 		native_init_IRQ();
 		/* pci_xen_hvm_init must be called after native_init_IRQ so that
 		 * __acpi_register_gsi can point at the right function */

-- 
  Vitaly

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
