
Re: [Xen-devel] kexec -e in PVHVM guests (and in PV).



Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx> writes:

> Hey, 
>
> I had on my todo list a patch from Olaf that moves
> the shared_page to the 0xFE700000 address (in the "gap"
> with newer QEMUs), which unfortunately did not work when
> migrating 32-bit PVHVM guests on Xen 4.1.
>
> The commit is 9d02b43dee0d7fb18dfb13a00915550b1a3daa9f
> "xen PVonHVM: use E820_Reserved area for shared_info" and it
> ended up being reverted. I dusted it off and I think I found
> the original bug (and fixed it), but the more I dug into this,
> the more issues I discovered.
>
> A bit about the use case - the 'kexec -e' allows one to
> restart the Linux kernel without a reboot. It is not a crash kernel
> so it is just meant to restart and work, and then restart, etc.
>
> The 'kdump -c' (crash) is a different use case and I had not
> thought much about it. But I think that all of the solutions
> I am thinking of will make it work as well (so you could
> do kexec-crash -> kexec-e -> kexec-e -> kexec-crash -> kexec-e,
> and so on, if you wanted to).
>
> The problem I uncovered was that the memory region where
> the new kernel would be executed had bits of memory changed - which
> meant that the purgatory code in kexec would detect the SHA1SUM
> being incorrect and not load. That led me to find out that
> VCPUOP_register_vcpu_info was the culprit (well, the xen_vcpu_info
> was being modified, and its PFN was in the 'new' kernel image area).
>
> Anyhow, the end result of that is that I think to get this
> working we would need to have:
>
>  1). A symmetrical VCPUOP_register_vcpu_info call, say
>      VCPUOP_unregister_vcpu_info, which would for a provided vcpu_id
>      set 'vcpu_info' to the shared_info, and 'vcpu_info_mfn' to
>      INVALID_MFN. Naturally the vCPU has to be down (_VPF_down).
>      A prototype patch along with a naive implementation in
>      the Linux kernel made this work surprisingly well!
>
>      On shutdown the Linux kernel had to call
>      disable_nonboot_cpus(), which would bring all the AP CPUs down.
>      Each AP CPU would call said hypercall. Also on each CPU
>      bringup we would call this (that is, the BSP would make this
>      call before bringing the AP CPUs up - on bootup paths it
>      would result in nothing, while for a kexec -c type kernel
>      it would allow us to use the CPUs).
>
>  2). Ditto for VCPUOP_register_runtime and
>      VCPUOP_register_runstate_memory_area.  They would need a
>      similar 'unregister' call with similar semantics as the
>      one above.
>
>  3). The shared_info. Olaf's patch stuck the shared_info in the
>      "gaps" of the E820 or the E820_RSRV region. But the recent patches
>      for PCI passthrough are making me twitchy and I think we would
>      need to parse the E820 and /proc/ioports (so the 'resource API in
>      the Linux kernel') to figure out a good place to stash this. Or on
>      shutdown (kexec -e) balloon out the shared region (need to
>      double check that this is possible in the first place).
>
>  4). Balloon memory. I am not really sure how to deal with that. The
>      guest might have ballooned out tons of memory but the new kernel
>      won't know about it until the xen/balloon driver kicks in and
>      figures this out based on XenStore. Then it will try to balloon
>      out.. and depending on its luck - balloon out memory that was
>      already ballooned out, or not.  Also during the bootup of
>      the 'kexec -e' kernel it might touch pages that had been
>      ballooned out - and try to use them!
>
>  5). Events. Olaf had written code a long time ago that would poke the
>      events to see if they were already in use (-EEXIST) and if so
>      re-use them - it works great, albeit there are tons of messages
>      in the Xen ring buffer. The Linux patch I wrote did a
>      'disable_nonboot_cpus' and also tore down the BSP interrupts - that
>      meant that all of the events were nicely torn down. This all works
>      on non-FIFO events.  David Vrabel says that to make this work
>      (re-use or teardown and bring up) would be hard.
>
>  6). QEMU PnP type devices. Devices such as 'serial', 'i8042', and
>      'rtc' end up going through EVTCHNOP_bind_pirq. Somehow on the
>      'kexec -e' kernel the binding appears to succeed, but the devices
>      don't work anymore. That is - the serial console does not accept
>      any more input (but it can output alright).
>
>  7). Grants. Andrew Cooper hinted at this and a bit of experimentation
>      shows that Xen hypervisor will indeed smack down any guest that
>      tries to re-use its "old" grants. I am not even sure if the
>      GNTTAB_setup call is returning the "old" grant frames.
>      His suggestion was 'GNTTAB_reset' to well, reset everything.
>
> My thinking is that a lot of this code is shared with PV (and PVH),
> so once this is fixed we could do full scale 'kexec -e' in a PV
> (or PVH) type guest. Doing dom0 kexec -e would be an interesting
> experiment :-(
>
> I am unable to fix this for Xen 4.5 and I am not sure what other
> issues are present. If folks have some ideas or would like to
> chime in (or even pick some of these up!) - please do respond.
>

I have one more issue related to the kexec/kdump topic that I'm
investigating right now.

When kdump happens and the new kernel boots, we have the /proc/vmcore
device. Plain reads from this device work without problems; however,
makedumpfile reads it with mmap() by default and that doesn't work for
me.

I figured out the following: there are several pages (2 in my case) in
vmcore which are not RAM. read_from_oldmem() calls a special
pfn_is_ram() check (which does HVMOP_get_mem_type, and these pages are
reported as HVMMEM_mmio_dm) and skips them. mmap_vmcore() doesn't have
this check, so these pages get mapped. When we do memcpy() from them we
get stuck if we try to read more than 16 bytes (that's weird).

I have a 'quick and dirty' patch which brings the pfn_is_ram() check to
mmap_vmcore() and replaces all HVMMEM_mmio_dm pages with an empty
page. I'm going to investigate a bit more here.
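
To make the idea concrete, here is a rough userspace model of that
substitution. It is only a sketch and not the actual kernel patch: all
sketch_* names, the flat in-memory "old kernel" layout and the page size
constant are made up for illustration, and the real check goes through
pfn_is_ram() / HVMOP_get_mem_type rather than a type array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u

/* Hypothetical stand-in for the types reported by HVMOP_get_mem_type. */
enum sketch_mem_type { SKETCH_MEM_RAM, SKETCH_MEM_MMIO_DM };

/* Hypothetical model of the old kernel's memory: one type per pfn plus
 * a flat buffer holding nr_pages pages of data. */
struct sketch_oldmem {
    const enum sketch_mem_type *types;
    const unsigned char *data;
    size_t nr_pages;
};

/* Models the pfn_is_ram() check: only pages typed as RAM are readable. */
static bool sketch_pfn_is_ram(const struct sketch_oldmem *m, size_t pfn)
{
    return pfn < m->nr_pages && m->types[pfn] == SKETCH_MEM_RAM;
}

/* Copy pages [first_pfn, first_pfn + count) into dst. Pages that are
 * not RAM are never touched; an all-zero page is handed out instead.
 * Returns how many pages were substituted. */
static size_t sketch_copy_oldmem(const struct sketch_oldmem *m,
                                 size_t first_pfn, size_t count,
                                 unsigned char *dst)
{
    size_t substituted = 0;

    assert(m != NULL && dst != NULL);
    for (size_t i = 0; i < count; i++) {
        size_t pfn = first_pfn + i;
        unsigned char *out = dst + i * SKETCH_PAGE_SIZE;

        if (sketch_pfn_is_ram(m, pfn)) {
            memcpy(out, m->data + pfn * SKETCH_PAGE_SIZE, SKETCH_PAGE_SIZE);
        } else {
            memset(out, 0, SKETCH_PAGE_SIZE);
            substituted++;
        }
    }
    return substituted;
}
```

With three pages where the middle one is typed SKETCH_MEM_MMIO_DM, the
caller gets the first and last page copied as-is, a zeroed page in
between, and a return value of 1.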

I can try looking at some of the items above as well. E.g. I was able
to solve no. 6 with the following (yes, dirty again) patch:

commit 23a224c4ad664dfc6fe672f74f83549387efebda
Author: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>
Date:   Wed Jun 18 14:12:15 2014 +0200

    wip: unmap all pirqs
    
    Signed-off-by: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>

diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index dfa12a4..16af7e4 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -1658,6 +1719,35 @@ void xen_callback_vector(void) {}
 static bool fifo_events = true;
 module_param(fifo_events, bool, 0);
 
+static void unmap_all_pirqs(void)
+{
+       struct evtchn_status status;
+       int port, rc = -ENOENT;
+       struct physdev_unmap_pirq unmap_irq;
+       struct evtchn_close close;
+
+       memset(&status, 0, sizeof(status));
+       for (port = 0; port < xen_evtchn_max_channels(); port++) {
+               status.dom = DOMID_SELF;
+               status.port = port;
+               rc = HYPERVISOR_event_channel_op(EVTCHNOP_status, &status);
+               if (rc < 0)
+                       continue;
+               pr_warn("unmap_all_pirqs: port: %d, status: %d\n", status.port, status.status);
+               if (status.status == EVTCHNSTAT_pirq) {
+                       close.port = port;
+                       if (HYPERVISOR_event_channel_op(EVTCHNOP_close, &close) != 0)
+                               pr_warn("EVTCHNSTAT_pirq: failed to close event channel %d\n", port);
+                       unmap_irq.pirq = status.u.pirq;
+                       unmap_irq.domid = DOMID_SELF;
+                       pr_warn("unmapping previously mapped pirq %d\n", unmap_irq.pirq);
+                       if (HYPERVISOR_physdev_op(PHYSDEVOP_unmap_pirq, &unmap_irq) != 0)
+                               pr_warn("failed to unmap pirq %d\n", unmap_irq.pirq);
+               }
+       }
+}
+
+
 void __init xen_init_IRQ(void)
 {
        int ret = -EINVAL;
@@ -1686,6 +1776,8 @@ void __init xen_init_IRQ(void)
                xen_callback_vector();
 
        if (xen_hvm_domain()) {
+               unmap_all_pirqs();
+
                native_init_IRQ();
                /* pci_xen_hvm_init must be called after native_init_IRQ so that
                 * __acpi_register_gsi can point at the right function */

-- 
  Vitaly
