[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [RFC] KEXEC: allocate crash note buffers at boot time



On 30/11/11 09:20, Jan Beulich wrote:
>>>> On 29.11.11 at 19:56, Andrew Cooper <andrew.cooper3@xxxxxxxxxx> wrote:
>> As I have little to no knowledge of this stage of the boot process, is
>> this a sensible way to be setting up the per_cpu areas?  I have a
>> sneaking suspicion that it will fall over if a CPU is onlined after
>> boot, and may also fall over if a CPU is offlined and reonlined later. 
>> There appears to be no infrastructure currently in place for this type
>> of initialization, which is quite possibly why the code exists in its
>> current form.
> I actually wonder how you came to those 4 statements you make in
> the description - none of these seem to me like they are really an
> issue (this would instead be plain bugs in Dom0). Did you actually look
> at the existing Dom0 implementation(s)?
>
> Further, while not being a huge waste of memory, it still is one in case
> kexec gets never enabled, especially when considering a Dom0 kernel
> that's being built without CONFIG_KEXEC (or an incapable on, like any
> pv-ops kernel to date). So I also conceptually question that change.
>
> Jan

We (XenServer) have had many cases of the kexec path failing on customer
boxes under weird and seemingly inexplicable circumstances.  This is why
I am working on trying to bullet-proofing the entire path.

We have 1 ticket where the contents of a crash note is clearly bogus (a
PRSTATUS is not 2GB long).  We have a ticket where it appears that the
kdump kernel has failed to reassemble /proc/vmcore from elfcorehdr as a
few pcpus worth of crash notes are missing.  I seem to remember a ticket
from before my time with a crash while writing the crash notes in Xen
itself. We even have a ticket stating that you get different crash notes
depending on whether you crash using the Xen debug keys (crash notes
appear completely bogus) or /proc/sysrq-trigger in dom0 (seems to be fine).

All of these are uncertain on reproducibility (except the final one
which was shown to reproduce on Xen-3.x and not on Xen-4.x so was not
investigated further) and have a habit of being unreproducible on any of
our local hardware, which makes fixing the problems tricky.

So yes -  the 4 points I have made are certainly not regular or common
behavior, but given some of the tickets we have, I am fairly sure it is
not a bug-free path.  I have checked the 2.6.32 implementation of dom0's
side of this and agree that it looks ok.  However, it is my opinion that
relying on a certain hypercalling pattern from dom0 is a perilous route
for Xen, whether it is likely for that pattern to change in the future
or not.

Having said all of this, I agree with your second paragraph.  As already
noted in my other email in this thread, I need to change the
implementation of this, so I will key the initial allocation of memory
on whether crashkernel= has been passed.  This should be sufficient
indication as to whether the user minds having the space allocated or not.

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.