
Re: [Xen-devel] Xen PV domain regression with KASLR enabled (kernel 3.16)



On Wed, Aug 27, 2014 at 10:03:10AM +0200, Stefan Bader wrote:
> On 26.08.2014 18:01, Konrad Rzeszutek Wilk wrote:
> > On Fri, Aug 22, 2014 at 11:20:50AM +0200, Stefan Bader wrote:
> >> On 21.08.2014 18:03, Kees Cook wrote:
> >>> On Tue, Aug 12, 2014 at 2:07 PM, Konrad Rzeszutek Wilk
> >>> <konrad.wilk@xxxxxxxxxx> wrote:
> >>>> On Tue, Aug 12, 2014 at 11:53:03AM -0700, Kees Cook wrote:
> >>>>> On Tue, Aug 12, 2014 at 11:05 AM, Stefan Bader
> >>>>> <stefan.bader@xxxxxxxxxxxxx> wrote:
> >>>>>> On 12.08.2014 19:28, Kees Cook wrote:
> >>>>>>> On Fri, Aug 8, 2014 at 7:35 AM, Stefan Bader 
> >>>>>>> <stefan.bader@xxxxxxxxxxxxx> wrote:
> >>>>>>>> On 08.08.2014 14:43, David Vrabel wrote:
> >>>>>>>>> On 08/08/14 12:20, Stefan Bader wrote:
> >>>>>>>>>> Unfortunately I have not yet figured out why this happens, but I can
> >>>>>>>>>> confirm, by compiling with or without CONFIG_RANDOMIZE_BASE set, that
> >>>>>>>>>> without KASLR all is ok, but with it enabled there are issues
> >>>>>>>>>> (actually, a dom0 does not even boot, as a follow-up error).
> >>>>>>>>>>
> >>>>>>>>>> Details can be seen in [1], but basically this is always some portion
> >>>>>>>>>> of a vmalloc allocation failing after hitting a freshly allocated PTE
> >>>>>>>>>> space that is not pte_none (usually from a module load triggered by
> >>>>>>>>>> systemd-udevd). In the non-dom0 case this repeats many times but ends
> >>>>>>>>>> in a guest that allows login. In the dom0 case there is a more fatal
> >>>>>>>>>> error at some point, causing a crash.
> >>>>>>>>>>
> >>>>>>>>>> I have not tried this for a normal PV guest, but for dom0 it also
> >>>>>>>>>> does not help to add "nokaslr" to the kernel command-line.
> >>>>>>>>>
> >>>>>>>>> Maybe it's overlapping with regions of the virtual address space
> >>>>>>>>> reserved for Xen?  What is the VA that fails?
> >>>>>>>>>
> >>>>>>>>> David
> >>>>>>>>>
> >>>>>>>> Yeah, there is some code to avoid some regions of memory (like
> >>>>>>>> initrd). Maybe missing p2m tables? I probably need to add debugging to
> >>>>>>>> find the failing VA (iow not sure whether it might be somewhere in the
> >>>>>>>> stacktraces in the report).
> >>>>>>>>
> >>>>>>>> The kernel command-line does not seem to be looked at. It should put
> >>>>>>>> something into dmesg and that never shows up. Also, today's random
> >>>>>>>> feature is other PV guests crashing after a bit somewhere in the
> >>>>>>>> check_for_corruption area...
> >>>>>>>
> >>>>>>> Right now, the kaslr code just deals with initrd, cmdline, etc. If
> >>>>>>> there are other reserved regions that aren't listed in the e820, it'll
> >>>>>>> need to locate and skip them.
> >>>>>>>
> >>>>>>> -Kees
> >>>>>>>
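Side note, mostly for anyone skimming the archives: the boot-time KASLR code keeps only a small, hard-coded list of regions to stay clear of (initrd, command line, etc., as Kees says), so anything Xen reserves that never shows up in the e820 is simply invisible to it. Conceptually it is nothing more than an overlap test over that list; a purely illustrative sketch of the idea (not the kernel's actual code, the names here are placeholders):

    /* Illustrative only -- the real list and helpers live in the
     * compressed-boot KASLR code; these names are placeholders. */
    struct mem_region {
            unsigned long start;
            unsigned long size;
    };

    /* Regions the randomized kernel placement must not overlap.  A region
     * Xen reserves but does not report via the e820 never ends up here. */
    static struct mem_region mem_avoid_list[8];

    static int region_overlaps(const struct mem_region *r,
                               unsigned long start, unsigned long size)
    {
            /* half-open interval overlap check */
            return start < r->start + r->size && r->start < start + size;
    }
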
> >>>>>> Making my little steps towards more understanding, I figured out that
> >>>>>> it isn't the code that does the relocation. Even with that completely
> >>>>>> disabled, the vmalloc issues were still there. What causes it seems to
> >>>>>> be the default of the upper limit, and that this changes the split
> >>>>>> between kernel and modules to 1G+1G instead of 512M+1.5G. That is the
> >>>>>> reason why nokaslr has no effect.
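Side note for the archives: if I read the 3.16-era headers right, that split follows directly from KERNEL_IMAGE_SIZE, which with CONFIG_RANDOMIZE_BASE picks up the configured KASLR maximum offset. Quoting roughly from memory, so treat the exact form as an assumption and check the tree:

    /* Approximate 3.16-era definitions (arch/x86/include/asm/page_64_types.h
     * and pgtable_64_types.h), reproduced from memory -- verify before
     * quoting further. */
    #ifdef CONFIG_RANDOMIZE_BASE
    #define KERNEL_IMAGE_SIZE  (CONFIG_RANDOMIZE_BASE_MAX_OFFSET) /* 1G default */
    #else
    #define KERNEL_IMAGE_SIZE  (512 * 1024 * 1024)
    #endif

    #define MODULES_VADDR      (__START_KERNEL_map + KERNEL_IMAGE_SIZE)

If that is right, the layout change is baked in at compile time, which would explain why booting with nokaslr cannot undo it.
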
> >>>>>
> >>>>> Oh! That's very interesting. There must be some assumption in Xen
> >>>>> about the kernel VM layout then?
> >>>>
> >>>> No. I think most of the changes that look at PTEs and PMDs are all in
> >>>> arch/x86/xen/mmu.c. I wonder if this is xen_cleanhighmap being too
> >>>> aggressive.
> >>>
> >>> (Sorry I had to cut our chat short at Kernel Summit!)
> >>>
> >>> It sounded like there was another region of memory that Xen was setting
> >>> aside for page tables? But Stefan's investigation seems to show this
> >>> isn't about layout at boot (since the kaslr=0 case means no relocation
> >>> is done). Sounds more like the split between kernel and modules area,
> >>> so I'm not sure how the memory area after the initrd would be part of
> >>> this. What should the next steps be, do you think?
> >>
> >> Maybe layout, but not about placement of the kernel. Basically, leaving
> >> KASLR enabled but shrinking the possible range back to the original
> >> kernel/module split is fine as well.
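For reference, that shrink should just be the Kconfig upper bound; assuming I remember the 3.16 option name correctly (please double-check), a test config would look like:

    # Hedged example -- option name recalled from memory, not verified here.
    CONFIG_RANDOMIZE_BASE=y
    CONFIG_RANDOMIZE_BASE_MAX_OFFSET=0x20000000

With 0x20000000 (512M) the kernel/module split should be back to the old 512M+1.5G layout while keeping KASLR itself enabled.
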
> >>
> >> I am bouncing between feeling close to understanding and being confused.
> >> Konrad suggested xen_cleanhighmap being overly aggressive. But maybe it's
> >> the other way round. The warning that occurs first indicates that the PTE
> >> obtained for some vmalloc mapping is not unused (0) as expected. So it
> >> feels rather like some cleanup has *not* been done.
> >>
> >> Let me think aloud a bit... What seems to cause this is the change of the
> >> kernel/module split from 512M:1.5G to 1G:1G (not exactly, since there are
> >> the 8M of vsyscalls and the 2M hole at the end). Which in vaddr terms means:
> >>
> >> Before:
> >> ffffffff80000000 - ffffffff9fffffff (=512 MB)  kernel text mapping, from phys 0
> >> ffffffffa0000000 - ffffffffff5fffff (=1526 MB) module mapping space
> >>
> >> After:
> >> ffffffff80000000 - ffffffffbfffffff (=1024 MB) kernel text mapping, from phys 0
> >> ffffffffc0000000 - ffffffffff5fffff (=1014 MB) module mapping space
> >>
> >> Now, *if* I got this right, this means the kernel starts on a vaddr that is
> >> pointed at by:
> >>
> >> PGD[510]->PUD[510]->PMD[0]->PTE[0]
> >>
> >> In the old layout the module vaddr area would start in the same PUD area,
> >> but with the change the kernel would cover PUD[510], and the module vaddr +
> >> vsyscalls and the hole would cover PUD[511].
> > 
> > I think there is a fixmap there too?
> 
> Right, they forgot that in Documentation/x86/x86_64/mm... but head_64.S has it.
> So fixmap seems to be in the 2M space before the vsyscalls.
> Btw, apparently I got the PGD index wrong. It is of course 511, not 510.
>
> init_level4_pgt[511]->level3_kernel_pgt[510]->level2_kernel_pgt[0..255]->kernel
>                                                                [256..511]->modules
>                                        [511]->level2_fixmap_pgt[0..505]->modules
>                                                                [506]->fixmap
>                                                                [507..510]->vsyscalls
>                                                                [511]->hole
>
> With the change, level2_kernel_pgt ends up covering the kernel only.
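Just to double-check those indices, a tiny standalone userspace sketch (the shifts and the 512-entry mask are assumed from the usual 4-level x86_64 layout, not pulled from kernel headers) gives the same picture: the kernel under PGD[511]/PUD[510] (level2_kernel_pgt) and the new MODULES_VADDR under PGD[511]/PUD[511] (level2_fixmap_pgt):

    #include <stdio.h>

    #define PGD_SHIFT 39
    #define PUD_SHIFT 30
    #define PMD_SHIFT 21
    /* table index of a virtual address at a given paging level (512 entries) */
    #define IDX(va, shift) (((va) >> (shift)) & 0x1ffUL)

    int main(void)
    {
            unsigned long kernel  = 0xffffffff80000000UL; /* __START_KERNEL_map */
            unsigned long modules = 0xffffffffc0000000UL; /* MODULES_VADDR, 1G split */

            printf("kernel : PGD[%lu] PUD[%lu] PMD[%lu]\n",
                   IDX(kernel, PGD_SHIFT), IDX(kernel, PUD_SHIFT),
                   IDX(kernel, PMD_SHIFT));
            printf("modules: PGD[%lu] PUD[%lu] PMD[%lu]\n",
                   IDX(modules, PGD_SHIFT), IDX(modules, PUD_SHIFT),
                   IDX(modules, PMD_SHIFT));
            return 0;
    }

It prints "kernel : PGD[511] PUD[510] PMD[0]" and "modules: PGD[511] PUD[511] PMD[0]", matching the diagram above.
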
> 
> >>
> >> xen_cleanhighmap operates only on the level2_kernel_pgt, which (speculating
> >> a bit since I am not sure I understand enough details) I believe is the one
> >> PMD pointed at by PGD[510]->PUD[510]. That could mean that before the change
> > 
> > That sounds right.
> > 
> > I don't know if you saw:
> > 
> > 1248 #ifdef DEBUG
> > 1249         /* This is superflous and is not neccessary, but you know what
> > 1250          * lets do it. The MODULES_VADDR -> MODULES_END should be clear of
> > 1251          * anything at this stage. */
> > 1252         xen_cleanhighmap(MODULES_VADDR, roundup(MODULES_VADDR, PUD_SIZE) - 1);
> > 1253 #endif
> > 1254 }
> 
> I saw that, but it would have no effect even when running it, because
> xen_cleanhighmap clamps the pmds it walks over to the level2_kernel_pgt page,
> and MODULES_VADDR is now mapped only from level2_fixmap_pgt.
> Even with the old layout it might do less than anticipated, as it would only
> cover 512M and stop there. But I think it really does not matter.
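To make that clamp concrete, this is roughly what the walk does, paraphrased from memory of arch/x86/xen/mmu.c (so a sketch, not a verbatim quote):

    /* Paraphrased from memory -- not verbatim kernel source. */
    static void __init xen_cleanhighmap(unsigned long vaddr,
                                        unsigned long vaddr_end)
    {
            unsigned long kernel_end = roundup((unsigned long)_brk_end,
                                               PMD_SIZE) - 1;
            pmd_t *pmd = level2_kernel_pgt + pmd_index(vaddr);

            /* Only PMD slots inside the level2_kernel_pgt page are visited, so
             * anything mapped via level2_fixmap_pgt (MODULES_VADDR with the 1G
             * split) is out of reach no matter what range is passed in. */
            for (; vaddr <= vaddr_end && pmd < level2_kernel_pgt + PTRS_PER_PMD;
                 pmd++, vaddr += PMD_SIZE) {
                    if (pmd_none(*pmd))
                            continue;
                    if (vaddr < (unsigned long)_text || vaddr > kernel_end)
                            set_pmd(pmd, __pmd(0));
            }
    }
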
> > 
> > That was me being a bit paranoid, figuring it might help in troubleshooting.
> > If you disable it, does it work?
> > 
> >> xen_cleanhighmap may touch some (the initial 512M) of the module vaddr
> >> space, but not after the change. Maybe that also means it should always
> >> have covered more, but this would not be observed as long as modules did
> >> not claim more than 512M? I still need to check the vaddr ranges for which
> >> xen_cleanhighmap is actually called. The module vaddr space would normally
> >> not be touched (only with DEBUG set). I moved that to be done
> >> unconditionally, but then it might be of no use when it needs to cover a
> >> different PMD...
> > 
> > What does the toolstack say in regard to allocating the memory? It is
> > pretty verbose (domainloginfo..something) in printing out the vaddr of
> > where it stashes the kernel, ramdisk, P2M, and the pagetables (which of
> > course all need to fit within the 512MB, now 1GB, area).
> 
> That is taken from starting a 2G PV domU with pvgrub (not pygrub):
> 
> Xen Minimal OS!
>   start_info: 0xd90000(VA)
>     nr_pages: 0x80000
>   shared_inf: 0xdfe92000(MA)
>      pt_base: 0xd93000(VA)
> nr_pt_frames: 0xb
>     mfn_list: 0x990000(VA)
>    mod_start: 0x0(VA)
>      mod_len: 0
>        flags: 0x0
>     cmd_line:
>   stack:      0x94f860-0x96f860
> MM: Init
>       _text: 0x0(VA)
>      _etext: 0x6000d(VA)
>    _erodata: 0x78000(VA)
>      _edata: 0x80b00(VA)
> stack start: 0x94f860(VA)
>        _end: 0x98fe68(VA)
>   start_pfn: da1
>     max_pfn: 80000
> Mapping memory range 0x1000000 - 0x80000000
> setting 0x0-0x78000 readonly
> 
> 
> For a moment I was puzzled by the use of max_pfn_mapped in the generic
> cleanup_highmap function of 64-bit x86. It limits the cleanup to the start of
> the mfn_list, and the max_pfn_mapped value changes soon after to reflect the
> total amount of memory of the guest.
> Making a copy showed it to be around 51M at the time of cleanup. That
> initially looks suspect, but Xen has already replaced the page tables. The
> compile-time variants would have 2M large pages on the whole
> level2_kernel_pgt range. But as far as I can see, the Xen-provided ones don't
> put in mappings for anything beyond the provided boot stack, which is clean
> in the xen_cleanhighmap.
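For readers following along, the generic helper being discussed, again paraphrased from memory (arch/x86/mm/init_64.c) rather than quoted verbatim, shows where the max_pfn_mapped clamp comes in:

    /* Paraphrased from memory -- not verbatim kernel source. */
    void __init cleanup_highmap(void)
    {
            unsigned long vaddr = __START_KERNEL_map;
            unsigned long vaddr_end = __START_KERNEL_map +
                                      (max_pfn_mapped << PAGE_SHIFT);
            unsigned long end = roundup((unsigned long)_brk_end, PMD_SIZE) - 1;
            pmd_t *pmd = level2_kernel_pgt;

            /* Under Xen, max_pfn_mapped is already set this early (around the
             * start of the mfn_list, ~51M in the trace above), so only that
             * first slice of the kernel mapping window gets walked. */
            for (; vaddr + PMD_SIZE - 1 < vaddr_end; pmd++, vaddr += PMD_SIZE) {
                    if (pmd_none(*pmd))
                            continue;
                    if (vaddr < (unsigned long)_text || vaddr > end)
                            set_pmd(pmd, __pmd(0));
            }
    }
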
> 
> So not much further... but then I think I know what I'll do next. Probably
> should have done it before. I'll replace the WARN_ON in vmalloc that triggers
> with a panic and at least get a crash dump of that situation when it occurs.
> Then I can dig in there with crash (really should have thought of that
> before)...
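For the record, the warning that fires should be the pte_none() check in mm/vmalloc.c:vmap_pte_range(); if that is the right spot, a local debugging hack along these lines (assuming a crashkernel/kdump setup is in place) turns it into a dump -- illustrative only, not a patch proposal:

    /* Local debugging hack in mm/vmalloc.c:vmap_pte_range(), illustrative
     * only: panic instead of warning so kdump captures the state at the
     * moment the stale PTE is seen. */
    if (!pte_none(*pte))
            panic("vmap_pte_range: stale PTE at %lx: %lx\n",
                  addr, (unsigned long)pte_val(*pte));
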

<nods> I dug a bit in the code (arch/x86/xen/mmu.c) but there is nothing there
that screams at me, so I fear I will have to wait until you get the crash
and get some clues from that.
