
Re: [Xen-devel] [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option

To: Boris Ostrovsky <boris.ostrovsky@xxxxxxxxxx>
From: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
Date: Tue, 28 Jul 2015 11:15:28 -0400
Cc: "security@xxxxxxxxxx" <security@xxxxxxxxxx>, Jan Beulich <jbeulich@xxxxxxxx>, Peter Zijlstra <peterz@xxxxxxxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, X86 ML <x86@xxxxxxxxxx>, "linux-kernel@xxxxxxxxxxxxxxx" <linux-kernel@xxxxxxxxxxxxxxx>, Steven Rostedt <rostedt@xxxxxxxxxxx>, Andy Lutomirski <luto@xxxxxxxxxxxxxx>, Borislav Petkov <bp@xxxxxxxxx>, Andy Lutomirski <luto@xxxxxxxxxx>, Sasha Levin <sasha.levin@xxxxxxxxxx>, xen-devel <xen-devel@xxxxxxxxxxxxx>
Delivery-date: Tue, 28 Jul 2015 15:16:30 +0000
List-id: Xen developer discussion <xen-devel.lists.xen.org>

On Tue, Jul 28, 2015 at 10:50:39AM -0400, Boris Ostrovsky wrote:
> On 07/28/2015 10:35 AM, Andrew Cooper wrote:
> >On 28/07/15 15:05, Boris Ostrovsky wrote:
> >>On 07/28/2015 06:29 AM, Andrew Cooper wrote:
> >>>>>After forward-porting my virtio patches, I got this thing to run on
> >>>>>Xen.  After several tries, I got:
> >>>>>
> >>>>>[   53.985707] ------------[ cut here ]------------
> >>>>>[   53.986314] kernel BUG at arch/x86/xen/enlighten.c:496!
> >>>>>[   53.986677] invalid opcode: 0000 [#1] SMP
> >>>>>[   53.986677] Modules linked in:
> >>>>>[   53.986677] CPU: 0 PID: 1400 Comm: bash Not tainted 4.2.0-rc4+ #4
> >>>>>[   53.986677] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> >>>>>BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org
> >>>>>04/01/2014
> >>>>>[   53.986677] task: c2376180 ti: c0874000 task.ti: c0874000
> >>>>>[   53.986677] EIP: 0061:[<c10530f2>] EFLAGS: 00010282 CPU: 0
> >>>>>[   53.986677] EIP is at set_aliased_prot+0xb2/0xc0
> >>>>>[   53.986677] EAX: ffffffea EBX: cc3d1000 ECX: 0672e063 EDX: 80000000
> >>>>>[   53.986677] ESI: 00000000 EDI: 80000000 EBP: c0875e94 ESP: c0875e74
> >>>>>[   53.986677]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0069
> >>>>>[   53.986677] CR0: 80050033 CR2: b77404d4 CR3: 020b6000 CR4: 00042660
> >>>>>[   53.986677] Stack:
> >>>>>[   53.986677]  80000000 0672e063 000021c0 cc3d1000 00000001 cc3d2000
> >>>>>00000b4a 00000200
> >>>>>[   53.986677]  c0875ea8 c105312d c2317940 c2373a80 00000000 c0875eb4
> >>>>>c1062310 c01861c0
> >>>>>[   53.986677]  c0875ec0 c1062735 c01861c0 c0875ed4 c10a764e c7007a00
> >>>>>c2373a80 00000000
> >>>>>[   53.986677] Call Trace:
> >>>>>[   53.986677]  [<c105312d>] xen_free_ldt+0x2d/0x40
> >>>>>[   53.986677]  [<c1062310>] free_ldt_struct.part.1+0x10/0x40
> >>>>>[   53.986677]  [<c1062735>] destroy_context+0x25/0x40
> >>>>>[   53.986677]  [<c10a764e>] __mmdrop+0x1e/0xc0
> >>>>>[   53.986677]  [<c10c9858>] finish_task_switch+0xd8/0x1a0
> >>>>>[   53.986677]  [<c1863736>] __schedule+0x316/0x950
> >>>>>[   53.986677]  [<c1863d96>] schedule+0x26/0x70
> >>>>>[   53.986677]  [<c10ac613>] do_wait+0x1b3/0x200
> >>>>>[   53.986677]  [<c10ac9d7>] SyS_waitpid+0x67/0xd0
> >>>>>[   53.986677]  [<c10aa820>] ? task_stopped_code+0x50/0x50
> >>>>>[   53.986677]  [<c186717a>] syscall_call+0x7/0x7
> >>>>>[   53.986677] Code: e8 c1 e3 0c 81 eb 00 00 00 40 39 5d ec 74 11 8b
> >>>>>4d e4 8b 55 e0 31 f6 e8 dd e0 fa ff 85 c0 75 0d 83 c4 14 5b 5e 5f 5d
> >>>>>c3 90 0f 0b <0f> 0b 0f 0b 8d 76 00 8d bc 27 00 00 00 00 85 d2 74 31 55
> >>>>>89 e5
> >>>>>[   53.986677] EIP: [<c10530f2>] set_aliased_prot+0xb2/0xc0 SS:ESP
> >>>>>0069:c0875e74
> >>>>>[   54.010069] ---[ end trace 89ac35b29c1c59bb ]---
> >>>>>
> >>>>>Is that the error you're seeing?
> >>>>>
> >>>>>If I change xen_free_ldt to:
> >>>>>
> >>>>>static void xen_free_ldt(struct desc_struct *ldt, unsigned entries)
> >>>>>{
> >>>>>      const unsigned entries_per_page = PAGE_SIZE / LDT_ENTRY_SIZE;
> >>>>>      int i;
> >>>>>
> >>>>>      vm_unmap_aliases();
> >>>>>      xen_mc_flush();
> >>>>>
> >>>>>      for (i = 0; i < entries; i += entries_per_page)
> >>>>>          set_aliased_prot(ldt + i, PAGE_KERNEL);
> >>>>>}
> >>>>>
> >>>>>then it works.  I don't know why this makes a difference.
> >>>>>(xen_mc_flush makes a little bit of sense to me.  vm_unmap_aliases
> >>>>>doesn't.)
> >>>>>
> >>>>That fix makes sense if there's some way that the vmalloc area we're
> >>>>freeing has an extra alias somewhere, which is very much possible.  On
> >>>>the other hand, I don't see how this happens without first doing an
> >>>>MMUEXT_SET_LDT with an unexpectedly aliased address, and I would have
> >>>>expected that to blow up and/or result in test case failures.
> >>>>
> >>>>But I'm still confused, because it seems like Xen will never populate
> >>>>the actual (hidden) LDT mapping unless the pages backing it are
> >>>>unaliased and well-formed, which makes me wonder why this stuff ever
> >>>>worked.  Wouldn't LDT access with pre-existing vmalloc aliases result
> >>>>in segfaults?
> >>>>
> >>>>The semantics seem to be very odd.  xen_free_ldt with an aliased
> >>>>address might fail (and OOPS), but actual access to the LDT with an
> >>>>aliased address page faults.
> >>>>
> >>>>Also, using kzalloc for everything fixes the problem, which suggests
> >>>>that there really is something to my theory that the problem involves
> >>>>unexpected aliases.
> >>>Xen does lazily populate the LDT frames.  The first time a page is ever
> >>>referenced via the LDT, Xen will perform a typechange.
> >>>
> >>>Under Xen, guest mappings are reference counted with both a plain
> >>>reference, and a type count.  Types of writeable, segdesc and pagetables
> >>>are mutually exclusive.  This prevents the guest from having writeable
> >>>mappings of interesting datastructures, but readable mappings are fine.
> >>>Typechanges may only occur when the type reference count is 0.
> >>>
> >>>At the point of the typechange, no writeable mappings of the frame may
> >>>exist (and it must not be referenced by a L2 or greater page directory),
> >>>or the typechange will fail.  Additionally the descriptors are audited
> >>>at this point, so if Xen objects to any of the descriptors in the same
> >>>page, the typechange will also fail.
> >>>
> >>>If the typechange fails, the pagefault gets propagated back to the
> >>>guest.
> >>>
> >>>The corollary to this is that, for xen_free_ldt() to create writeable
> >>>mappings again, a typechange back to writeable is needed.  This will
> >>>fail if the LDT frames are still referenced in any vcpu's LDT.
> >>>
> >>>It would be interesting to know which of the two BUG()s in
> >>>set_aliased_prot() tripped.
> >>The first one (i.e. not the alias)
> >>
> >In which case the page in question is still referenced in an LDT
> >(perhaps on a different vcpu)
> 
> The problem is reproducible on a UP guest so it's not that.

The Linux kernel does a bunch of lazy maps and unmaps, and we
may be getting an interrupt while the lazy unmap hasn't been
called (arch_leave_lazy_mmu_mode).

Having the vm_unmap_aliases and then xen_mc_flush (which is what
arch_leave_lazy_mmu_mode ends up doing too, and more) would solve it.

Though I would have thought that vm_unmap_aliases would call
arch_leave_lazy_mmu_mode.
> 
> >or has been reused as a pagetable (I
> >really hope this is not the case).
> >
> >A sufficiently-debug Xen might be persuaded into telling you exactly
> >what it didn't like about the attempted transition.
> 
> It just can't find l1 entry for the LDT address in __do_update_va_mapping().

Which would imply that it has not been written in. Which corresponds
to the set_aliased_prot hitting the first BUG_ON.

The xen_mc_flush() also triggers the batched hypercalls - which means we
may have some hypercalls that have not yet gone to the hypervisor and
then we try to do an LDT hypercall (not batched).
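
To make the ordering hazard concrete, here is a userspace toy model. It is
only a sketch, not kernel code: batch_queue, batch_flush,
immediate_requires_ro and the TOY_* names are invented for illustration and
do not exist in the kernel or in Xen. It assumes a queued operation has no
effect on "hypervisor" state until the batch is flushed, while an unbatched
operation looks at that state right away.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_BATCH_MAX 32   /* illustrative; echoes the idea of MC_BATCH */
#define TOY_NPAGES    4

/* Toy "hypervisor" view of each page: writable or read-only. */
enum toy_prot { TOY_RW, TOY_RO };
static enum toy_prot hv_prot[TOY_NPAGES];   /* zero-initialised: all TOY_RW */

/* Operations queued by the guest but not yet seen by the hypervisor. */
struct toy_op { size_t page; enum toy_prot prot; };
static struct toy_op batch[TOY_BATCH_MAX];
static size_t batch_len;

/* Issue every queued operation, in order, then empty the queue. */
void batch_flush(void)
{
    for (size_t i = 0; i < batch_len; i++)
        hv_prot[batch[i].page] = batch[i].prot;
    batch_len = 0;
}

/* Queue a protection change; the hypervisor state is untouched until a flush. */
void batch_queue(size_t page, enum toy_prot prot)
{
    assert(page < TOY_NPAGES);
    if (batch_len == TOY_BATCH_MAX)
        batch_flush();
    batch[batch_len].page = page;
    batch[batch_len].prot = prot;
    batch_len++;
}

/*
 * An unbatched operation that only succeeds if the hypervisor already
 * sees the page as read-only. Returns 0 on success, -1 on failure.
 */
int immediate_requires_ro(size_t page)
{
    assert(page < TOY_NPAGES);
    return hv_prot[page] == TOY_RO ? 0 : -1;
}
```

In this model, calling batch_queue(0, TOY_RO) followed directly by
immediate_requires_ro(0) fails, because the change is still sitting in the
queue. Calling batch_flush() in between makes it succeed. Whether that is
what is actually happening in xen_free_ldt is still a guess.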

You could try building with this debug:


diff --git a/arch/x86/xen/multicalls.c b/arch/x86/xen/multicalls.c
index ea54a08..5d214ce 100644
--- a/arch/x86/xen/multicalls.c
+++ b/arch/x86/xen/multicalls.c
@@ -28,9 +28,9 @@
 #include "multicalls.h"
 #include "debugfs.h"
 
-#define MC_BATCH       32
+#define MC_BATCH       1
 
-#define MC_DEBUG       0
+#define MC_DEBUG       1
 
 #define MC_ARGS                (MC_BATCH * 16)
 
> 
> -boris
> 
