
Re: [Xen-devel] More questions about Xen memory layout/usage, access to guest memory


  • To: "Johnson, Ethan" <ejohns48@xxxxxxxxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
  • Date: Sat, 17 Aug 2019 12:04:04 +0100

On 16/08/2019 20:51, Johnson, Ethan wrote:
> Hi all,
>
> I have some follow-up questions about Xen's usage and layout of memory, 
> building on the ones I asked here a few weeks ago (which were quite 
> helpfully answered: see 
> https://lists.xenproject.org/archives/html/xen-devel/2019-07/msg01513.html 
> for reference). For context on why I'm asking these questions, I'm using 
> Xen as a research platform for enforcing novel memory protection schemes 
> on hypervisors and guests.
>
> 1. Xen itself lives in the memory region from (on x86-64) 0xffff 8000 
> 0000 0000 - 0xffff 87ff ffff ffff, regardless of whether it's in PV mode 
> or HVM/PVH. Clearly, in PV mode a separate set of page tables (i.e. CR3 
> root pointer) must be used for each guest.

More than that.  Each vCPU.

PV guests manage their own pagetables, and have a vCR3 which the guest
kernel controls, and we must honour.

For 64bit PV guests, each time a new L4 pagetable is created, Xen sets
up its own 16 slots appropriately.  As a result, Xen itself is able to
function on any pagetable hierarchy the PV guest creates.  See
init_xen_l4_slots(), which does this.
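
As a rough illustration of the idea (this is not the actual
init_xen_l4_slots() implementation; the slot constants and the
reference L4 are simplified stand-ins), copying Xen's reserved slots
into a freshly created guest L4 looks roughly like this:

  /* Simplified sketch: when a PV guest creates a new L4, copy Xen's
   * reserved slots from a reference L4, so the hypervisor stays mapped
   * in every hierarchy the guest builds.  Constants are illustrative. */
  #include <stdint.h>
  #include <string.h>

  #define XEN_FIRST_SLOT  256   /* assumed start of Xen's private slots */
  #define XEN_NR_SLOTS     16   /* Xen reserves 16 L4 entries for itself */

  typedef uint64_t l4_pgentry_t;

  static void init_guest_l4_slots(l4_pgentry_t *new_l4,
                                  const l4_pgentry_t *xen_ref_l4)
  {
      /* The remaining 496 entries stay under guest control, subject to
       * the usual PV pagetable validation. */
      memcpy(&new_l4[XEN_FIRST_SLOT], &xen_ref_l4[XEN_FIRST_SLOT],
             XEN_NR_SLOTS * sizeof(l4_pgentry_t));
  }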

For 32bit PV guests, things are a tad more complicated.  Each vCR3 is
actually a PAE-quad of pagetable entries.  Because Xen is still
operating in 64bit mode with 4-level paging, we enforce that guests
allocate a full 4k page for the pagetable (rather than the 32 bytes it
would normally be).

In Xen, we allocate what is called a monitor table, which is per-vcpu
(set up with all the correct details for Xen), and we rewrite slot 0
each time the vCPU changes vCR3.
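
A minimal sketch of that vCR3 switch, using made-up helper and field
names rather than Xen's real ones:

  /* Illustrative only: on a 32bit PV vCR3 load, rewrite slot 0 of the
   * per-vCPU monitor L4 to point at the (full 4k) PAE top level the
   * guest just installed.  The rest of the monitor table is fixed. */
  #include <stdint.h>

  typedef uint64_t l4_pgentry_t;
  typedef uint64_t mfn_t;

  struct vcpu_monitor {
      l4_pgentry_t *monitor_l4;   /* Xen's own slots already populated */
  };

  static l4_pgentry_t make_l4e(mfn_t mfn, uint64_t flags)
  {
      return (mfn << 12) | flags;   /* frame number plus PTE flags */
  }

  static void switch_vcr3_32bit(struct vcpu_monitor *v, mfn_t pae_top)
  {
      v->monitor_l4[0] = make_l4e(pae_top, 0x1 /* present */);
      /* A real implementation would then flush/reload CR3. */
  }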


Not related to this question, but important for future answers.  All
pagetables are actually at a minimum per-domain, because we have
per-domain mappings to simplify certain tasks.  Contained within these
are various structures, including the hypercall compatibility
translation area.  This per-domain restriction can in principle be
lifted if we alter the way Xen chooses to lay out its memory.

> Is that also true of the host 
> (non-extended, i.e. CR3 in VMX root mode) page tables when an HVM/PVH 
> guest is running?

Historical context is important to answer this question.

When the first HVM support came along, there was no EPT or NPT in
hardware.  Hypervisors were required to virtualise the guest's
pagetable structure, which is called Shadow Paging in Xen.  The shadow
pagetables themselves are organised per-domain so as to form a single
coherent guest physical address space, but a CPU operating in non-root
mode still needed its real CR3 pointing at the shadow of whichever
logical vCPU's CR3 was being virtualised.

In practice, we still allocate a monitor pagetable per vcpu for HVM
guests, even with HAP support.  I can't think of any restrictions which
would prevent us from doing this differently.

> Or is the dom0 page table left in place, assuming the 
> dom0 is PV, when an HVM/PVH guest is running, since extended paging is 
> now being used to provide the guest's view of memory? Does this change 
> if the dom0 is PVH?

Here is some (prototype) documentation prepared since your last round of
questions.

https://andrewcoop-xen.readthedocs.io/en/docs-devel/admin-guide/introduction.html

Dom0 is just a VM, like every other domU in the system.  There is
nothing special about how it is virtualised.

Dom0 defaults to having full permissions, so can successfully issue a
whole range of more interesting hypercalls, but you could easily create
dom1, set the is_priv boolean in Xen, and give dom1 all the same
permissions that dom0 has, if you wished.
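
A minimal sketch of how such a privilege check gates an operation (the
structure and names below are simplified stand-ins; real Xen routes
such checks through XSM and its privilege flag):

  #include <stdbool.h>
  #include <errno.h>

  struct domain {
      unsigned int domain_id;
      bool is_privileged;   /* true for dom0 by default; could equally
                             * be set on a "dom1" control domain */
  };

  static long do_toolstack_only_op(const struct domain *d)
  {
      if ( !d->is_privileged )
          return -EPERM;    /* ordinary domUs are refused */

      /* ... perform the privileged operation ... */
      return 0;
  }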

> Or, to ask this from another angle: is there ever anything *but* Xen 
> living in the host-virtual address space when an HVM/PVH guest is 
> active?

No, although it depends on how you classify Xen's directmap in this
context (the directmap aliases all host memory, guest RAM included).

> And is the answer to this different depending on whether the 
> HVM/PVH guest is a domU vs. a PVH dom0?

Dom0 vs domU has no relevance to the question.

> 2. Do the mappings in Xen's slice of the host-virtual address space 
> differ at all between the host page tables corresponding to different 
> guests?

No (ish).

Xen has a mostly flat address space, so most of the mappings are the
same.  There is a per-domain mapping slot which is common to every
vCPU in a domain but differs across domains, a self-linear map for
easy modification of the PTEs in the current pagetable hierarchy, and
a shadow-linear map for easy modification of the shadow PTEs of guests
which don't have Xen in their address space at all.
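
For reference, the self-linear map is the classic recursive pagetable
trick: one L4 slot points back at the L4 itself, so every PTE of the
current hierarchy shows up at a predictable virtual address.  A sketch
of the address calculation, using an illustrative slot number rather
than Xen's actual constant:

  #include <stdint.h>

  #define LINEAR_SLOT  258ULL   /* assumed self-linear L4 slot */
  #define PAGE_SHIFT   12
  #define PTE_SIZE     8ULL

  /* Virtual address at which the L1 entry mapping 'va' can itself be
   * read or written, given the recursive slot above. */
  static uint64_t linear_l1e_addr(uint64_t va)
  {
      uint64_t base = 0xFFFF000000000000ULL | (LINEAR_SLOT << 39);
      uint64_t idx  = (va & 0x0000FFFFFFFFFFFFULL) >> PAGE_SHIFT;

      return base + idx * PTE_SIZE;
  }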

> If the mappings are in fact the same, does Xen therefore share 
> lower-level page table pages between the page tables corresponding to 
> different guests?

We have many different L4s (the monitor tables, every L4 a PV guest
has allocated) which can run Xen.  Most parts of Xen's address space
converge at L3 (the M2P, the directmap, Xen
text/data/bss/fixmap/vmap/heaps/misc), and are common to all contexts.

The per-domain mapping converges at L3 and is shared between vCPUs of
the same guest, but not across guests.

One aspect I haven't really covered is XPTI, the Meltdown mitigation
for PV guests.  Here, we have a per-CPU private pagetable which ends
up being a merge of most of the guest's L4, but with some
pre-constructed CPU-private pagetable hierarchy to hide the majority
of data in the Xen region.

> Is any of this different for PV vs. HVM/PVH?

PV guests control their own parts of their address space, and can do
largely whatever they choose.  HVM contexts have nothing in the lower
canonical half, but do have an extended directmap (which in practice
only makes a difference on a >5TB machine).

> 3. Under what circumstances, and for what purposes, does Xen use its 
> ability to access guest memory through its direct map of host-physical 
> memory?

That is a very broad question, and currently has the unfortunate
answer of "whenever speculation goes awry in an attacker's favour."
There are steps under way to reduce the usage of the directmap so we
can run without it, and prevent this kind of leakage.

As for when Xen would normally access memory, the most common answer is
for hypercall parameters which mostly use a virtual address based ABI. 
Also, any time we need to emulate an instruction, we need to read a fair
amount of guest state, including reading the instruction under %rip.
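
To illustrate why that ABI pulls Xen into guest memory (the helper
below is a stand-in, not Xen's real copy_from_guest() machinery): the
guest hands over a virtual address, and Xen must copy the structure
out through the guest's own address space.

  #include <stddef.h>
  #include <stdint.h>
  #include <errno.h>

  struct example_args {
      uint64_t op;
      uint64_t flags;
  };

  /* Stub of an assumed primitive: copy 'len' bytes from a guest
   * virtual address, returning the number of bytes NOT copied.  A
   * real hypervisor walks the guest pagetables (or uses a directmap
   * alias) and must tolerate faults. */
  static unsigned long copy_from_guest_va(void *dst, uint64_t guest_va,
                                          size_t len)
  {
      (void)dst; (void)guest_va; (void)len;
      return 0;
  }

  static long do_example_hypercall(uint64_t guest_arg_va)
  {
      struct example_args args;

      if ( copy_from_guest_va(&args, guest_arg_va, sizeof(args)) )
          return -EFAULT;   /* the guest passed a bad pointer */

      /* ... act on args.op / args.flags ... */
      return 0;
  }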

> Similarly, to what extent does the dom0 (or other such 
> privileged domain) utilize "foreign memory maps" to reach into another 
> guest's memory? I understand that this is necessary when creating a 
> guest, for live migration, and for QEMU to emulate stuff for HVM guests; 
> but for PVH, is it ever necessary for Xen or the dom0 to "forcibly" 
> access a guest's memory?

I'm not sure what you mean by forcibly.  Dom0 has the ability to do so,
if it chooses.  There is no "force" about it.

Debuggers and/or introspection are other reasons why dom0 might choose
to map guest RAM, but I think you've covered the common cases.

> (I ask because the research project I'm working on is seeking to protect 
> guests from a compromised hypervisor and dom0, so I need to limit 
> outside access to a guest's memory to explicitly shared pages that the 
> guest will treat as untrusted - not storing any secrets there, vetting 
> input as necessary, etc.)

Sorry to come along with roadblocks, but how on earth do you intend to
prevent a compromised Xen from accessing guest memory?  A compromised
Xen can do almost anything it likes, and without recourse.  This is
ultimately why technologies such as Intel SGX or AMD Secure Encrypted
Virtualization (SEV) are coming along, because only the hardware
itself is in a position to isolate an untrusted hypervisor/kernel from
guest data.

For dom0, that's perhaps easier.  You could reference-count the number
of foreign mappings into the domain as it is created, and refuse to
unpause the guest's vCPUs until the foreign map count has dropped to 0.

> 4. What facilities/processes does Xen provide for PV(H) guests to 
> explicitly/voluntarily share memory pages with Xen and other domains 
> (dom0, etc.)? From what I can gather from the documentation, it sounds 
> like "grant tables" are involved in this - is that how a PV-aware guest 
> is expected to set up shared memory regions for communication with other 
> domains (ring buffers, etc.)?

Yes.  Grant Tables is Xen's mechanism for the coordinated setup of
shared memory between two consenting domains.
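
As a rough sketch of the guest's side (the v1 entry layout follows
xen/include/public/grant_table.h, but how the shared grant-table page
gets mapped, and how grant references get allocated, are simplified
assumptions here):

  #include <stdint.h>

  #define GTF_permit_access  1   /* peer may map/access this frame */

  typedef uint16_t domid_t;

  struct grant_entry_v1 {
      uint16_t flags;   /* GTF_*; entry live once permit_access is set */
      domid_t  domid;   /* domain allowed to use this entry */
      uint32_t frame;   /* frame number being shared */
  };

  static void grant_frame(struct grant_entry_v1 *gnttab,
                          unsigned int ref, domid_t peer, uint32_t frame)
  {
      gnttab[ref].domid = peer;
      gnttab[ref].frame = frame;
      /* Publish the flags last (with a release barrier) so the peer
       * never observes a half-initialised entry. */
      __atomic_store_n(&gnttab[ref].flags, (uint16_t)GTF_permit_access,
                       __ATOMIC_RELEASE);
  }

The grant reference and any ring layout details are then typically
advertised to the peer via Xenstore.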

> Does a PV(H) guest need to voluntarily 
> establish all external access to its pages, or is there ever a situation 
> where it's the other way around - where Xen itself establishes/defines a 
> region as shared and the guest is responsible for treating it accordingly?

During domain construction, two grants/events are constructed
automatically.  One is for the xenstore ring, and one is for the
console ring.  The latter exists so the guest can get debugging output
out from very early code, while both are, in practice, done like this
because the guest has no a priori way to establish the grants/events
itself.

For all other shared interfaces, the guests are expected to negotiate
which grants/events/rings/details to use via Xenstore.

> Again, this mostly boils down to: under what circumstances, if ever, 
> does Xen ever "force" access to any part of a guest's memory? 
> (Particularly for PV(H). Clearly that must happen for HVM since, by 
> definition, the guest is unaware there's a hypervisor controlling its 
> world and emulating hardware behavior, and thus is in no position to 
> cooperatively/voluntarily give the hypervisor and dom0 access to its 
> memory.)

There are cases for all guest types where Xen will need to emulate
instructions.  Xen will access guest memory in order to perform
architecturally correct actions, which generally starts with reading
the instruction under %rip.

For PV guests, this is almost entirely restricted to guest-kernel
operations which are privileged in nature: access to MSRs, writes to
pagetables, etc.

For HVM and PVH guests, while PVH means "HVM without Qemu", it doesn't
mean a complete absence of emulation.  The Local APIC is emulated by
Xen in most cases, as a bare minimum, but, for example, the LMSW
instruction on AMD hardware doesn't have any intercept decoding to
help the hypervisor out when a guest uses the instruction.

~Andrew
