Re: Flask vs paging mempool - Was: [xen-unstable test] 174809: regressions - trouble: broken/fail/pass
On 11/18/22 16:10, Jason Andryuk wrote:
> On Fri, Nov 18, 2022 at 12:22 PM Andrew Cooper <Andrew.Cooper3@xxxxxxxxxx> wrote:
>> On 18/11/2022 14:39, Roger Pau Monne wrote:
>>> Nov 18 01:55:11.753936 (XEN) arch/x86/mm/hap/hap.c:304: d1 failed to allocate from HAP pool
>>> Nov 18 01:55:18.633799 (XEN) Failed to shatter gfn 7ed37: -12
>>> Nov 18 01:55:18.633866 (XEN) d1v0 EPT violation 0x19c (--x/rw-) gpa 0x0000007ed373a1 mfn 0x33ed37 type 0
>>> Nov 18 01:55:18.645790 (XEN) d1v0 Walking EPT tables for GFN 7ed37:
>>> Nov 18 01:55:18.645850 (XEN) d1v0  epte 9c0000047eba3107
>>> Nov 18 01:55:18.645893 (XEN) d1v0  epte 9c000003000003f3
>>> Nov 18 01:55:18.645935 (XEN) d1v0  --- GLA 0x7ed373a1
>>> Nov 18 01:55:18.657783 (XEN) domain_crash called from arch/x86/hvm/vmx/vmx.c:3758
>>> Nov 18 01:55:18.657844 (XEN) Domain 1 (vcpu#0) crashed on cpu#8:
>>> Nov 18 01:55:18.669781 (XEN) ----[ Xen-4.17-rc  x86_64  debug=y  Not tainted ]----
>>> Nov 18 01:55:18.669843 (XEN) CPU:    8
>>> Nov 18 01:55:18.669884 (XEN) RIP:    0020:[<000000007ed373a1>]
>>> Nov 18 01:55:18.681711 (XEN) RFLAGS: 0000000000010002   CONTEXT: hvm guest (d1v0)
>>> Nov 18 01:55:18.681772 (XEN) rax: 000000007ed373a1   rbx: 000000007ed3726c   rcx: 0000000000000000
>>> Nov 18 01:55:18.693713 (XEN) rdx: 000000007ed2e610   rsi: 0000000000008e38   rdi: 000000007ed37448
>>> Nov 18 01:55:18.693775 (XEN) rbp: 0000000001b410a0   rsp: 0000000000320880   r8:  0000000000000000
>>> Nov 18 01:55:18.705725 (XEN) r9:  0000000000000000   r10: 0000000000000000   r11: 0000000000000000
>>> Nov 18 01:55:18.717733 (XEN) r12: 0000000000000000   r13: 0000000000000000   r14: 0000000000000000
>>> Nov 18 01:55:18.717794 (XEN) r15: 0000000000000000   cr0: 0000000000000011   cr4: 0000000000000000
>>> Nov 18 01:55:18.729713 (XEN) cr3: 0000000000400000   cr2: 0000000000000000
>>> Nov 18 01:55:18.729771 (XEN) fsb: 0000000000000000   gsb: 0000000000000000   gss: 0000000000000002
>>> Nov 18 01:55:18.741711 (XEN) ds: 0028   es: 0028   fs: 0000   gs: 0000   ss: 0028   cs: 0020
>>>
>>> It seems to be related to the paging pool, adding Andrew and Henry so that he is aware.
>>
>> Summary of what I've just given on IRC/Matrix.
>>
>> This crash is caused by two things.
>>
>> First,
>>
>>   (XEN) FLASK: Denying unknown domctl: 86.
>>
>> because I completely forgot to wire up Flask for the new hypercalls.  But so did the original XSA-409 fix (as SECCLASS_SHADOW is behind CONFIG_X86), so I don't feel quite as bad about this.
>
> Broken for ARM, but not for x86, right?  I think SECCLASS_SHADOW is available in the policy bits - it's just whether or not the hook functions are available?
>
>> And second because libxl ignores the error it gets back, and blindly continues onward.  Anthony has posted "libs/light: Propagate libxl__arch_domain_create() return code" to fix the libxl half of the bug, and I posted a second libxl bugfix to fix an error message.  Both are very simple.
>>
>> For Flask, we need new access vectors because this is a common hypercall, but I'm unsure how to interlink it with x86's shadow control.  This will require a bit of pondering, but it is probably easier to just leave them unlinked.
>
> It sort of seems like it could go under domain2 since domain/domain2 have most of the memory stuff, but it is non-PV.  shadow has its own set of hooks.  It could go in hvm which already has some memory stuff.
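
For context, "wiring up Flask" for the new domctls amounts to declaring access vectors for them in xen/xsm/flask/policy/access_vectors and mapping the new domctl numbers to those vectors in the flask_domctl() switch in xsm/flask/hooks.c, so they stop falling through to the "Denying unknown domctl" path seen in the log above. A rough sketch of the hook side only, using the domain2 class as in the suggestion above; the domctl and vector names here are assumptions for illustration, not the actual fix:

    /* Fragment of the flask_domctl() switch.  current_has_perm() checks the
     * calling domain's permission on the target domain d for the given
     * class/vector pair.  The DOMAIN2__* names are illustrative only. */
    case XEN_DOMCTL_get_paging_mempool_size:
        return current_has_perm(d, SECCLASS_DOMAIN2, DOMAIN2__GET_PAGING_MEMPOOL_SIZE);
    case XEN_DOMCTL_set_paging_mempool_size:
        return current_has_perm(d, SECCLASS_DOMAIN2, DOMAIN2__SET_PAGING_MEMPOOL_SIZE);

Whichever class is chosen (domain, domain2, hvm), the shape of the change is the same; only the class constant and the vector names differ.
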
Since the new hypercall is for managing a memory pool for any domain, though HVM is the only one supported today, imho it belongs under domain/domain2.

Something to consider is that there is another guest memory pool that is managed, the PoD pool, which has a dedicated privilege for it. This leads me to the question of whether access to manage the PoD pool and the paging pool size should be separate accesses, or whether they should fall under the same access. IMHO it should be the latter, as I can see no benefit in disaggregating access to the PoD pool and the paging pool. In fact, I find myself asking whether the managing domain should simply have control over the size of any backing memory pool for the target domain; I see no benefit in discriminating between which backing memory pool a managing domain may manage. With that said, I am open to being convinced otherwise.

Since this is an XSA fix that will be backported, moving the get/set PoD hypercalls under a new permission would be too disruptive. I would recommend introducing the permissions set/getmempools under the domain access vector, which will only control access to the paging pool. Planning can then occur for 4.18 to look at transitioning get/set PoD target to being controlled via get/setmempools.

>> Flask is listed as experimental which means it doesn't technically matter if we break it, but it is used by OpenXT so not fixing it for 4.17 would be rather rude.
>
> It's definitely nicer to have functional Flask in the release.  OpenXT can use a backport if necessary, so it doesn't need to be a release blocker.  Having said that, Flask is a nice feature of Xen, so it would be good to have it functioning in 4.17.

As maintainer I would really prefer not to see 4.17 go out with any part of XSM broken. While it is considered experimental, a status I hope to rectify, it is a long-standing feature that has been kept stable and has a sizeable user base. IMHO it deserves a proper fix before release.

V/r,
Daniel P. Smith
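
As a footnote on the other half of the crash, libxl ignoring the error it gets back and blindly continuing, the shape of that fix is the usual check-and-propagate pattern. A minimal sketch only, not Anthony's actual patch, with the real argument list elided:

    int rc;

    rc = libxl__arch_domain_create(/* gc, d_config, ..., domid */);
    if (rc) {
        /* Previously the return value was dropped here, so a failure while
         * setting up the paging pool went unnoticed and the domain build
         * carried on until the guest crashed as in the log above. */
        goto out;
    }

Propagating rc lets the caller unwind the partially built domain and report the failure, instead of handing the guest an undersized paging pool.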