Re: Flask vs paging mempool - Was: [xen-unstable test] 174809: regressions - trouble: broken/fail/pass
On 11/18/22 16:10, Jason Andryuk wrote:
> On Fri, Nov 18, 2022 at 12:22 PM Andrew Cooper <Andrew.Cooper3@xxxxxxxxxx> wrote:
>> On 18/11/2022 14:39, Roger Pau Monne wrote:
>>> Nov 18 01:55:11.753936 (XEN) arch/x86/mm/hap/hap.c:304: d1 failed to allocate from HAP pool
>>> Nov 18 01:55:18.633799 (XEN) Failed to shatter gfn 7ed37: -12
>>> Nov 18 01:55:18.633866 (XEN) d1v0 EPT violation 0x19c (--x/rw-) gpa 0x0000007ed373a1 mfn 0x33ed37 type 0
>>> Nov 18 01:55:18.645790 (XEN) d1v0 Walking EPT tables for GFN 7ed37:
>>> Nov 18 01:55:18.645850 (XEN) d1v0  epte 9c0000047eba3107
>>> Nov 18 01:55:18.645893 (XEN) d1v0  epte 9c000003000003f3
>>> Nov 18 01:55:18.645935 (XEN) d1v0  --- GLA 0x7ed373a1
>>> Nov 18 01:55:18.657783 (XEN) domain_crash called from arch/x86/hvm/vmx/vmx.c:3758
>>> Nov 18 01:55:18.657844 (XEN) Domain 1 (vcpu#0) crashed on cpu#8:
>>> Nov 18 01:55:18.669781 (XEN) ----[ Xen-4.17-rc  x86_64  debug=y  Not tainted ]----
>>> Nov 18 01:55:18.669843 (XEN) CPU:    8
>>> Nov 18 01:55:18.669884 (XEN) RIP:    0020:[<000000007ed373a1>]
>>> Nov 18 01:55:18.681711 (XEN) RFLAGS: 0000000000010002   CONTEXT: hvm guest (d1v0)
>>> Nov 18 01:55:18.681772 (XEN) rax: 000000007ed373a1   rbx: 000000007ed3726c   rcx: 0000000000000000
>>> Nov 18 01:55:18.693713 (XEN) rdx: 000000007ed2e610   rsi: 0000000000008e38   rdi: 000000007ed37448
>>> Nov 18 01:55:18.693775 (XEN) rbp: 0000000001b410a0   rsp: 0000000000320880   r8:  0000000000000000
>>> Nov 18 01:55:18.705725 (XEN) r9:  0000000000000000   r10: 0000000000000000   r11: 0000000000000000
>>> Nov 18 01:55:18.717733 (XEN) r12: 0000000000000000   r13: 0000000000000000   r14: 0000000000000000
>>> Nov 18 01:55:18.717794 (XEN) r15: 0000000000000000   cr0: 0000000000000011   cr4: 0000000000000000
>>> Nov 18 01:55:18.729713 (XEN) cr3: 0000000000400000   cr2: 0000000000000000
>>> Nov 18 01:55:18.729771 (XEN) fsb: 0000000000000000   gsb: 0000000000000000   gss: 0000000000000002
>>> Nov 18 01:55:18.741711 (XEN) ds: 0028   es: 0028   fs: 0000   gs: 0000   ss: 0028   cs: 0020
>>>
>>> It seems to be related to the paging pool, adding Andrew and Henry so that he is aware.
>>
>> Summary of what I've just given on IRC/Matrix.
>>
>> This crash is caused by two things.
>>
>> First,
>>
>>   (XEN) FLASK: Denying unknown domctl: 86.
>>
>> because I completely forgot to wire up Flask for the new hypercalls.  But so did the original XSA-409 fix (as SECCLASS_SHADOW is behind CONFIG_X86), so I don't feel quite as bad about this.
>
> Broken for ARM, but not for x86, right?  I think SECCLASS_SHADOW is available in the policy bits - it's just whether or not the hook functions are available?
>
>> And second because libxl ignores the error it gets back, and blindly continues onward.  Anthony has posted "libs/light: Propagate libxl__arch_domain_create() return code" to fix the libxl half of the bug, and I posted a second libxl bugfix to fix an error message.  Both are very simple.
>>
>> For Flask, we need new access vectors because this is a common hypercall, but I'm unsure how to interlink it with x86's shadow control.  This will require a bit of pondering, but it is probably easier to just leave them unlinked.
>
> It sort of seems like it could go under domain2 since domain/domain2 have most of the memory stuff, but it is non-PV.  shadow has its own set of hooks.  It could go in hvm which already has some memory stuff.
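
For context, "wiring up Flask" for the new domctls amounts to declaring access vectors for them in xen/xsm/flask/policy/access_vectors and mapping the new domctl numbers to those vectors in the flask_domctl() switch in xsm/flask/hooks.c, so they stop falling through to the "Denying unknown domctl" path seen in the log above. A rough sketch of the hook side only, using the domain2 class as in the suggestion above; the domctl and vector names here are assumptions for illustration, not the actual fix:

    /* Fragment of the flask_domctl() switch.  current_has_perm() checks the
     * calling domain's permission on the target domain d for the given
     * class/vector pair.  The DOMAIN2__* names are illustrative only. */
    case XEN_DOMCTL_get_paging_mempool_size:
        return current_has_perm(d, SECCLASS_DOMAIN2, DOMAIN2__GET_PAGING_MEMPOOL_SIZE);
    case XEN_DOMCTL_set_paging_mempool_size:
        return current_has_perm(d, SECCLASS_DOMAIN2, DOMAIN2__SET_PAGING_MEMPOOL_SIZE);

Whichever class is chosen (domain, domain2, hvm), the shape of the change is the same; only the class constant and the vector names differ.
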
Since the new hypercall is for managing a memory pool for any domain, though HVM is the only one supported today, imho it belongs under domain/domain2.

Something to consider is that there is another guest memory pool that is managed, the PoD pool, which has a dedicated privilege for it. This leads me to the question of whether access to manage the PoD pool and the paging pool size should be separate accesses, or whether they should fall under the same access. IMHO it should be the latter, as I can see no benefit in disaggregating access to the PoD pool and the paging pool. In fact, I find myself asking whether the managing domain should simply have control over the size of any backing memory pool for the target domain; I see no benefit in discriminating between which backing memory pool a managing domain may manage. With that said, I am open to being convinced otherwise.

Since this is an XSA fix that will be backported, moving the get/set PoD hypercalls under a new permission would be too disruptive. I would recommend introducing the permissions set/getmempools under the domain access vector, which will only control access to the paging pool. Planning can then occur for 4.18 to look at transitioning get/set PoD target to being controlled via get/setmempools.

>> Flask is listed as experimental which means it doesn't technically matter if we break it, but it is used by OpenXT so not fixing it for 4.17 would be rather rude.
>
> It's definitely nicer to have functional Flask in the release.  OpenXT can use a backport if necessary, so it doesn't need to be a release blocker.  Having said that, Flask is a nice feature of Xen, so it would be good to have it functioning in 4.17.

As maintainer I would really prefer not to see 4.17 go out with any part of XSM broken. While it is considered experimental, a status I hope to rectify, it is a long-standing feature that has been kept stable and has a sizeable user base. IMHO it deserves a proper fix before release.

V/r,
Daniel P. Smith
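
As a footnote on the other half of the crash, libxl ignoring the error it gets back and blindly continuing, the shape of that fix is the usual check-and-propagate pattern. A minimal sketch only, not Anthony's actual patch, with the real argument list elided:

    int rc;

    rc = libxl__arch_domain_create(/* gc, d_config, ..., domid */);
    if (rc) {
        /* Previously the return value was dropped here, so a failure while
         * setting up the paging pool went unnoticed and the domain build
         * carried on until the guest crashed as in the log above. */
        goto out;
    }

Propagating rc lets the caller unwind the partially built domain and report the failure, instead of handing the guest an undersized paging pool.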