Re: [PATCH v3 1/4] x86/spec-ctrl: add logic to issue IBPB on exit to guest
On 25.01.2023 22:10, Andrew Cooper wrote:
> On 25/01/2023 3:25 pm, Jan Beulich wrote:
>> In order to be able to defer the context switch IBPB to the last
>> possible point, add logic to the exit-to-guest paths to issue the
>> barrier there, including the "IBPB doesn't flush the RSB/RAS"
>> workaround. Since alternatives, for now at least, can't nest, emit JMP
>> to skip past both constructs where both are needed. This may be more
>> efficient anyway, as the sequence of NOPs is pretty long.
>
> It is very uarch specific as to when a jump is less overhead than a line
> of nops.
>
> In all CPUs liable to be running Xen, even unconditional jumps take up
> branch prediction resource, because all branch prediction is pre-decode
> these days, so branch locations/types/destinations all need deriving
> from %rip and "history" alone.
>
> So whether a branch or a line of nops is better is a tradeoff between
> how much competition there is for branch prediction resource, and how
> efficiently the CPU can brute-force its way through a long line of nops.
>
> But a very interesting datapoint. It turns out that AMD Zen4 CPUs
> macrofuse adjacent nops, including longnops, because it reduces the
> amount of execute/retire resources required. And a lot of
> kernel/hypervisor fastpaths have a lot of nops these days.
>
>
> For us, the "can't nest" is singularly more important than any worry
> about uarch behaviour. We've frankly got much lower hanging fruit than
> worrying about one branch vs a few nops.
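To illustrate the tradeoff (sketch only - the label and feature name
here are made up):

        ALTERNATIVE "jmp .Lskip_example", \
                    "<long replacement sequence>", X86_FEATURE_EXAMPLE
.Lskip_example:

With the feature off, such a site is one JMP plus NOP padding (a single
branch, which still occupies predictor state); dropping the JMP would
instead leave a flat run of long NOPs costing decode/retire bandwidth
proportional to the replacement's length.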
>
>> LFENCEs are omitted - for HVM a VM entry is imminent, which we already
>> deem elsewhere to be a sufficiently serializing event. For 32-bit PV
>> we're going through IRET, which ought to be good enough as well. While
>> 64-bit PV may use SYSRET, there are several more conditional branches
>> there which are all unprotected.
>
> Privilege changes are serialising-ish, and this behaviour has been
> guaranteed moving forwards, although not documented coherently.
>
> CPL (well - privilege, which includes SMM, root/non-root, etc) is not
> written speculatively. So any logic which needs to modify privilege has
> to block until it is known to be an architectural execution path.
>
> This gets us "lfence-like" or "dispatch serialising" behaviour, which is
> also the reason why INT3 is our go-to speculation halting instruction.
> Microcode has to be entirely certain we are going to deliver an
> interrupt/exception/etc before it can start reading the IDT/etc.
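As a sketch of that pattern (illustrative only, not code from this
patch):

        jmp   *%rax       /* indirect branch - may mispredict */
        int3              /* never reached architecturally; exception
                           * delivery cannot start speculatively, so
                           * straight-line speculation past the JMP
                           * is halted here */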
>
> Either way, we've been promised that instructions like IRET,
> SYS{CALL,RET,ENTER,EXIT}, VM{RUN,LAUNCH,RESUME} (and ERET{U,S} in the
> future FRED world) do not, and shall continue not to, execute
> speculatively.
>
> Which in practice means we don't need to worry about Spectre-v1 attack
> against codepaths which hit an exit-from-xen path, in terms of skipping
> protections.
>
> We do need to be careful about memory accesses and potential double
> dereferences, but all the data is on the top of the stack for XPTI
> reasons. About the only concern is v->arch.msrs->* in the HVM path, and
> we're fine with the current layout (AFAICT).
>
>>
>> Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx>
>> ---
>> I have to admit that I'm not really certain about the placement of the
>> IBPB wrt the MSR_SPEC_CTRL writes. For now I've simply used "opposite of
>> entry".
>
> It really doesn't matter. They're independent operations that both need
> doing, and are fully serialising so can't parallelise.
>
> But on this note, WRMSRNS and WRMSRLIST are on the horizon. The CPUs
> which implement these instructions are the ones which also ought not to
> need any adjustments in the exit paths. So I think it is specifically
> not worth trying to make any effort to turn *these* WRMSRs into more
> optimised forms.
>
> But WRMSRLIST was designed specifically for this kind of usecase
> (actually, more for the main context switch path) where you can prepare
> the list of MSRs in memory, including the ability to conditionally skip
> certain entries by adjusting the index field.
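As a sketch of what that could look like, going by my reading of the
ISE description (the table symbols are made up, and I haven't run this
anywhere):

        lea   msr_index_table(%rip), %rsi   /* up to 64 MSR indices */
        lea   msr_value_table(%rip), %rdi   /* matching value table */
        mov   $5, %rcx                      /* valid-bit mask: process
                                             * entries 0 and 2, skip 1 */
        wrmsrlist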
>
>
> It occurs to me, having written this out, that what we actually want
> to do is have slightly custom not-quite-alternative blocks. We have a
> sequence of independent code blocks, and a small block at the end that
> happens to contain an IRET.
>
> We could remove the nops at boot time if we treated it as one large
> region, with the IRET at the end also able to have a variable position,
> and assembled the "active" blocks tightly from the start. Complications
> would include adjusting the IRET extable entry, but this isn't
> insurmountable. Entrypoints are a bit more tricky but could be done by
> packing from the back forward, and adjusting the entry position.
>
> Either way, something to ponder. (It's also possible that it doesn't
> make a measurable difference until we get to FRED, at which point we
> have a set of fresh entry-points to write anyway, and a slight glimmer
> of hope of not needing to pollute them with speculation workarounds...)
>
>> Since we're going to run out of SCF_* bits soon and since the new flag
>> is meaningful only in struct cpu_info's spec_ctrl_flags, we could choose
>> to widen that field to 16 bits right away and then use bit 8 (or higher)
>> for the purpose here.
>
> I really don't think it matters. We've got plenty of room, and the
> flexibility to shuffle, in both structures. It's absolutely not worth
> trying to introduce asymmetries to save 1 bit.
Thanks for all the comments up to here. Just to clarify - I've not spotted
anything there that would result in me being expected to take any action.
>> --- a/xen/arch/x86/include/asm/current.h
>> +++ b/xen/arch/x86/include/asm/current.h
>> @@ -55,9 +55,13 @@ struct cpu_info {
>>
>> /* See asm/spec_ctrl_asm.h for usage. */
>> unsigned int shadow_spec_ctrl;
>> + /*
>> + * spec_ctrl_flags can be accessed as a 32-bit entity and hence needs
>> + * placing suitably.
>
> I'd suggest "is accessed as a 32-bit entity, and wants aligning suitably" ?
I've tried to choose the wording carefully: The 32-bit access is in an
alternative block, so doesn't always come into play. Hence the "can",
not "is". Alignment alone also isn't sufficient here (and mis-aligning
isn't merely a performance problem) - the following three bytes also
need to be valid to access in the first place. Hence "needs" and
"placing", not "wants" and "aligning".
> If I've followed the logic correctly. (I can't say I was specifically
> aware that the bit test instructions didn't have byte forms, but I
> suspect such instruction forms would be very very niche.)
Yes, there not being byte forms of BT* is the sole reason here.
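For illustration (assuming the usual asm-offsets name for the field):

        /* 32-bit RMW: touches spec_ctrl_flags plus the following three
         * bytes, which therefore must be valid to access. */
        btrl  $SCF_exit_ibpb_bit, CPUINFO_spec_ctrl_flags(%rsp)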
>> --- a/xen/arch/x86/include/asm/spec_ctrl.h
>> +++ b/xen/arch/x86/include/asm/spec_ctrl.h
>> @@ -36,6 +36,8 @@
>> #define SCF_verw (1 << 3)
>> #define SCF_ist_ibpb (1 << 4)
>> #define SCF_entry_ibpb (1 << 5)
>> +#define SCF_exit_ibpb_bit 6
>> +#define SCF_exit_ibpb (1 << SCF_exit_ibpb_bit)
>
> One option to avoid the second define is to use ILOG2() with btrl.
Specifically not. The assembler doesn't know the conditional operator,
and the pre-processor won't collapse the expression resulting from
expanding ilog2().
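To illustrate the failure mode (MY_ILOG2 is a made-up stand-in for the
real macro):

#define MY_ILOG2(x) ((x) & 0xaaaaaaaa ? 1 : 0) /* plus higher terms */

        btrl  $MY_ILOG2(SCF_exit_ibpb), %eax

The pre-processor leaves the conditional operator in place, so gas is
handed "$((1 << 6) & 0xaaaaaaaa ? 1 : 0)" and rejects the '?' - hence
the separate SCF_exit_ibpb_bit define.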
>> @@ -272,6 +293,14 @@
>> #define SPEC_CTRL_EXIT_TO_PV \
>> ALTERNATIVE "", \
>> DO_SPEC_CTRL_EXIT_TO_GUEST, X86_FEATURE_SC_MSR_PV; \
>> + ALTERNATIVE __stringify(jmp PASTE(.Lscexitpv_done, __LINE__)), \
>> + __stringify(DO_SPEC_CTRL_EXIT_IBPB \
>> + disp=(PASTE(.Lscexitpv_done, __LINE__) - \
>> + PASTE(.Lscexitpv_rsb, __LINE__))), \
>> + X86_FEATURE_IBPB_EXIT_PV; \
>> +PASTE(.Lscexitpv_rsb, __LINE__): \
>> + ALTERNATIVE "", DO_OVERWRITE_RSB, X86_BUG_IBPB_NO_RET; \
>> +PASTE(.Lscexitpv_done, __LINE__): \
>> DO_SPEC_CTRL_COND_VERW
>
> What's wrong with the normal %= trick?
We're in a C macro here which is then used in assembly code. %= only
works in asm(), though. If we were in an assembler macro, I'd have
used \@. Yet wrapping the whole thing in an assembler macro would, for
my taste, hide too much information from the use sites (in particular
the X86_{FEATURE,BUG}_* which are imo relevant to be visible there).
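For reference, the assembler-macro form would look something like this
(sketch only, with the middle elided - \@ is gas's per-invocation
counter, giving unique labels):

.macro SPEC_CTRL_EXIT_TO_PV_EXAMPLE
        ALTERNATIVE "jmp .Lscexitpv_done\@", ..., X86_FEATURE_IBPB_EXIT_PV
.Lscexitpv_done\@:
.endm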
> The use of __LINE__ makes this
> hard to subsequently livepatch, so I'd prefer to avoid it if possible.
Yes, I was certainly aware this would be a concern. I couldn't think of
a (forward looking) clean solution, though: Right now we have only one
use per source file (the native and compat PV entry.S), so we could use
a context-independent label name. But as you say above, for FRED we're
likely to get new entry points, and they're likely better placed in the
same files.
Jan