Xen project Mailing List

Re: [PATCH v3 1/4] x86/spec-ctrl: add logic to issue IBPB on exit to guest

Date: Mon, 6 Feb 2023 14:58:11 +0100

Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=suse.com; dmarc=pass action=none header.from=suse.com; dkim=pass header.d=suse.com; arc=none

Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=GzqF5368DTaBhVCL2nvfUNDbe9atO/rOWGiNiaI0vb0=; b=gzTscLW3kly2BHwwjyrRlNv3g6pVirKiPPOrajkeOj1BgHZXPGu6APs1zuvgpkSZHhPVKNDLd17mY3UB5HziqcF+hKh5HhEL5llWiQBxc0y/fntlzY8LiHaUMe9+JNMQKlpNX2/pFIkOy9QrFpwrApNSIQIXlN3NqUCNQiyUvhgcLyii1fBNOcqIB8Io4LSeayVmTHN4EoaibZSfw6mtVyP8eIBMnRZnnbSeCYFAhg09uS0AKJKrwcEadGXbn8nDF44vMyF13zGw4m+N2voFZfPn9qAKyHzf1lo921NImep1md0est6K3tKyTYjP8DAtSIfVGfbRrWMyiuqZzvdOmQ==

Arc-seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=jJrYRecOtZJvyTJivxTgmhGSDLbZcp58THSjLaashuLadpwWBD+Df7cBHp2b/Zfk4vsdcnUesBVsq06eGYId5lZLvj6EPof9y5KP15sfinhRCP0KltY4rZ0knejX5KbikkefbtIkhlLcKkEgtToCV8/1XJQ0oTnn2EvoCgwoFeENe1MG0pWTiS2DXB9K+5grBEhHEwQZNGz5ERQPJS9uMCTyLjV3uA63+J/Sf3c/s8pxV99Y/Q+no8pP0XQLnL0bSiTkoiUgPZr1gzKfQO2d7LbdXYdirNEwZ+wIP0iTm/TINS8oVGoJV5YrihsMzEdd2ypF5q0KfzEuBK+V0Alc0A==

Authentication-results: dkim=none (message not signed) header.d=none;dmarc=none action=none header.from=suse.com;

Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, Kevin Tian <kevin.tian@xxxxxxxxx>, Jun Nakajima <jun.nakajima@xxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>

Delivery-date: Mon, 06 Feb 2023 13:58:35 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 26.01.2023 21:27, Andrew Cooper wrote: > On 26/01/2023 8:02 am, Jan Beulich wrote: >> On 25.01.2023 22:10, Andrew Cooper wrote: >>> On 25/01/2023 3:25 pm, Jan Beulich wrote: >>>> In order to be able to defer the context switch IBPB to the last >>>> possible point, add logic to the exit-to-guest paths to issue the >>>> barrier there, including the "IBPB doesn't flush the RSB/RAS" >>>> workaround. Since alternatives, for now at least, can't nest, emit JMP >>>> to skip past both constructs where both are needed. This may be more >>>> efficient anyway, as the sequence of NOPs is pretty long. >>> It is very uarch specific as to when a jump is less overhead than a line >>> of nops. >>> >>> In all CPUs liable to be running Xen, even unconditional jumps take up >>> branch prediction resource, because all branch prediction is pre-decode >>> these days, so branch locations/types/destinations all need deriving >>> from %rip and "history" alone. >>> >>> So whether a branch or a line of nops is better is a tradeoff between >>> how much competition there is for branch prediction resource, and how >>> efficiently the CPU can brute-force its way through a long line of nops. >>> >>> But a very interesting datapoint. It turns out that AMD Zen4 CPUs >>> macrofuse adjacent nops, including longnops, because it reduces the >>> amount of execute/retire resources required. And a lot of >>> kernel/hypervisor fastpaths have a lot of nops these days. >>> >>> >>> For us, the "can't nest" is singularly more important than any worry >>> about uarch behaviour. We've frankly got much lower hanging fruit than >>> worring about one branch vs a few nops. >>> >>>> LFENCEs are omitted - for HVM a VM entry is immanent, which already >>>> elsewhere we deem sufficiently serializing an event. For 32-bit PV >>>> we're going through IRET, which ought to be good enough as well. While >>>> 64-bit PV may use SYSRET, there are several more conditional branches >>>> there which are all unprotected. >>> Privilege changes are serialsing-ish, and this behaviour has been >>> guaranteed moving forwards, although not documented coherently. >>> >>> CPL (well - privilege, which includes SMM, root/non-root, etc) is not >>> written speculatively. So any logic which needs to modify privilege has >>> to block until it is known to be an architectural execution path. >>> >>> This gets us "lfence-like" or "dispatch serialising" behaviour, which is >>> also the reason why INT3 is our go-to speculation halting instruction. >>> Microcode has to be entirely certain we are going to deliver an >>> interrupt/exception/etc before it can start reading the IDT/etc. >>> >>> Either way, we've been promised that all instructions like IRET, >>> SYS{CALL,RET,ENTER,EXIT}, VM{RUN,LAUNCH,RESUME} (and ERET{U,S} in the >>> future FRED world) do, and shall continue to not execute speculatively. >>> >>> Which in practice means we don't need to worry about Spectre-v1 attack >>> against codepaths which hit an exit-from-xen path, in terms of skipping >>> protections. >>> >>> We do need to be careful about memory accesses and potential double >>> dereferences, but all the data is on the top of the stack for XPTI >>> reasons. About the only concern is v->arch.msrs->* in the HVM path, and >>> we're fine with the current layout (AFAICT). >>> >>>> Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx> >>>> --- >>>> I have to admit that I'm not really certain about the placement of the >>>> IBPB wrt the MSR_SPEC_CTRL writes. For now I've simply used "opposite of >>>> entry". >>> It really doesn't matter. They're independent operations that both need >>> doing, and are fully serialising so can't parallelise. >>> >>> But on this note, WRMSRNS and WRMSRLIST are on the horizon. The CPUs >>> which implement these instructions are the ones which also ought not to >>> need any adjustments in the exit paths. So I think it is specifically >>> not worth trying to make any effort to turn *these* WRMSR's into more >>> optimised forms. >>> >>> But WRMSRLIST was designed specifically for this kind of usecase >>> (actually, more for the main context switch path) where you can prepare >>> the list of MSRs in memory, including the ability to conditionally skip >>> certain entries by adjusting the index field. >>> >>> >>> It occurs to me, having written this out, is that what we actually want >>> to do is have slightly custom not-quite-alternative blocks. We have a >>> sequence of independent code blocks, and a small block at the end that >>> happens to contain an IRET. >>> >>> We could remove the nops at boot time if we treated it as one large >>> region, with the IRET at the end also able to have a variable position, >>> and assembles the "active" blocks tightly from the start. Complications >>> would include adjusting the IRET extable entry, but this isn't >>> insurmountable. Entrypoints are a bit more tricky but could be done by >>> packing from the back forward, and adjusting the entry position. >>> >>> Either way, something to ponder. (It's also possible that it doesn't >>> make a measurable difference until we get to FRED, at which point we >>> have a set of fresh entry-points to write anyway, and a slight glimmer >>> of hope of not needing to pollute them with speculation workarounds...) >>> >>>> Since we're going to run out of SCF_* bits soon and since the new flag >>>> is meaningful only in struct cpu_info's spec_ctrl_flags, we could choose >>>> to widen that field to 16 bits right away and then use bit 8 (or higher) >>>> for the purpose here. >>> I really don't think it matters. We've got plenty of room, and the >>> flexibility to shuffle, in both structures. It's absolutely not worth >>> trying to introduce asymmetries to save 1 bit. >> Thanks for all the comments up to here. Just to clarify - I've not spotted >> anything there that would result in me being expected to take any action. > > I'm tempted to suggest dropping the sentence about "might be more > efficient". It's not necessary at all IMO, and it's probably not > correct if you were to compare an atom microserver vs a big Xeon. Hmm - "might" still isn't "will". ISTR us actually discussing to limit how long a sequence of NOPs we'd tolerate before switching to JMP. >>>> --- a/xen/arch/x86/include/asm/spec_ctrl.h >>>> +++ b/xen/arch/x86/include/asm/spec_ctrl.h >>>> @@ -36,6 +36,8 @@ >>>> #define SCF_verw (1 << 3) >>>> #define SCF_ist_ibpb (1 << 4) >>>> #define SCF_entry_ibpb (1 << 5) >>>> +#define SCF_exit_ibpb_bit 6 >>>> +#define SCF_exit_ibpb (1 << SCF_exit_ibpb_bit) >>> One option to avoid the second define is to use ILOG2() with btrl. >> Specifically not. The assembler doesn't know the conditional operator, >> and the pre-processor won't collapse the expression resulting from >> expanding ilog2(). > > Oh that's dull. > > I suspect we could construct equivalent logic with an .if/.else chain, > but I have no idea if the order of evaluation would be conducive to > doing so as part of evaluating an immediate operand. We would > specifically not want something that ended looking like `ilog2 const > "btrl $" ",%eax"`, even though I could see how to make that work. > > It would be nice if we could make something suitable here, but if not we > can live with the second _bit constant. How would .if/.else be able to go inside an expression? You can't even put this in a .macro, as that can't be invoked as part of an expression either. >>>> @@ -272,6 +293,14 @@ >>>> #define SPEC_CTRL_EXIT_TO_PV \ >>>> ALTERNATIVE "", \ >>>> DO_SPEC_CTRL_EXIT_TO_GUEST, X86_FEATURE_SC_MSR_PV; \ >>>> + ALTERNATIVE __stringify(jmp PASTE(.Lscexitpv_done, __LINE__)), \ >>>> + __stringify(DO_SPEC_CTRL_EXIT_IBPB \ >>>> + disp=(PASTE(.Lscexitpv_done, __LINE__) - \ >>>> + PASTE(.Lscexitpv_rsb, __LINE__))), \ >>>> + X86_FEATURE_IBPB_EXIT_PV; \ >>>> +PASTE(.Lscexitpv_rsb, __LINE__): \ >>>> + ALTERNATIVE "", DO_OVERWRITE_RSB, X86_BUG_IBPB_NO_RET; \ >>>> +PASTE(.Lscexitpv_done, __LINE__): \ >>>> DO_SPEC_CTRL_COND_VERW >>> What's wrong with the normal %= trick? >> We're in a C macro here which is then used in assembly code. %= only >> works in asm(), though. If we were in an assembler macro, I'd have >> used \@. Yet wrapping the whole thing in an assembler macro would, for >> my taste, hide too much information from the use sites (in particular >> the X86_{FEATURE,BUG}_* which are imo relevant to be visible there). >> >>> The use of __LINE__ makes this >>> hard to subsequently livepatch, so I'd prefer to avoid it if possible. >> Yes, I was certainly aware this would be a concern. I couldn't think of >> a (forward looking) clean solution, though: Right now we have only one >> use per source file (the native and compat PV entry.S), so we could use >> a context-independent label name. But as you say above, for FRED we're >> likely to get new entry points, and they're likely better placed in the >> same files. > > I was going to suggest using __COUNTER__ but I've just realised why that > wont work. > > But on further consideration, this might be ok for livepatching, so long > as __LINE__ is only ever used with a local label name. By the time it > has been compiled to a binary, the label name wont survive; only the > resulting displacement will. > > I think we probably want to merge this with TEMP_NAME() (perhaps as > UNIQ_NAME(), as it will have to move elsewhere to become common with > this) to avoid proliferating our livepatching reasoning. Even if I had recalled that we have TEMP_NAME() somewhere, I'd be very hesitant to make anything like that into more generally accessible infrastructure. I consider TEMP_NAME() itself already a too widely exposed. The problem with it is that you can easily end up with two uses as the result of expanding something that's all contained on a single source line. Hence I very specifically want to have uses of __LINE__ only in places where either it is the top level source line, or where - as is the case here - it is clear that no two instance of the same or a similar macro will ever sensibly be put on one line. (Even then there's still the risk of using the C macro inside an assembler macro, where two instances would cause problems. But if such appeared, making the assembler macro suitably use \@ instead shouldn't be overly difficult.) Jan

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.