[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v4 01/14] x86/pv: Don't assume that INT $imm8 instructions are two bytes long


  • To: Jan Beulich <jbeulich@xxxxxxxx>
  • From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
  • Date: Mon, 2 Mar 2026 11:43:31 +0000
  • Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=citrix.com; dmarc=pass action=none header.from=citrix.com; dkim=pass header.d=citrix.com; arc=none
  • Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector10001; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=AdKqbfApDT1CGXFK/RBEqtP1Ct6LWFNhZV5lNBvvAYw=; b=bDWWY+OMkgvuKkdiXfs9q58q/T+RvDFT2g/vcqteT957+myjBU3wtjjkuoscXDZQvxj/XjZK6765Lb530FU47W3qrPyCtLZ20Ly2jXkW+GOyt9LlfaJKStHATjLtqNJ2FSdGpzuROiOmnkSmoPffh/6CYjY2W4+mFjgFWDYJ9tW42qqE9w4ZWe5ewbNGnr8JLhTalV7ncum5l0jeikFChDgBdzspzQG2a7sNiAydDnwJ3bGutUBXbLDsHT6aS1sOVRiNsGhVWzY4wYLg6oiL6eFkdcL2I/0wHawl8UgL8glT4qKVbQSjg9BvH9racQx5OZXPsXVmNM5TRGxMfRvhFA==
  • Arc-seal: i=1; a=rsa-sha256; s=arcselector10001; d=microsoft.com; cv=none; b=XMBDs639lkP+09vycFAWTBFVcfFSnuUofY4Af+mSBv7ziicDbkl+voCcz1nbgLCGgOWpgAZphhZGZIm54GlZIqmKrUsgtetVczwxBkrQW8XpSCBd36HvuJxZZX0GHoVLGrTMs+LrKZLbEt6tMJhuuUGA4bkjJP0g6yh63hrWfdcb0NYY06R8HpRwa0hYGpysgTupmaqN3ZRrDoDfwSxWu6C0YLcOxAVZY2dvN6MDTTt8O9fjsdc0V+x7Aax4/c6uZ762yT8azGn2qA1jRyHp5ilLQa40Te47+QsiI40SJxxz9/HfWmZL7vLn8V3H4UgHf4w+ntwqizCQqP9doHROSg==
  • Authentication-results: dkim=none (message not signed) header.d=none;dmarc=none action=none header.from=citrix.com;
  • Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, Xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • Delivery-date: Mon, 02 Mar 2026 11:44:04 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 02/03/2026 11:03 am, Jan Beulich wrote:
> On 28.02.2026 00:16, Andrew Cooper wrote:
>> For INT $N instructions (besides $0x80 for which there is a dedicated fast
>> path), handling is mostly fault-based because of DPL0 gates in the IDT.  This
>> means that when the guest kernel allows the instruction too, Xen must
>> increment %rip to the end of the instruction before passing a trap to the
>> guest kernel.
>>
>> When an INT $N instruction has a prefix, it's longer than two bytes, and Xen
>> will deliver the "trap" with %rip pointing into the middle of the 
>> instruction.
>>
>> Introduce a new pv_emulate_sw_interrupt() which uses x86_insn_length() to
>> determine the instruction length, rather than assuming two.
>>
>> This is a change in behaviour for PV guests, but the prior behaviour cannot
>> reasonably be said to be intentional.
>>
>> This change does not affect the INT $0x80 fastpath.  Prefixed INT $N
>> instructions occur almost exclusively in test code or exploits, and INT $0x80
>> appears to be the only user-usable interrupt gate in contemporary PV guests.
> Whereas for the slow path, while the subtracting of 2 from %rip there isn't
> quite right either, the insn size determination here would then simply yield
> 2 as well, so all is good for that case as well.

I've covered that in the docs patch (patch 2).  Because INT $0x80 is
DPL3 and therefore traps, this is the best we can do.

>
>> @@ -1401,6 +1402,53 @@ int pv_emulate_privileged_op(struct cpu_user_regs 
>> *regs)
>>      return 0;
>>  }
>>  
>> +/*
>> + * Hardware already decoded the INT $N instruction and determinted that 
>> there
>> + * was a DPL issue, hence the #GP.  Xen has already determined that the 
>> guest
>> + * kernel has permitted this software interrupt.
>> + *
>> + * All that is needed is the instruction length, to turn the fault into a
>> + * trap.  All errors are turned back into the original #GP, as that's the
>> + * action that really happened.
>> + */
>> +void pv_emulate_sw_interrupt(struct cpu_user_regs *regs)
>> +{
>> +    struct vcpu *curr = current;
>> +    struct domain *currd = curr->domain;
>> +    struct priv_op_ctxt ctxt = {
>> +        .ctxt.regs = regs,
>> +        .ctxt.lma = !is_pv_32bit_domain(currd),
> The difference may not be overly significant here, but 64-bit guests can run
> 32-bit code, so setting .lma seems wrong in that case. As it ought to be
> largely benign, perhaps to code could even be left as is, just with a comment
> to clarify things?

LMA must be set for a 64bit guest.  Are you confusing it with %cs.l ?

What's potentially wrong is having LMA clear for a 32bit guest, but this
is how pv_emulate_privileged_op() behaves.  LMA is active in real
hardware when running in a compatibility mode segment.

I don't think anything actually cares about LMA. 
pv_emul_read_descriptor() doesn't audit L and instead relies on us not
permitting a PV32 guest to write a 64bit code segment.

>
>> +    };
>> +    struct x86_emulate_state *state;
>> +    uint8_t vector = regs->error_code >> 3;
>> +    unsigned int len, ar;
>> +
>> +    if ( !pv_emul_read_descriptor(regs->cs, curr, &ctxt.cs.base,
>> +                                  &ctxt.cs.limit, &ar, 1) ||
>> +         !(ar & _SEGMENT_S) ||
>> +         !(ar & _SEGMENT_P) ||
>> +         !(ar & _SEGMENT_CODE) )
>> +        goto error;
>> +
>> +    state = x86_decode_insn(&ctxt.ctxt, insn_fetch);
>> +    if ( IS_ERR_OR_NULL(state) )
>> +        goto error;
>> +
>> +    len = x86_insn_length(state, &ctxt.ctxt);
>> +    x86_emulate_free_state(state);
>> +
>> +    /* Note: Checked slightly late to simplify 'state' handling. */
>> +    if ( ctxt.ctxt.opcode != 0xcd /* INT $imm8 */ )
>> +        goto error;
>> +
>> +    regs->rip += len;
>> +    pv_inject_sw_interrupt(vector);
>> +    return;
>> +
>> + error:
>> +    pv_inject_hw_exception(X86_EXC_GP, regs->entry_vector);
> DYM regs->error_code here?

Oh.  I'm sure I fixed this bug already.  I wonder where the fix got lost.

Yes, it should be regs->error_code.

>  Might it alternatively make sense to return a
> boolean here, for ...
>
>> --- a/xen/arch/x86/traps.c
>> +++ b/xen/arch/x86/traps.c
>> @@ -1379,8 +1379,7 @@ void do_general_protection(struct cpu_user_regs *regs)
>>  
>>          if ( permit_softint(TI_GET_DPL(ti), v, regs) )
>>          {
>> -            regs->rip += 2;
>> -            pv_inject_sw_interrupt(vector);
>> +            pv_emulate_sw_interrupt(regs);
>>              return;
> ... the return here to become conditional, leveraging the #GP injection at
> the bottom of this function?

To make this bool, I need to insert a new label into the function.  I
considered that, but delayed it.  do_general_protection() wants a lot
more cleaning up than just this, and proportionability is a concern.

What I was actually considering was splitting out a new pv_handle_GP()
function to remove the ifdef-ary, and doing a wholesale rework at that
point.

~Andrew

P.S. Something I'm still trying to figure out is how to make
guest_mode() able to DCE based on the caller being
entry_from_{xen,pv}(), because the function can be bifurcated for FRED. 
It doesn't appear that the assume() constructs work, probably because
do_general_protection() can't be inlined due to IDT mode.



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.