[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v1 07/15] xen/riscv: introduce tracking of pending vCPU interrupts, part 1

To: Jan Beulich <jbeulich@xxxxxxxx>
From: Oleksii Kurochko <oleksii.kurochko@xxxxxxxxx>
Date: Tue, 13 Jan 2026 13:51:09 +0100
Cc: Alistair Francis <alistair.francis@xxxxxxx>, Connor Davis <connojdavis@xxxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Anthony PERARD <anthony.perard@xxxxxxxxxx>, Michal Orzel <michal.orzel@xxxxxxx>, Julien Grall <julien@xxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxxx
Delivery-date: Tue, 13 Jan 2026 12:51:28 +0000
List-id: Xen developer discussion <xen-devel.lists.xenproject.org>


On 1/7/26 5:28 PM, Jan Beulich wrote:

On 24.12.2025 18:03, Oleksii Kurochko wrote:

This patch is based on Linux kernel 6.16.0.

Introduce a lockless mechanism for tracking pending vCPU interrupts using
atomic bit operations. The design follows a multi-producer, single-consumer
model where the consumer is the vCPU itself.

Two bitmaps are added:
  - irqs_pending — represents interrupts currently pending
  - irqs_pending_mask — represents bits that have changed in irqs_pending

Introduce vcpu_(un)set_interrupt() to mark an interrupt in irqs_pending{_mask}
bitmap(s) to notify vCPU that it has or no an interrupt.

It's not becoming clear how these are going to be used. It's also not clear
to me whether you really need to record these in software: Aren't there
(virtual) registers where they would be more naturally tracked, much like
hardware would do?


Guest (virtual) registers are not used to inject interrupts on RISC-V; for that
purpose, the HVIP register is provided. Even without considering HVIP, using 
guest
(virtual) registers has a downside: if a bit in hideleg is zero, the 
corresponding
bit in VSIP is read-only zero. During a context_switch(), when CSRs are saved,
this means we would not obtain correct values, since some VSIP bits may read as
zero during csr_read().

In fact, this is one of the reasons why we want to track interrupts to be
injected separately. For example, a vtimer may expire while the vCPU is running
on a different pCPU, so we update vCPU->hvip while the vCPU is active elsewhere.
When the vCPU is later switched in during a context_switch(), we would lose the
fact that vCPU->hvip.vtimer was set to 1, because the CSR save function will do:
  vCPU->hvip = csr_read(CSR_HVIP);
and the pending interrupt state would be overwritten.


Furthermore, since you're dealing with two bitmaps, there's no full
atomicity here anyway. The bitmaps are each dealt with atomically, but
the overall update isn't atomic. Whether that's going to be okay can only
be told when also seeing the producer side.


You're correct that the two-bitmap update isn't fully atomic, but this design
is intentional. Here [1], other  is the part 2 of introduction of pending vCPU 
interrupts
and as it requires more stuff to introduce (for example, [2]) I decided not to
introduce it now with some stubs and introduce it when all will be ready for it.

If a producer is interrupted between updating the two bitmaps the worst case is:
vCPU might process stale state for one cycle, this is resolved on the next 
flush when
the mask indicates the bit changed. No interrupt is permanently lost or 
spuriously
generated.

[1] 
https://gitlab.com/xen-project/people/olkur/xen/-/commit/31022d515789a032fd994f9ca90965db089dbbd5
void vcpu_flush_interrupts(struct vcpu *v)
{
register_t *hvip = &v->arch.hvip;

unsigned long mask, val;

if ( ACCESS_ONCE(v->arch.irqs_pending_mask[0]) )
{
mask = xchg(&v->arch.irqs_pending_mask[0], 0UL);
val = ACCESS_ONCE(v->arch.irqs_pending[0]) & mask;

*hvip &= ~mask;
*hvip |= val;
}

/* Flush AIA high interrupts */
vcpu_aia_flush_interrupts(v);

vcpu_update_hvip(v);
}


void vcpu_sync_interrupts(struct vcpu *v)
{
unsigned long hvip;

/* Read current HVIP and VSIE CSRs */
v->arch.vsie = csr_read(CSR_VSIE);

/* Sync-up HVIP.VSSIP bit changes does by Guest */
hvip = csr_read(CSR_HVIP);
if ( (v->arch.hvip ^ hvip) & BIT(IRQ_VS_SOFT, UL) )
{
if ( hvip & BIT(IRQ_VS_SOFT, UL) )
{
if ( !test_and_set_bit(IRQ_VS_SOFT,
&v->arch.irqs_pending_mask) )
set_bit(IRQ_VS_SOFT, &v->arch.irqs_pending);
}
else
{
if ( !test_and_set_bit(IRQ_VS_SOFT,
&v->arch.irqs_pending_mask) )
clear_bit(IRQ_VS_SOFT, &v->arch.irqs_pending);
}
}

/* Sync-up AIA high interrupts */
vcpu_aia_sync_interrupts(v);

/* Sync-up timer CSRs */
vtimer_sync(v);
}


[2] 
https://gitlab.com/xen-project/people/olkur/xen/-/commit/1c06b8b1d1eadfe009a4d6b1a1902fac64d080e9

--- a/xen/arch/riscv/domain.c
+++ b/xen/arch/riscv/domain.c
@@ -5,9 +5,11 @@
  #include <xen/sched.h>
  #include <xen/smp.h>

+#include <asm/bitops.h>

  #include <asm/cpufeature.h>
  #include <asm/csr.h>
  #include <asm/riscv_encoding.h>
+#include <asm/system.h>
  #include <asm/vtimer.h>

static void vcpu_csr_init(struct vcpu *v)

@@ -100,6 +102,9 @@ int arch_vcpu_create(struct vcpu *v)
      if ( is_idle_vcpu(v) )
          return rc;

+ bitmap_zero(v->arch.irqs_pending, RISCV_VCPU_NR_IRQS);

+    bitmap_zero(v->arch.irqs_pending_mask, RISCV_VCPU_NR_IRQS);

This is pointless, as struct vcpu starts out all zero.

@@ -135,3 +140,45 @@ void vcpu_kick(struct vcpu *v)
          smp_send_event_check_mask(cpumask_of(v->processor));
      }
  }
+
+int vcpu_set_interrupt(struct vcpu *v, const unsigned int irq)
+{
+    /*
+     * We only allow VS-mode software, timer, and external
+     * interrupts when irq is one of the local interrupts
+     * defined by RISC-V privilege specification.
+     */
+    if ( irq < IRQ_LOCAL_MAX &&

What use is this? In particular this allows an incoming irq with a huge
number to ...

+         irq != IRQ_VS_SOFT &&
+         irq != IRQ_VS_TIMER &&
+         irq != IRQ_VS_EXT )
+        return -EINVAL;
+
+    set_bit(irq, v->arch.irqs_pending);
+    smp_mb__before_atomic();
+    set_bit(irq, v->arch.irqs_pending_mask);

... overrun both bitmaps.


Agree, it would be better just to drop "irq < IRQ_LOCAL_MAX &&".

--- a/xen/arch/riscv/include/asm/domain.h
+++ b/xen/arch/riscv/include/asm/domain.h
@@ -85,6 +85,22 @@ struct arch_vcpu
      register_t vstval;
      register_t vsatp;
      register_t vsepc;
+
+    /*
+     * VCPU interrupts
+     *
+     * We have a lockless approach for tracking pending VCPU interrupts
+     * implemented using atomic bitops. The irqs_pending bitmap represent
+     * pending interrupts whereas irqs_pending_mask represent bits changed
+     * in irqs_pending.

And hence a set immediately followed by an unset is then indistinguishable
from just an unset (or the other way around).


I think it is distinguishable with the combination of irqs_pending_mask.

  This may not be a problem, but
if it isn't, I think this needs explaining. Much like it is unclear why the
"changed" state needs tracking in the first place.


It is needed to track which bits are changed, irqs_pending only represents
the current state of pending interrupts.CPU might want to react to changes
rather than the absolute state.

Example:
 - If CPU 0 sets an interrupt, CPU 1 needs to notice “something changed”
   to inject it into the VCPU.
 - If CPU 0 sets and then clears the bit before CPU 1 reads it,
   irqs_pending alone shows 0, the transition is lost.
By maintaining irqs_pending_mask, you can detect “this bit changed
recently,” even if the final state is 0.

Also, having irqs_pending_mask allows to flush interrupts without lock:
if ( ACCESS_ONCE(v->arch.irqs_pending_mask[0]) )
{
mask = xchg(&v->arch.irqs_pending_mask[0], 0UL);
val = ACCESS_ONCE(v->arch.irqs_pending[0]) & mask;

*hvip &= ~mask;
*hvip |= val;
}
Without it I assume that we should have spinlcok around access to irqs_pending.

Our approach is modeled around multiple producer
+     * and single consumer problem where the consumer is the VCPU itself.
+     *
+     * DECLARE_BITMAP() is needed here to support 64 vCPU local interrupts
+     * on RV32 host.
+     */
+#define RISCV_VCPU_NR_IRQS 64
+    DECLARE_BITMAP(irqs_pending, RISCV_VCPU_NR_IRQS);
+    DECLARE_BITMAP(irqs_pending_mask, RISCV_VCPU_NR_IRQS);
  }  __cacheline_aligned;

struct paging_domain {

@@ -123,6 +139,9 @@ static inline void update_guest_memory_policy(struct vcpu 
*v,

static inline void arch_vcpu_block(struct vcpu *v) {}+int vcpu_set_interrupt(struct vcpu *v, const unsigned int irq);

+int vcpu_unset_interrupt(struct vcpu *v, const unsigned int irq);

Why the const-s?


As irq number isn't going to be changed inside these functions.

--- a/xen/arch/riscv/include/asm/riscv_encoding.h
+++ b/xen/arch/riscv/include/asm/riscv_encoding.h
@@ -91,6 +91,7 @@
  #define IRQ_M_EXT                     11
  #define IRQ_S_GEXT                    12
  #define IRQ_PMU_OVF                   13
+#define IRQ_LOCAL_MAX          (IRQ_PMU_OVF + 1)

MAX together with "+ 1" looks wrong. What is 14 (which, when MAX is 14,
must be a valid interrupt)? Or if 14 isn't a valid interrupt, please use
NR or NUM.


I didn’t fully understand your idea. Are you suggesting having|IRQ_LOCAL_NR|?
That sounds unclear, as it’s not obvious what it would represent.
Using|MAX_HART| seems better, since it represents the maximum number allowed
for a local interrupt. Any IRQ below that value is considered local, while
values above it are implementation-specific interrupts.

Also, nit: Padding doesn't match with the earlier #define-s (even if in the
quoted text it appears otherwise).


Thanks.

~ Oleksii

Follow-Ups:
- Re: [PATCH v1 07/15] xen/riscv: introduce tracking of pending vCPU interrupts, part 1
  - From: Jan Beulich

References:
- Re: [PATCH v1 07/15] xen/riscv: introduce tracking of pending vCPU interrupts, part 1
  - From: Jan Beulich

Prev by Date: Re: [PATCH v2] x86: Use 32-bit counter for TLB clock on debug builds
Next by Date: [PATCH] x86/cpu-policy: Filter out OS{XSAVE,PKE} in calculate_raw_cpu_policy()
Previous by thread: Re: [PATCH v1 07/15] xen/riscv: introduce tracking of pending vCPU interrupts, part 1
Next by thread: Re: [PATCH v1 07/15] xen/riscv: introduce tracking of pending vCPU interrupts, part 1
Index(es):
- Date
- Thread

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.