
Re: [PATCH for-4.21 03/10] x86/HPET: use single, global, low-priority vector for broadcast IRQ


  • To: Jan Beulich <jbeulich@xxxxxxxx>
  • From: Roger Pau Monné <roger.pau@xxxxxxxxxx>
  • Date: Fri, 17 Oct 2025 10:20:41 +0200
  • Cc: "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Oleksii Kurochko <oleksii.kurochko@xxxxxxxxx>
  • Delivery-date: Fri, 17 Oct 2025 08:21:05 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On Fri, Oct 17, 2025 at 09:15:08AM +0200, Jan Beulich wrote:
> On 16.10.2025 18:27, Roger Pau Monné wrote:
> > On Thu, Oct 16, 2025 at 09:32:04AM +0200, Jan Beulich wrote:
> >> @@ -307,15 +309,13 @@ static void cf_check hpet_msi_set_affini
> >>      struct hpet_event_channel *ch = desc->action->dev_id;
> >>      struct msi_msg msg = ch->msi.msg;
> >>  
> >> -    msg.dest32 = set_desc_affinity(desc, mask);
> >> -    if ( msg.dest32 == BAD_APICID )
> >> -        return;
> >> +    /* This really is only for dump_irqs(). */
> >> +    cpumask_copy(desc->arch.cpu_mask, mask);
> > 
> > If you no longer call set_desc_affinity(), could you adjust the second
> > parameter of hpet_msi_set_affinity() to be unsigned int cpu instead of
> > a cpumask?
> 
> Looks like I could, yes. But then we need to split the function, as it's
> also used as the .set_affinity hook.

I see, I wasn't taking that into account.

> > And here just clear desc->arch.cpu_mask and set the passed CPU.
> 
> Which would still better be a cpumask_copy(), just given cpumask_of(cpu)
> as input.

As it is, yes.

> >> -    msg.data &= ~MSI_DATA_VECTOR_MASK;
> >> -    msg.data |= MSI_DATA_VECTOR(desc->arch.vector);
> >> +    msg.dest32 = cpu_mask_to_apicid(mask);
> > 
> > And here you can just use cpu_physical_id().
> 
> Right. All of which (up to here; but see below) perhaps better a separate,
> follow-on cleanup change.

Yes, it's too much fuss, and I also have plans in that area to deal
with it myself anyway.  I just wanted to avoid changing this now only
to change it again later, but it's too unrelated to put in this change.
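
For the record, roughly what I have in mind for that later cleanup (an
untested sketch only; hpet_msi_set_cpu is a made-up name, and it assumes
the core helper is always handed a single online CPU):

static void hpet_msi_set_cpu(struct irq_desc *desc, unsigned int cpu)
{
    struct hpet_event_channel *ch = desc->action->dev_id;
    struct msi_msg msg = ch->msi.msg;

    /* This really is only for dump_irqs(). */
    cpumask_copy(desc->arch.cpu_mask, cpumask_of(cpu));

    msg.dest32 = cpu_physical_id(cpu);
    msg.address_lo &= ~MSI_ADDR_DEST_ID_MASK;
    msg.address_lo |= MSI_ADDR_DEST_ID(msg.dest32);
    if ( msg.dest32 != ch->msi.msg.dest32 )
        hpet_msi_write(ch, &msg);
}

/* Thin wrapper to keep satisfying the .set_affinity hook. */
static void cf_check hpet_msi_set_affinity(struct irq_desc *desc,
                                           const cpumask_t *mask)
{
    hpet_msi_set_cpu(desc, cpumask_first(mask));
}

set_channel_irq_affinity() could then call hpet_msi_set_cpu(desc, ch->cpu)
directly.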

> >>      msg.address_lo &= ~MSI_ADDR_DEST_ID_MASK;
> >>      msg.address_lo |= MSI_ADDR_DEST_ID(msg.dest32);
> >> -    if ( msg.data != ch->msi.msg.data || msg.dest32 != ch->msi.msg.dest32 )
> >> +    if ( msg.dest32 != ch->msi.msg.dest32 )
> >>          hpet_msi_write(ch, &msg);
> > 
> > A further note here, which ties to my comment on the previous patch
> > about losing the interrupt during the masked window.  If the vector
> > is the same across all CPUs, we no longer need to update the MSI data
> > field, just the address one, which can be done atomically.  We also
> > have signaling from the IOMMU whether the MSI fields need writing.
> 
> Hmm, yes, we can leverage that, as long as we're willing to make assumptions
> here about what exactly iommu_update_ire_from_msi() does: We'd then rely on
> not only the original (untranslated) msg->data not changing, but also the
> translated one. That looks to hold for both Intel and AMD, but it's still
> something we want to be sure we actually want to make the code dependent
> upon. (I'm intending to at least add an assertion to that effect.)

We could still mask when needed, but the masking would be
conditionally done in hpet_msi_write().

It seems however this might be better done as a followup change.
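
Something like the following is what I was thinking of for the
conditional masking (just a sketch, with a helper name I made up; the
caller would still need the irq_desc in hand for the actual
mask/unmask calls):

/* Only a vector (data half) change requires masking around the update. */
static bool hpet_msi_needs_masking(const struct hpet_event_channel *ch,
                                   const struct msi_msg *msg)
{
    return msg->data != ch->msi.msg.data;
}

A destination-only change is a single 32-bit write to the address half
and hence could be done unmasked.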

> > We can avoid the masking, and the possible drop of interrupts.
> 
> Hmm, right. There's nothing wrong with the caller relying on the write
> being atomic now. (Really, continuing to use hpet_msi_write() wouldn't
> be a problem, as re-writing the low half of HPET_Tn_ROUTE() with the
> same value is going to be benign. Unless of course that write was the
> source of the extra IRQs I'm seeing.)

Oh, yes, that's right, we don't even need to avoid the write.

> Taking together with what you said further up, having
> set_channel_irq_affinity() no longer use hpet_msi_set_affinity() as it
> is to ...
> 
> >> @@ -328,7 +328,7 @@ static hw_irq_controller hpet_msi_type =
> >>      .shutdown   = hpet_msi_shutdown,
> >>      .enable           = hpet_msi_unmask,
> >>      .disable    = hpet_msi_mask,
> >> -    .ack        = ack_nonmaskable_msi_irq,
> >> +    .ack        = irq_actor_none,
> >>      .end        = end_nonmaskable_irq,
> >>      .set_affinity   = hpet_msi_set_affinity,
> 
> ... satisfy the use here would then probably be desirable right away.
> The little bit that's left of hpet_msi_set_affinity() would then be
> open-coded in set_channel_irq_affinity().

As you see fit; I'm not going to insist if the changes become too
unrelated to the fix itself.  It can always be done as a followup patch,
especially taking into account that we are in hard code freeze.

> Getting rid of the masking would (hopefully) also get rid of the stray
> IRQs that I'm observing, assuming my guessing towards the reason there
> is correct.
> 
> >> @@ -497,6 +503,7 @@ static void set_channel_irq_affinity(str
> >>      spin_lock(&desc->lock);
> >>      hpet_msi_mask(desc);
> >>      hpet_msi_set_affinity(desc, cpumask_of(ch->cpu));
> >> +    per_cpu(vector_irq, ch->cpu)[HPET_BROADCAST_VECTOR] = ch->msi.irq;
> > 
> > I would set the vector table ahead of setting the affinity, in case we
> > can drop the mask calls around this block of code.
> 
> Isn't there a problematic window either way round? I can make the change,
> but I don't see that addressing anything. The new comparator value will
> be written later anyway, and interrupts up to that point aren't of any
> interest anyway. I.e. it doesn't matter which of the CPUs gets to handle
> them.

Isn't it preferable to get a silent stray interrupt (if the per-CPU
vector table is correctly set up) rather than a message from Xen that
an unknown vector has been received?

If a vector is injected ahead of vector_irq being set, Xen would
complain in do_IRQ() that there's no handler for such a vector.
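
I.e. just swapping the two lines in the hunk above (a sketch of the
ordering I'm suggesting, assuming the surrounding mask calls might
eventually go away):

    spin_lock(&desc->lock);
    hpet_msi_mask(desc);
    /* Publish the vector -> IRQ mapping before the MSI is re-targeted... */
    per_cpu(vector_irq, ch->cpu)[HPET_BROADCAST_VECTOR] = ch->msi.irq;
    /* ... so a stray interrupt on the new CPU finds a handler in do_IRQ(). */
    hpet_msi_set_affinity(desc, cpumask_of(ch->cpu));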

> > I also wonder, do you really need the bind_irq_vector() if you
> > manually set the affinity afterwards, and the vector table plus
> > desc->arch.cpu_mask are also set here?
> 
> At the very least I'd then also need to open-code the setting of
> desc->arch.vector and desc->arch.used. Possibly also the setting of the
> bit in desc->arch.used_vectors. And strictly speaking also the
> trace_irq_mask() invocation.

Let's keep it as-is.

> >> --- a/xen/arch/x86/include/asm/irq-vectors.h
> >> +++ b/xen/arch/x86/include/asm/irq-vectors.h
> >> @@ -18,6 +18,15 @@
> >>  /* IRQ0 (timer) is statically allocated but must be high priority. */
> >>  #define IRQ0_VECTOR             0xf0
> >>  
> >> +/*
> >> + * Low-priority (for now statically allocated) vectors, sharing entry
> >> + * points with exceptions in the 0x10 ... 0x1f range, as long as the
> >> + * respective exception has an error code.
> >> + */
> >> +#define FIRST_LOPRIORITY_VECTOR 0x10
> >> +#define HPET_BROADCAST_VECTOR   X86_EXC_AC
> >> +#define LAST_LOPRIORITY_VECTOR  0x1f
> > 
> > I wonder if it won't be clearer to simply reserve a vector if the HPET
> > is used, instead of hijacking the AC one.  It's one vector less, but
> > arguably now that we unconditionally use physical destination mode our
> > pool of vectors has expanded considerably.
> 
> Well, I'd really like to avoid consuming an otherwise usable vector, if
> at all possible (as per Andrew's FRED plans, that won't be possible
> there anymore then).

If re-using the AC vector is not possible with FRED, we might want to
do this uniformly and always consume a vector then?

> >> --- a/xen/arch/x86/irq.c
> >> +++ b/xen/arch/x86/irq.c
> >> @@ -755,8 +755,9 @@ void setup_vector_irq(unsigned int cpu)
> >>          if ( !irq_desc_initialized(desc) )
> >>              continue;
> >>          vector = irq_to_vector(irq);
> >> -        if ( vector >= FIRST_HIPRIORITY_VECTOR &&
> >> -             vector <= LAST_HIPRIORITY_VECTOR )
> >> +        if ( vector <= (vector >= FIRST_HIPRIORITY_VECTOR
> >> +                        ? LAST_HIPRIORITY_VECTOR
> >> +                        : LAST_LOPRIORITY_VECTOR) )
> >>              cpumask_set_cpu(cpu, desc->arch.cpu_mask);
> > 
> > I think this is wrong.  The low priority vector used by the HPET will
> > only target a single CPU at a time, and hence adding extra CPUs to
> > that mask as part of AP bringup is not correct.
> 
> I'm not sure about "wrong". It's not strictly necessary for the HPET one,
> I expect, but it's generally what would be necessary. For the HPET one,
> hpet_msi_set_affinity() replaces the value anyway. (I can add a sentence
> to this effect to the description, if that helps.)

I do think it's wrong; it's just not harmful per se, apart from showing
up in the output of dump_irqs().  The value in desc->arch.cpu_mask
should be the CPU that's the destination of the interrupt.  In this
case, the HPET interrupt does have a single destination at any given
time, and adding another one will make the output of dump_irqs() show
two destinations, when the interrupt will only ever target a single
CPU.

If anything, you should add the CPU to the affinity set
(desc->affinity), but that's not needed since you already initialize
the affinity mask with cpumask_setall().
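
Purely to illustrate (not something I'm asking for, given the
cpumask_setall() on desc->affinity), the setup_vector_irq() hunk would
then read along these lines:

        vector = irq_to_vector(irq);
        if ( vector >= FIRST_HIPRIORITY_VECTOR &&
             vector <= LAST_HIPRIORITY_VECTOR )
            cpumask_set_cpu(cpu, desc->arch.cpu_mask);
        else if ( vector >= FIRST_LOPRIORITY_VECTOR &&
                  vector <= LAST_LOPRIORITY_VECTOR )
        {
            /*
             * Low-priority vectors target a single CPU at a time: record
             * the new CPU only as an allowed destination, not a current one.
             */
            cpumask_set_cpu(cpu, desc->affinity);
        }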

FWIW, I'm tentatively working on getting rid of the
desc->arch.{cpu,old_cpu,pending}_mask fields and converting them to
plain unsigned ints, once we have dropped logical interrupt delivery
for external interrupts.

Thanks, Roger.
