
Re: [Xen-devel] [PATCH RFC] x86/vMSI-X: avoid missing first unmask of vectors



> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@xxxxxxxx]
> Sent: 01 April 2016 14:43
> To: Paul Durrant
> Cc: Andrew Cooper; Anthony Perard; Stefano Stabellini; xen-devel; Keir
> (Xen.org)
> Subject: RE: [PATCH RFC] x86/vMSI-X: avoid missing first unmask of vectors
> 
> >>> On 01.04.16 at 15:01, <Paul.Durrant@xxxxxxxxxx> wrote:
> >> From: Jan Beulich [mailto:JBeulich@xxxxxxxx]
> >> Sent: 01 April 2016 12:21
> >> >>> On 01.04.16 at 12:56, <Paul.Durrant@xxxxxxxxxx> wrote:
> >> >> From: Jan Beulich [mailto:JBeulich@xxxxxxxx]
> >> >> Sent: 01 April 2016 10:59
> >> >> >>> On 01.04.16 at 11:15, <JBeulich@xxxxxxxx> wrote:
> >> >> > Recent changes to Linux result in there just being a single unmask
> >> >> > operation prior to expecting the first interrupts to arrive. However,
> >> >> > we've had a chicken-and-egg problem here: Qemu invokes
> >> >> > xc_domain_update_msi_irq(), ultimately leading to
> >> >> > msixtbl_pt_register(), upon seeing that first unmask operation. Yet
> >> >> > for msixtbl_range() to return true (in order for msixtbl_write() to get
> >> >> > invoked at all) msixtbl_pt_register() must have completed.
> >> >> >
> >> >> > Deal with this by snooping suitable writes in msixtbl_range() and
> >> >> > triggering the invocation of msix_write_completion() from
> >> >> > msixtbl_pt_register() when that happens in the context of a still in
> >> >> > progress vector control field write.
> >> >> >
> >> >> > Note that the seemingly unrelated deletion of the redundant
> >> >> > irq_desc->msi_desc checks in msixtbl_pt_register() is to make clear
> to
> >> >> > any compiler version used that the "msi_desc" local variable isn't
> >> >> > being used uninitialized. (Doing the same in msixtbl_pt_unregister()
> is
> >> >> > just for consistency reasons.)
> >> >> >
> >> >> > Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx>
> >> >> > ---
> >> >> > TODO: How to deal with REP MOVS to the MSI-X table (in
> >> msixtbl_range())?
> >> >>
> >> >> Some more detail on the thoughts I so far had for this aspect:
> >> >> It has always been puzzling me that the hypervisor doesn't
> >> >> see _all_ the MSI-X table accesses (which is a result of the
> >> >> addresses only getting registered via XEN_DOMCTL_bind_pt_irq);
> >> >> it's quite natural that this is at least a latent issue, possibly
> >> >> causing guest misbehavior. I cannot, however, currently see any
> >> >> way to address this without altering both Xen and qemu, since for
> >> >> Xen to see all accesses it would need to become aware of the
> >> >> GPA of the MSI-X table much earlier (read: before the domain
> >> >> actually starts, or at the latest when the domain first enables
> >> >> memory decoding on the device).
> >> >>
> >> >> The mapping of the MMIO BARs of the device into guest memory,
> >> >> however, intentionally excludes the page(s) covering the MSI-X
> >> >> table, so the hypervisor can't become aware of them by just
> >> >> looking at data it gets presented today. Hence either we need to
> >> >> add some new hypercall for qemu to invoke, or we need to make
> >> >> qemu map the full BAR ranges, filtering out MSI-X table pages
> >> >> in the hypervisor (using those mapping requests just to learn the
> >> >> GPA of the MSI-X table, without entering them into P2M).
> >> >>
> >> >> Unless someone can think of a way which doesn't require altering
> >> >> both qemu and Xen (creating the well known compatibility issues
> >> >> between unmatched pairs), I think the patch as presented should
> >> >> be okay without handling this case, i.e. best possible effort, and a
> >> >> subsequent change then ought to be to deal with this by changing
> >> >> both components. In which case I'd suggest that the change here
> >> >> go into 4.7, but the full fix would then likely need deferring until
> >> >> 4.8.
> >> >
> >> > I guess it could be handled entirely in Xen if we are willing to snoop on
> >> > PCI configuration. It would not be too hard to snoop guest writes to the
> >> BARs
> >> > in config space so that Xen can keep track of where they are. Snooping
> on
> >> the
> >> > MSI-X capability could then tell Xen when to start interposing on the
> table,
> >> > and allow it to discover the GPA at that point (via the BIR and offset
> >> > values).
> >>
> >> Well, that's a possibility, but won't - afaict - work without qemu's
> >> help at another point: So far we don't know the guest's PCI bus
> >> topology, hence we can't correlate vBAR writes we might snoop
> >> with the physical devices they correspond to.
> >
> > Does Xen need to know that correspondence just to track state? I thought
> the
> > problem here was that Xen does not see every guest access to an MSI-X
> table.
> > If Xen always interposes on MSI-X tables then it can at least track the 
> > state
> > of the emulated table, even if we end up just forwarding the access for
> QEMU
> > to handle at first. When the mapping is created to the actual h/w table
> then,
> > presumably, Xen's idea of state should correspond to QEMU's.
> 
> But Xen doesn't see the guest view of config space,

Well Xen interposes on every single config cycle so arguably it sees exactly 
what the guest sees.

> and the
> whereabouts of the MSI-X table are in read-only config space fields
> (i.e. snooping writes to those doesn't make sense, as the guest may
> never write them, and anything it did write would just get dropped
> on the floor).

I was thinking of snooping the reads. The guest has to find out where the table 
is before it can write to it, right? Or Xen could even pro-actively read the 
capability chain of any pcidev registered with it.
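For illustration, the discovery step described here could look roughly like the following: walk the capability chain in a snooped view of config space, find the MSI-X capability (ID 0x11), and extract the BIR and table offset from the capability's table-offset dword. A minimal, self-contained sketch; the helper name and the flat 256-byte config-space view are assumptions for illustration, not Xen code:

```c
#include <stdint.h>
#include <string.h>

#define PCI_STATUS         0x06   /* status register (16-bit)            */
#define PCI_STATUS_CAP     0x10   /* capability list present             */
#define PCI_CAP_PTR        0x34   /* offset of capability list pointer   */
#define PCI_CAP_ID_MSIX    0x11   /* MSI-X capability ID                 */
#define PCI_MSIX_TABLE_OFF 4      /* table offset/BIR dword in the cap   */

/* Hypothetical helper: locate the MSI-X table's BIR and offset by
 * walking the capability chain in a snooped copy of config space.
 * Returns 1 on success, 0 if no MSI-X capability is present. */
static int find_msix_table(const uint8_t cfg[256],
                           unsigned int *bir, uint32_t *offset)
{
    uint8_t pos;

    if ( !(cfg[PCI_STATUS] & PCI_STATUS_CAP) )
        return 0;

    for ( pos = cfg[PCI_CAP_PTR] & ~3; pos; pos = cfg[pos + 1] & ~3 )
    {
        if ( cfg[pos] == PCI_CAP_ID_MSIX )
        {
            uint32_t dword;

            memcpy(&dword, &cfg[pos + PCI_MSIX_TABLE_OFF], 4);
            *bir = dword & 7;       /* low 3 bits: BAR indicator    */
            *offset = dword & ~7u;  /* remaining bits: table offset */
            return 1;
        }
    }
    return 0;
}
```

The GPA of the table then follows from the snooped guest view of the BAR selected by the BIR, plus the offset.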

> 
> And additionally msixtbl_addr_to_desc() needs to know the physical
> device.
> 

Yes, but msixtbl_range() could be trivially changed to accept any access where 
msixtbl_find_entry() returns non-NULL. That would allow msixtbl_write() to 
manipulate entry->flags even if msixtbl_addr_to_desc() returns NULL.
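A sketch of that relaxation, using simplified stand-ins for Xen's msixtbl bookkeeping (all structure and function names here are illustrative; the real msixtbl_* functions live in Xen's vmsi.c):

```c
#include <stdint.h>
#include <stddef.h>

/* Simplified stand-in for Xen's per-table tracking structure. */
struct msixtbl_entry {
    uint64_t gtable;      /* guest-physical base of the MSI-X table */
    unsigned int flags;   /* per-vector state bits tracked by Xen   */
};

struct msi_desc;          /* opaque: the physical-device descriptor */

/* Stand-in for msixtbl_find_entry(): NULL if the address falls in no
 * registered table (one page assumed here for simplicity). */
static struct msixtbl_entry *find_entry(struct msixtbl_entry *tbl,
                                        uint64_t addr)
{
    if ( tbl && addr >= tbl->gtable && addr < tbl->gtable + 0x1000 )
        return tbl;
    return NULL;
}

/* The suggested relaxation: accept any access for which an entry
 * exists, regardless of whether a physical msi_desc resolves. */
static int range_accepts(struct msixtbl_entry *tbl, uint64_t addr)
{
    return find_entry(tbl, addr) != NULL;
}

/* Write path: update entry->flags even when the physical-descriptor
 * lookup (msixtbl_addr_to_desc() in Xen) returns NULL. */
static void record_write(struct msixtbl_entry *entry,
                         struct msi_desc *desc, unsigned int mask_bit)
{
    entry->flags = mask_bit;    /* track guest-visible state first */
    if ( desc )
    {
        /* only with a resolved descriptor would the real code
         * propagate the mask bit to the hardware vector */
    }
}
```

The point being that state tracking and hardware propagation are separable: the former needs only the entry, the latter the descriptor.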

  Paul

> Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 

