Re: [PATCH 6/9] vpci/header: Handle p2m range sets per BAR
On 09.09.21 11:24, Jan Beulich wrote:
> On 09.09.2021 07:22, Oleksandr Andrushchenko wrote:
>> On 08.09.21 18:00, Jan Beulich wrote:
>>> On 08.09.2021 16:31, Oleksandr Andrushchenko wrote:
>>>> On 06.09.21 17:47, Jan Beulich wrote:
>>>>> On 03.09.2021 12:08, Oleksandr Andrushchenko wrote:
>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@xxxxxxxx>
>>>>>>
>>>>>> Instead of handling a single range set that contains all the memory
>>>>>> regions of all the BARs and ROM, have them per BAR.
>>>>> Without looking at how you carry out this change - this looks wrong (as
>>>>> in: wasteful) to me. Despite ...
>>>>>
>>>>>> This is in preparation for making non-identity mappings in p2m for the
>>>>>> MMIOs/ROM.
>>>>> ... the need for this, every individual BAR is still contiguous in both
>>>>> host and guest address spaces, so it can be represented as a single
>>>>> (start,end) tuple (or a pair thereof, to account for both host and guest
>>>>> values). No need to use a rangeset for this.
>>>> First of all, this change is in preparation for non-identity mappings,
>>> I'm afraid I continue to not see how this matters in the discussion at
>>> hand. I'm fully aware that this is the goal.
>>>
>>>> e.g. currently we collect all the memory ranges which require mappings
>>>> into a single range set, then we cut off MSI-X regions and then use
>>>> range set functionality to call a callback for every memory range left
>>>> after MSI-X. This works perfectly fine for 1:1 mappings, e.g. what we
>>>> have as the range set's starting address is what we want to be
>>>> mapped/unmapped.
>>>> Why range sets? Because they allow partial mappings, e.g. you can map
>>>> part of the range and return back and continue from where you stopped.
>>>> And if I understand that correctly, that was the initial intention of
>>>> introducing range sets here.
>>>>
>>>> For non-identity mappings this becomes not that easy. Each individual
>>>> BAR may be mapped differently according to what the guest OS has
>>>> programmed as bar->guest_addr (the guest view of the BAR start).
>>> I don't see how the rangeset helps here. You have a guest and a host pair
>>> of values for every BAR. Pages with e.g. the MSI-X table may not be mapped
>>> to their host counterpart address, yes, but you need to special-case
>>> these anyway: accesses to them need to be handled. Hence I'm having a hard
>>> time seeing how a per-BAR rangeset (which will cover at most three distinct
>>> ranges afaict, which is way too little for this kind of data organization
>>> imo) can gain you all this much.
>>>
>>> Overall the 6 BARs of a device will cover up to 8 non-adjacent ranges. IOW
>>> the majority (4 or more) of the rangesets will indeed merely represent a
>>> plain (start,end) pair (or be entirely empty).
>> First of all, let me explain why I decided to move to per-BAR range sets.
>> Before this change all the MMIO regions and MSI-X holes were accounted for
>> by a single range set, e.g. we go over all BARs and add MMIOs and then
>> subtract MSI-X from there. When it comes to mapping/unmapping we have an
>> assumption that the starting address of each element in the range set is
>> equal to the map/unmap address, i.e. we have identity mapping. Please note
>> that the range set accepts a single private data parameter, which is
>> enough to hold all required data about the pdev in common, but there is
>> no way to provide any per-BAR data.
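To make the identity mapping assumption above concrete, here is a simplified
sketch of the current single-range-set path (written from memory, so not a
verbatim copy of xen/drivers/vpci/header.c; structure and field names may
differ slightly). The only context the callback receives is the per-pdev
map_data, so the element start 's' has to serve as both GFN and MFN:

/* Simplified sketch, from memory - not verbatim current code. */
struct map_data {
    struct domain *d;
    bool map;
};

static int map_range(unsigned long s, unsigned long e, void *data,
                     unsigned long *c)
{
    const struct map_data *map = data;
    unsigned long size = e - s + 1;
    int rc;

    /*
     * Identity mapping: the range set element's start 's' is used both as
     * GFN and MFN, and the single 'data' pointer only carries per-pdev
     * context, so there is nowhere to pass e.g. a per-BAR guest address.
     */
    rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, _mfn(s))
                  : unmap_mmio_regions(map->d, _gfn(s), size, _mfn(s));

    if ( rc == 0 )
        *c += size;  /* whole range processed */
    else if ( rc > 0 )
        *c += rc;    /* partial progress, to be continued on the next pass */

    return rc;
}

The range set itself lives in struct vpci_vcpu and is consumed with this
callback (via rangeset_consume_ranges(), IIRC), which is exactly why there
is no room to pass bar->guest_addr down here.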
>>
>> Now that we want non-identity mappings, we can no longer assume
>> that starting address == mapping address, and we need to provide
>> additional information on how to map, which is now per-BAR.
>> This is why I decided to use per-BAR range sets.
>>
>> One of the solutions may be that we form an additional list of
>> structures in a form like (I omit some of the fields):
>>
>> struct non_identity {
>>     unsigned long start_mfn;
>>     unsigned long start_gfn;
>>     unsigned long size;
>> };
>>
>> So this way, when the range set gets processed, we go over the list
>> and find the corresponding list element which describes the range set
>> entry being processed (s, e, data):
>>
>> static int map_range(unsigned long s, unsigned long e, void *data,
>>                      unsigned long *c)
>> {
>>     [snip]
>>     go over the list elements
>>     if ( list->start_mfn == s )
>>         found, can use list->start_gfn for mapping
>>     [snip]
>> }
>>
>> This has some complications, as map_range may be called multiple times
>> for the same range: if {unmap|map}_mmio_regions was not able to complete
>> the operation, it returns the number of pages it was able to process:
>>
>> rc = map->map ? map_mmio_regions(map->d, start_gfn,
>>                                  size, _mfn(s))
>>               : unmap_mmio_regions(map->d, start_gfn,
>>                                    size, _mfn(s));
>>
>> In this case we need to update the list item:
>>
>> list->start_mfn += rc;
>> list->start_gfn += rc;
>> list->size -= rc;
>>
>> and if all the pages of the range were processed, delete the list entry.
>>
>> With respect to creating the list, everything is also not that complicated:
>> while processing each BAR, create a list entry and fill it with mfn, gfn
>> and size. Then, if an MSI-X region is present within this BAR, break the
>> list item into multiple ones with respect to the holes, for example:
>>
>> MMIO 0       list item
>> MSI-X hole 0
>> MMIO 1       list item
>> MSI-X hole 1
>>
>> Here, instead of a single BAR description, we now have 2 list elements
>> describing the BAR without the MSI-X regions.
>>
>> All the above still relies on a single range set per pdev as it is in the
>> original code. We can go this route if we agree this is more acceptable
>> than the range sets per BAR.
> I guess I am now even more confused: I can't spot any "rangeset per pdev"
> either. The rangeset I see being used doesn't get associated with anything
> that's device-related; it gets accumulated as a transient data structure,
> but _all_ devices owned by a domain influence its final content.
You are absolutely right here, sorry for the confusion: in the current code
the range set belongs to struct vpci_vcpu, e.g.
/* Per-vcpu structure to store state while {un}mapping of PCI BARs. */
>
> If you associate rangesets with either a device or a BAR, I'm failing to
> see how you'd deal with multiple BARs living in the same page (see also
> below).
This was exactly the issue I ran into while emulating RTL8139 on QEMU: the
MMIOs are 128 bytes long and Linux put them on the same page. So, it is a
known limitation that we can't deal with [1].
>
> Considering that a rangeset really is a compressed representation of a
> bitmap, I wonder whether this data structure is suitable at all for what
> you want to express. You have two pieces of information to carry / manage,
> after all: which ranges need mapping, and what their GFN <-> MFN
> relationship is. Maybe the latter needs expressing differently in the
> first place?
I proposed a list which can be extended to hold all the required
information there, e.g. MFN, GFN, size etc.
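To make that more concrete, roughly something like the following (just a
sketch to illustrate the idea; the names struct non_identity, entry, etc.
are illustrative and not taken from an actual patch):

/* Per-vCPU list element describing one pending non-identity mapping. */
struct non_identity {
    struct list_head list;
    unsigned long start_mfn;  /* host view: first frame to map from */
    unsigned long start_gfn;  /* guest view: first frame to map to */
    unsigned long size;       /* number of pages still to process */
};

and on partial progress ({un}map_mmio_regions() returning the number of
pages it managed to process) the element is advanced rather than dropped:

    /* 'map' and 'entry' come from the surrounding (hypothetical) loop. */
    rc = map->map ? map_mmio_regions(map->d, _gfn(entry->start_gfn),
                                     entry->size, _mfn(entry->start_mfn))
                  : unmap_mmio_regions(map->d, _gfn(entry->start_gfn),
                                       entry->size, _mfn(entry->start_mfn));
    if ( rc > 0 )
    {
        /* Partially processed: advance the element and retry later. */
        entry->start_mfn += rc;
        entry->start_gfn += rc;
        entry->size -= rc;
    }
    else if ( rc == 0 )
    {
        /* Fully processed: drop the element. */
        list_del(&entry->list);
        xfree(entry);
    }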
> And then in a way that's ensuring by its organization that
> no conflicting GFN <-> MFN mappings will be possible?
If you mean the use-case above with different device MMIOs living in the
same page, then my understanding is that such a use-case is not
supported [1].
> Isn't this
> precisely what is already getting recorded in the P2M?
>
> I'm also curious what your plan is to deal with BARs overlapping in MFN
> space: In such a case, the guest cannot independently change the GFNs of
> any of the involved BARs. (Same the other way around: overlaps in GFN
> space are only permitted when the same overlap exists in MFN space.) Are
> you excluding (forbidding) this case? If so, did I miss you saying so
> somewhere?
Again, [1].
> Yet if no overlaps are allowed in the first place, what
> modify_bars() does would be far more complicated than necessary in the
> DomU case, so it may be worthwhile considering to deviate more from how
> Dom0 gets taken care of. In the end a guest writing a BAR is merely a
> request to change its P2M. That's very different from Dom0 writing a BAR,
> which means the physical BAR also changes, and hence the P2M changes in
> quite different a way.
So, what is the difference then, besides the fact that hwdom really writes
to a BAR? To me most of the logic remains the same: we need to map/unmap.
The only difference I see here is that for Dom0 we have 1:1 mappings at the
moment, while for a guest we need GFN <-> MFN translation.

Anyway, I am open to any decision on what would be the right approach here:
1. Use range sets per BAR, as in the patch
2. Remove range sets completely and have a per-vCPU list with mapping data,
   as I described above
3. Anything else?
>
> Jan

Thank you,
Oleksandr

[1] https://wiki.xenproject.org/wiki/Xen_PCI_Passthrough#I_get_.22non-page-aligned_MMIO_BAR.22_error_when_trying_to_start_the_guest