
Re: [Xen-devel] Xen/arm: Virtual ITS command queue handling



On Tue, 2015-05-12 at 18:35 +0100, Julien Grall wrote:
> > Message signalled interrupts are translated into an LPI via a
> > translation table which must be configured for each device which can
> > generate an MSI. The ITS uses the device id of the originating device
> > to look up the corresponding translation table. Device IDs are
> > typically described via system firmware, e.g. the ACPI IORT table or
> > via device tree.
> > 
> > The ITS is configured and managed, including establishing a
> > Translation Table for each device, via an in memory ring shared
> 
> s/an in/a/?

Either is acceptable IMHO. "an (in memory) ring" is how you would parse
what I've written.

> > # vITS
> > 
> > A guest domain which is allowed to use ITS functionality (i.e. has
> > been assigned pass-through devices which can generate MSIs) will be
> > presented with a virtualised ITS.
> > 
> > Accesses to the vITS registers will trap to Xen and be emulated and a
> > virtualised Command Queue will be provided.
> > 
> > Commands entered onto the virtual Command Queue will be translated
> > into physical commands (this translation is described in the GIC
> > specification).
> > 
> > XXX there are other aspects to virtualising the ITS (LPI collection
> > management, assignment of LPI ranges to guests).
> 
> Another aspect to think is device management.

Added.

> > However these are not
> > currently considered here. XXX Should they be/do they need to be?
> 
> I think those aspects are straightforward and don't require any
> specific design docs. We could discuss them during the
> implementation (number of LPIs supported, LPI allocations...).

OK

> > On read from the virtual `CREADR` iff the vits_cq is such that
> 
> s/iff/if/

"iff" is a shorthand for "if and only if". Apparently not as common as I
think it is though!

> 
> > commands are outstanding then a scheduling pass is attempted (in order
> > to update `vits_cq.creadr`). The current value of `vits_cq.creadr` is
> > then returned.
> > 
> > ### pITS Scheduling
> 
> I'm not sure if the design document is the right place to talk about it.
> 
> If a domain dies during the process, how would it affect the scheduler?


So I think we have to wait for the outstanding commands to finish.

Vague thoughts:

        We can't free a `vits_cq` while it has things on the physical
        command queue, and we cannot cancel things which are on the
        command queue.

        So we must wait.

        Obviously don't enqueue anything new onto the pITS if
        `d->is_dying`.

        `domain_relinquish_resources()` waits (somehow, with suitable
        continuations etc) for anything which the `vits_cq` has
        outstanding to be completed so that the data structures can be
        cleared.

?

I've added that to a new section "Domain Shutdown" right after
scheduling.
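
To make that a bit more concrete, I'm imagining the relinquish side
looking roughly like this (sketch only, every name below is made up,
and the real continuation plumbing needs more thought):

    /* Hypothetical helper called from the arch side of
     * domain_relinquish_resources(). Sketch only, names invented. */
    static int vits_relinquish(struct domain *d)
    {
        struct vits_cq *vcq = d->arch.vits_cq;      /* assumed field */

        if ( !vcq )
            return 0;

        /*
         * d->is_dying is already set, so the scheduler will not take any
         * further commands from this vits_cq; we only have to wait for
         * whatever it already has on the physical Command Queue.
         */
        if ( vits_cq_commands_on_pits(vcq) )
            return -ERESTART;       /* ask to be called again later */

        xfree(vcq);
        d->arch.vits_cq = NULL;
        return 0;
    }

i.e. nothing clever, just keep saying "not done yet" until the
completion side of the scheduler has drained our slots.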

> >   on that vits;
> > * On receipt of an interrupt notification arising from Xen's own use
> >   of `INT`; (see discussion under Completion)
> > * On any interrupt injection arising from a guest's use of the `INT`
> >   command; (XXX perhaps, see discussion under Completion)
> 
> With all the solutions suggested, it is very likely that we will try
> to execute multiple scheduling passes at the same time.
> 
> One way is to wait until the previous pass has finished. But that would
> mean that the scheduler would be executed very often.
> 
> Or maybe you plan to offload the scheduler to a softirq?

Good point.

A soft irq might be one solution, but it is problematic during emulation
of `CREADR`, when we would like to do a pass immediately to complete any
operations outstanding for the domain doing the read.

Or just using `spin_trylock()` and not bothering if one is already in
progress might be another, but that has similar problems.

Or we could defer only scheduling from `INT` (either guest or Xen's own)
to a softirq but do ones from `CREADR` emulation synchronously? The
softirq would be run on return from the interrupt handler but multiple
such would be coalesced I think?

I've not updated the doc (apart from a note to remember the issue) while
we think about this.
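
For reference, the trylock variant I have in mind is roughly the
following (again just a sketch, names invented):

    /* Sketch: at most one scheduling pass at a time. A pass triggered by
     * CREADR emulation must wait (the guest needs an up to date
     * vits_cq.creadr); passes triggered by INT can simply be skipped. */
    static void pits_schedule(struct pits *pits, bool for_creadr_read)
    {
        if ( !spin_trylock(&pits->sched_lock) )
        {
            if ( !for_creadr_read )
                return;                        /* someone else is at it */
            spin_lock(&pits->sched_lock);      /* we need the result */
        }

        pits_process_completions(pits);    /* advances vits_cq.creadr */
        pits_refill_command_queue(pits);   /* see "Filling ..." below */

        spin_unlock(&pits->sched_lock);
    }

That bounds us to one concurrent pass, but it doesn't address how often
passes run, so it's not a complete answer either.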

> 
> > Each scheduling pass will:
> > 
> > * Read the physical `CREADR`;
> > * For each command between `pits.last_creadr` and the new `CREADR`
> >   value process completion of that command and update the
> >   corresponding `vits_cq.creadr`.
> > * Attempt to refill the pITS Command Queue (see below).
> > 
> > ### Filling the pITS Command Queue.
> > 
> > Various algorithms could be used here. For now a simple proposal is
> > to traverse the `pits.schedule_list` starting from where the last
> > refill finished (i.e. not from the top of the list each time).
> > 
> > If a `vits_cq` has no pending commands then it is removed from the
> > list.
> > 
> > If a `vits_cq` has some pending commands then `min(pits-free-slots,
> > vits-outstanding, VITS_BATCH_SIZE)` will be taken from the vITS
> > command queue, translated and placed onto the pITS
> > queue. `vits_cq.progress` will be updated to reflect this.
> > 
> > Each `vits_cq` is handled in turn in this way until the pITS Command
> > Queue is full or there are no more outstanding commands.
> > 
> > There will likely need to be a data structure which shadows the pITS
> > Command Queue slots with references to the `vits_cq` which has a
> > command currently occupying that slot and the corresponding index into
> > the virtual command queue, for use when completing a command.
> > 
> > `VITS_BATCH_SIZE` should be small, TBD say 4 or 8.
> > 
> > Possible simplification: If we arrange that no guest ever has multiple
> > batches in flight (which can occur if we wrap around the list several
> > times) then we may be able to simplify the book keeping
> > required. However this may need some careful thought wrt fairness for
> > guests submitting frequent small batches of commands vs those sending
> > large batches.
> 
> I'm concerned about the time consumed by filling the pITS Command Queue.
> 
> AFAIU the process suggested, Xen will inject small batches as long as the
> physical command queue is not full.

> Let's take a simple case: only a single domain is using the vITS on the
> platform. If it injects a huge number of commands, Xen will split it
> into lots of small batches. All batches will be injected in the same
> pass as long as they fit in the physical command queue. Am I correct?

That's how it is currently written, yes. With the "possible
simplification" above the answer is no, only a batch at a time would be
written for each guest.

BTW, it doesn't have to be a single guest; the sum total of the
injections across all guests could also take a similar amount of time.
Is that a concern?

> I think we have to restrict the total number of batches (i.e. for all
> the domains) injected in a single scheduling pass.
> 
> I would even tend to allow only one in-flight batch per domain. That
> would limit the possible problem I pointed out.

This is the "possible simplification" I think. Since it simplifies other
things (I think) as well as addressing this issue I think it might be a
good idea.
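
To make the simplification concrete, the refill loop would then end up
something like this (sketch only, hypothetical names, and ignoring the
"continue from where the last refill stopped" detail):

    /* Sketch of "Filling the pITS Command Queue" with at most one batch
     * in flight per vits_cq. All names are illustrative. */
    static void pits_refill_command_queue(struct pits *pits)
    {
        struct vits_cq *vcq, *tmp;

        list_for_each_entry_safe ( vcq, tmp, &pits->schedule_list,
                                   sched_list )
        {
            unsigned int free = pits_free_slots(pits);
            unsigned int todo;

            if ( !free )
                break;                     /* physical queue is full */

            if ( vcq->batch_in_flight )
                continue;                  /* the simplification */

            if ( !vits_cq_outstanding(vcq) )
            {
                list_del_init(&vcq->sched_list);
                continue;
            }

            todo = min(min(free, vits_cq_outstanding(vcq)),
                       (unsigned int)VITS_BATCH_SIZE);

            /* Translate todo commands and record them in the shadow
             * slots so completion can find this vcq again. */
            vits_translate_batch(pits, vcq, todo);
            vcq->progress += todo;         /* however progress is measured */
            vcq->batch_in_flight = true;
        }
    }

`batch_in_flight` would be cleared by the completion side once the last
command of that batch is seen to have passed `CREADR`.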

> 
> > 
> > ### Completion
> > 
> > It is expected that commands will normally be completed (resulting in
> > an update of the corresponding `vits_cq.creadr`) via guest read from
> > `CREADR`. This will trigger a scheduling pass which will ensure the
> > `vits_cq.creadr` value is up to date before it is returned.
> > 
> > A guest which does completion via the use of `INT` cannot observe
> > `CREADR` without reading it, so updating on read from `CREADR`
> > suffices from the point of view of the guest's observation of the
> > state. (Of course we will inject the interrupt at the designated point
> > and the guest may well then read `CREADR`)
> > 
> > However in order to keep the pITS Command Queue moving along we need
> > to consider what happens if there are no `INT` based events nor reads
> > from `CREADR` to drive completion and therefore refilling of the Queue
> > with other outstanding commands.
> > 
> > A guest which enqueues some commands and then never checks for
> > completion cannot itself block things because any other guest which
> > reads `CREADR` will drive completion. However if _no_ guest reads from
> > `CREADR` then completion will not occur and this must be dealt with.
> > 
> > Even if we include completion on `INT`-based interrupt injection then
> > it is possible that the pITS queue may not contain any such
> > interrupts, either because no guest is using them or because the
> > batching means that none of them are enqueued on the active ring at
> > the moment.
> > 
> > So we need a fallback to ensure that the queue keeps moving. There are
> > several options:
> > 
> > * A periodic timer in Xen which runs whenever there are outstanding
> >   commands in the pITS. This is simple but pretty sucky.
> > * Xen injects its own `INT` commands into the pITS ring. This requires
> >   figuring out a device ID to use.
> > 
> > The second option is likely to be preferable if the issue of selecting
> > a device ID can be addressed.
> > 
> > A secondary question is when these `INT` commands should be inserted
> > into the command stream:
> > 
> > * After each batch taken from a single `vits_cq`;
> > * After each scheduling pass;
> > * One active in the command stream at any given time;
> > 
> > The latter should be sufficient: by arranging to insert an `INT` into
> > the stream at the end of any scheduling pass which occurs while there
> > is not a currently outstanding `INT`, we have a sufficient backstop to
> > allow us to refill the ring.
> > 
> > This assumes that there is no particular benefit to keeping the
> > `CWRITER` rolling ahead of the pITS's actual processing.
> 
> I don't understand this assumption. CWRITER will always point to the
> last command in the queue.

Correct, but that might be ahead of where the pITS has actually gotten
to (which we cannot see).

What I am trying to say here is that there is no point in trying to
eagerly complete things (by checking `CREADR`) such that we can write
new things (and hence push `CWRITER` forward) just to keep ahead of the
pITS' processing.
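
(When we do complete, the completion half of a pass ought to be fairly
mechanical, something like the following sketch; all names invented,
including how a shadow entry records the guest-visible `CREADR` value
to publish.)

    /* Sketch: walk the physical queue from pits.last_creadr to the newly
     * read CREADR, completing whichever vits_cq occupied each slot. */
    #define ITS_CMD_SIZE 32            /* each ITS command is 32 bytes */

    static void pits_process_completions(struct pits *pits)
    {
        uint64_t creadr = pits_read_creadr(pits);   /* physical GITS_CREADR */

        while ( pits->last_creadr != creadr )
        {
            unsigned int slot = pits->last_creadr / ITS_CMD_SIZE;
            struct pits_shadow *sh = &pits->shadow[slot];

            /* The slot may hold one of Xen's own INT commands, in which
             * case there is no vits_cq to update. */
            if ( sh->vcq )
            {
                sh->vcq->creadr = sh->vcreadr;   /* value the guest should
                                                    now see in CREADR */
                if ( sh->last_of_batch )
                    sh->vcq->batch_in_flight = false;
            }
            sh->vcq = NULL;

            pits->last_creadr = (pits->last_creadr + ITS_CMD_SIZE) %
                                pits->cmd_queue_bytes;
        }
    }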


> 
> > This is true
> > because the IRS operates on commands in the order they appear in the
> 
> s/IRS/ITS/ ?

Yes.

> 
> > queue, so there is no need to maintain a runway ahead of the ITS
> > processing. (XXX If this is a concern perhaps the INT could be
> > inserted at the head of the final batch of commands in a scheduling
> > pass instead of the tail).
> > 
> > Xen itself should never need to issue an associated `SYNC` command,
> > since the individual guests would need to issue those themselves when
> > they care. The `INT` only serves to allow Xen to enqueue new commands
> > when there is space on the ring; Xen itself has no interest in the
> > actual completion.
> > 
> > ### Locking
> > 
> > It may be preferable to use `atomic_t` types for various fields
> > (e.g. `vits_cq.creadr`) in order to reduce the amount and scope of
> > locking required.
> > 
> > ### Multiple vITS instances in a single guest
> > 
> > As described above each vITS maps to exactly one pITS (while each pITS
> > serves multiple vITSs).
> > 
> > In principle it might be possible to arrange that a vITS can enqueue
> > commands to different pITSs depending on e.g. the device id. However
> > this brings significant additional complexity (what to do with SYNC
> > commands, how to order completion such that one pITS does not block
> > another, book keeping etc).
> > 
> > In addition the introduction of direct interrupt injection in version
> > 4 GICs may imply a vITS per pITS. (XXX???)
> 
> GICv4 will directly mark the LPIs pending in the virtual pending table
> which is per-redistributor (i.e. per-vCPU).
> 
> LPIs will be received by the guest the same way as SPIs, i.e. trap in
> IRQ mode then read ICC_IAR1_EL1 (for GICv3).
> 
> So I don't think that GICv4 will require one vITS per pITS.

OK, that's good.

> > Therefore it is proposed that the restriction that a single vITS maps
> > to one pITS be retained. If a guest requires access to devices
> > associated with multiple pITSs then multiple vITS should be
> > configured.
> 
> Having multiple vITS per domain brings other issues:
>       - How do you know the number of ITS to describe in the device tree at 
> boot?

I'm not sure. I don't think 1 vs N is very different from the question
of 0 vs 1 though, somehow the tools need to know about the pITS setup.

>       - How do you tell the guest that the PCI device is mapped to a
> specific vITS?

Device Tree or IORT, just like on native and just like we'd have to tell
the guest about that mapping even if there was a single vITS.

I think the complexity of having one vITS target multiple pITSs is going
to be quite high in terms of data structures and the amount of
thinking/tracking the scheduler code will have to do, mostly down to
out-of-order completion of things put in the pITS queue.

Ian.

