[Xen-devel] Re: [PATCH 02/15] [swiotlb] Add swiotlb_engine structure for tracking multiple software IO TLBs.
.. snip ..
> You can move the comments to a kerneldoc section for proper
> documentation.
>
> /**
>  * struct swiotlb_engine - short desc...
>  * @name: Name of the engine...
> etc

<nods>

.. snip ..
> > +	char *end;
>
> Isn't this still global to swiotlb, not specific to the backend impl.?

Yes and no. Without the start/end, "is_swiotlb_buffer" would not be able
to determine whether the passed-in address is within the SWIOTLB buffer.

> > +	/*
> > +	 * The number of IO TLB blocks (in groups of 64) between start and
> > +	 * end. This is command-line adjustable via setup_io_tlb_npages.
> > +	 */
> > +	unsigned long nslabs;
>
> Same here.

That one can be put back (make it part of lib/swiotlb.c).

> > +
> > +	/*
> > +	 * When the IOMMU overflows we return a fallback buffer.
> > +	 * This sets the size.
> > +	 */
> > +	unsigned long overflow;
> > +
> > +	void *overflow_buffer;
>
> And here.

Ditto.

.. snip ..
> > +	 * Is the DMA (Bus) address within our bounce buffer (start and end).
> > +	 */
> > +	int (*is_swiotlb_buffer)(struct swiotlb_engine *, dma_addr_t dev_addr,
> > +				 phys_addr_t phys);
> > +
>
> Why is this implementation specific?

In the current implementation, both backends use the physical address
and do a simple check:

	return paddr >= virt_to_phys(io_tlb_start) &&
	       paddr < virt_to_phys(io_tlb_end);

That would work for virtualized environments where a PCI device is
passed in, too. Unfortunately the problem arises when we provide a
channel of communication with another domain and end up doing DMA on
behalf of another guest.

The short description of the problem: a page of memory is shared with
another domain. While the page is shared, the mapping in our domain is
correct in one direction (bus->physical), but the other direction
(virt->physical->bus) is incorrect for the duration of the sharing.
Hence we need to verify that the page is local to our domain, and for
that we need the bus address, so that we can verify that

	addr == physical->bus(bus->physical(addr))

where addr is the bus address (dma_addr_t).
If it is not local (shared with another domain) we MUST NOT consider it
a SWIOTLB buffer, as that can lead to panics and possible corruption.
The trick here is that the phys->virt address can fall within the
SWIOTLB buffer for pages that are shared with another domain, so we need
the DMA address to do an extra check.

The long description of the problem:

You are the domain doing some DMA on behalf of another domain. The
simple example is that you are servicing a block device for the other
guests. One way to implement this is to present a one-page ring buffer
where both domains move the producer and consumer indexes around. Once
you get a request (READ/WRITE), you use the virtualized channels to
"share" that page into your domain. For this you have a buffer (2MB or
bigger) wherein, for pages that are shared in to you, you over-write the
phys->bus mapping. That means the phys->bus translation is altered for
the duration of this request being outstanding. Once it is completed,
the phys->bus translation is restored.

Here is a little diagram of what happens when a page is shared (and
let's assume a situation where virt #1 == virt #2, which means that
phys #1 == phys #2):

  (domain 2) virt#1 -> phys#1 ---\
                                  +-> bus #1
  (domain 3) virt#2 -> phys#2 ---/

(phys#1 points to bus #1, and phys#2 points to bus #1 too).

The converse of the above picture is not true:

           /---> phys #1 -> virt #1  (domain 2)
  bus#1 --+
           \---> phys #2 -> virt #2  (domain 3)

phys #1 != phys #2, and hence virt #1 != virt #2.

When a page is not shared:

  (domain 2) virt #1 -> phys #1 --> bus #1
  (domain 3) virt #2 -> phys #2 --> bus #2

  bus #1 -> phys #1 -> virt #1  (domain 2)
  bus #2 -> phys #2 -> virt #2  (domain 3)

Here bus #1 != bus #2, but phys #1 could be the same as phys #2 (since
these are just PFNs), and virt #1 == virt #2. The reason is that each
domain has its own memory layout where the memory starts at pfn 0, not
at some higher number.
So each domain sees the physical addresses identically, but the bus
addresses MUST point to different areas (except when sharing), otherwise
one domain would over-write another domain, ouch. Furthermore, when a
domain is allocated, the pages for the domain are not guaranteed to be
linearly contiguous, so we can't guarantee that phys == bus.

So to guard against the situation in which phys #1 -> virt comes out
with an address that looks to be within our SWIOTLB buffer, we need to
do the extra check:

	addr == physical->bus(bus->physical(addr))

where addr is the bus address. For scenarios where this does not hold
(the page belongs to another domain), that page is not in the SWIOTLB
(even though the virtual and physical addresses point into it).

> > +	/*
> > +	 * Is the DMA (Bus) address reachable by the PCI device?
> > +	 */
> > +	bool (*dma_capable)(struct device *, dma_addr_t, phys_addr_t, size_t);

I mentioned in the previous explanation that when a domain is allocated,
the pages are not guaranteed to be linearly contiguous. For bare metal
that is not a concern, and 'dma_capable' just checks the device DMA mask
against the bus address. For a virtualized environment we do need to
check that the pages are linearly contiguous for the size of the
request. For that we need the physical address, so we can iterate over
the pages, doing the phys->bus#1 translation and checking that the
(phys+1)->bus#2 translation satisfies bus#2 == bus#1 + 1.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel