[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH v3 5/9] xen/vpci: add handlers to map the BARs



On Fri, May 19, 2017 at 09:21:56AM -0600, Jan Beulich wrote:
> >>> On 27.04.17 at 16:35, <roger.pau@xxxxxxxxxx> wrote:
> > +static int vpci_modify_bars(struct pci_dev *pdev, const bool map)
> > +{
> > +    struct vpci_header *header = &pdev->vpci->header;
> > +    unsigned int i;
> > +    int rc = 0;
> > +
> > +    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
> > +    {
> > +        paddr_t gaddr = map ? header->bars[i].gaddr
> > +                            : header->bars[i].mapped_addr;
> > +        paddr_t paddr = header->bars[i].paddr;
> > +
> > +        if ( header->bars[i].type != VPCI_BAR_MEM &&
> > +             header->bars[i].type != VPCI_BAR_MEM64_LO )
> > +            continue;
> > +
> > +        rc = modify_mmio(pdev->domain, _gfn(PFN_DOWN(gaddr)),
> > +                         _mfn(PFN_DOWN(paddr)), 
> > PFN_UP(header->bars[i].size),
> 
> The PFN_UP() indicates a problem: For sub-page BARs you can't
> blindly map/unmap them without taking into consideration other
> devices sharing the same page.

I'm not sure I follow, the start address of BARs is always aligned to
a 4KB boundary, so there's no chance of the same page being used by
two different BARs at the same time.

The size is indeed not aligned to 4KB, but I don't see how this can
cause collisions with other BARs unless the domain is actively trying
to make the BARs overlap, in which case there's not much Xen can do.

> > +                         map);
> > +        if ( rc )
> > +            break;
> > +
> > +        header->bars[i].mapped_addr = map ? gaddr : 0;
> > +    }
> > +
> > +    return rc;
> > +}
> 
> Shouldn't this function somewhere honor the unset flags?

Right, I've added a check to make sure the BAR is positioned before
trying to map it into the domain p2m.

> > +static int vpci_cmd_read(struct pci_dev *pdev, unsigned int reg,
> > +                         union vpci_val *val, void *data)
> > +{
> > +    struct vpci_header *header = data;
> > +
> > +    val->word = header->command;
> 
> Rather than reading back and storing the value in the write handler,
> I'd recommending doing an actual read here.

OK.

> > +static int vpci_cmd_write(struct pci_dev *pdev, unsigned int reg,
> > +                          union vpci_val val, void *data)
> > +{
> > +    struct vpci_header *header = data;
> > +    uint16_t new_cmd, saved_cmd;
> > +    uint8_t seg = pdev->seg, bus = pdev->bus;
> > +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> > +    int rc;
> > +
> > +    new_cmd = val.word;
> > +    saved_cmd = header->command;
> > +
> > +    if ( !((new_cmd ^ saved_cmd) & PCI_COMMAND_MEMORY) )
> > +        goto out;
> > +
> > +    /* Memory space access change. */
> > +    rc = vpci_modify_bars(pdev, new_cmd & PCI_COMMAND_MEMORY);
> > +    if ( rc )
> > +    {
> > +        dprintk(XENLOG_ERR,
> > +                "%04x:%02x:%02x.%u:unable to %smap BARs: %d\n",
> > +                seg, bus, slot, func,
> > +                new_cmd & PCI_COMMAND_MEMORY ? "" : "un", rc);
> > +        return rc;
> 
> I guess you can guess the question already: What is the bare
> hardware equivalent of this failure return?

Yes, this is already fixed since write handlers simply return void.
The hw equivalent would be to ignore the write AFAICT (ie: memory
decoding will not be enabled).

Are you fine with the dprintk or would you also like me to remove
that? (IMHO it's helpful for debugging).

> > +    }
> > +
> > + out:
> 
> Please try to avoid goto-s and labels for other than error handling
> (and even then only when code would otherwise end up pretty
> convoluted).

Done.

> > +static int vpci_bar_read(struct pci_dev *pdev, unsigned int reg,
> > +                         union vpci_val *val, void *data)
> > +{
> > +    struct vpci_bar *bar = data;
> 
> const
> 
> > +    bool hi = false;
> > +
> > +    ASSERT(bar->type == VPCI_BAR_MEM || bar->type == VPCI_BAR_MEM64_LO ||
> > +           bar->type == VPCI_BAR_MEM64_HI);
> > +
> > +    if ( bar->type == VPCI_BAR_MEM64_HI )
> > +    {
> > +        ASSERT(reg - PCI_BASE_ADDRESS_0 > 0);
> 
> reg > PCI_BASE_ADDRESS_0

Fixed.

> > +        bar--;
> > +        hi = true;
> > +    }
> > +
> > +    if ( bar->sizing )
> > +        val->double_word = ~(bar->size - 1) >> (hi ? 32 : 0);
> 
> There's also a comment further down - this is producing undefined
> behavior on 32-bits arches.

I've changed size to be a uint64_t.

> > +static int vpci_bar_write(struct pci_dev *pdev, unsigned int reg,
> > +                          union vpci_val val, void *data)
> > +{
> > +    struct vpci_bar *bar = data;
> > +    uint32_t wdata = val.double_word;
> > +    bool hi = false, unset = false;
> > +
> > +    ASSERT(bar->type == VPCI_BAR_MEM || bar->type == VPCI_BAR_MEM64_LO ||
> > +           bar->type == VPCI_BAR_MEM64_HI);
> > +
> > +    if ( wdata == GENMASK(31, 0) )
> 
> I'm afraid this again doesn't match real hardware behavior: As the
> low bits are r/o, writes with them having any value, but all other
> bits being 1 should have the same effect. I notice that while I had
> fixed this for the ROM BAR in Linux'es pciback, I should have also
> fixed this for ordinary ones.

I've changed this to:

    switch ( bar->type )
    {
    case VPCI_BAR_MEM:
        size_mask = GENMASK(31, 12);
        break;
    case VPCI_BAR_MEM64_LO:
        size_mask = GENMASK(31, 26);
        break;
    case VPCI_BAR_MEM64_HI:
        size_mask = GENMASK(31, 0);
        break;
    default:
        ASSERT_UNREACHABLE();
        break;
    }

    if ( (wdata & size_mask) == size_mask )
    {
        ...

And removed the ASSERT just above (since it's now handled in the
switch itself).

> > +    {
> > +        /* Next reads from this register are going to return the BAR size. 
> > */
> > +        bar->sizing = true;
> > +        return 0;
> > +    }
> > +
> > +    /* End previous sizing cycle if any. */
> > +    bar->sizing = false;
> > +
> > +    unset = bar->unset;
> > +    if ( unset )
> > +        bar->unset = false;
> > +
> > +    if ( bar->type == VPCI_BAR_MEM64_HI )
> > +    {
> > +        ASSERT(reg - PCI_BASE_ADDRESS_0 > 0);
> > +        bar--;
> > +        hi = true;
> > +    }
> > +
> > +    /* Update the relevant part of the BAR address. */
> > +    bar->gaddr &= hi ? ~GENMASK(63, 32) : ~GENMASK(31, 0);
> > +    wdata &= hi ? GENMASK(31, 0) : PCI_BASE_ADDRESS_MEM_MASK;
> 
> Perhaps easier to grok as
> 
>     if ( hi )
>         wdata &= PCI_BASE_ADDRESS_MEM_MASK;

I've done that (with the condition reversed).

> However, considering the dual use below, I'd prefer if you wrote
> back the value you read to the low 4 bits. They're _supposed_ to
> be r/o, yes, but anyway.

Done.

> 
> > +    bar->gaddr |= (uint64_t)wdata << (hi ? 32 : 0);
> > +
> > +    if ( unset )
> > +    {
> > +        bar->paddr = bar->gaddr;
> 
> So this deals with first time setting of the BAR by Dom0. If Dom0
> later decides to move BARs around, how do you guarantee things
> to continue to work fine if you allow paddr and gaddr to go out of
> sync? Often the reason to do re-assignments is because the OS
> recognized address conflicts. Or it needs to make room for SR-IOV
> BARs.

I've removed the unset check, so that every BAR position change done
by Dom0 is also applied to the hardware, instead of just changing
Dom0's p2m.

> > +        pci_conf_write16(pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
> > +                         PCI_FUNC(pdev->devfn), reg, wdata);
> 
> pci_conf_write32()

Ups, thanks.

> 
> > +    }
> > +
> > +    ASSERT(IS_ALIGNED(bar->gaddr, PAGE_SIZE));
> 
> Urgh.

Removed.

> > +static int vpci_init_bars(struct pci_dev *pdev)
> > +{
> > +    uint8_t seg = pdev->seg, bus = pdev->bus;
> > +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> > +    uint8_t header_type;
> > +    unsigned int i, num_bars;
> > +    struct vpci_header *header = &pdev->vpci->header;
> > +    struct vpci_bar *bars = header->bars;
> > +    int rc;
> > +
> > +    header_type = pci_conf_read8(seg, bus, slot, func, PCI_HEADER_TYPE) & 
> > 0x7f;
> > +    if ( header_type == PCI_HEADER_TYPE_NORMAL )
> > +        num_bars = 6;
> > +    else if ( header_type == PCI_HEADER_TYPE_BRIDGE )
> > +        num_bars = 2;
> > +    else
> > +        return -ENOSYS;
> 
> -EOPNOTSUPP
> 
> > +    /* Setup a handler for the control register. */
> > +    header->command = pci_conf_read16(seg, bus, slot, func, PCI_COMMAND);
> 
> As the code says, the register is the Command Register, so your
> comment shouldn't say "control".

My mistake.

> > +    rc = xen_vpci_add_register(pdev, vpci_cmd_read, vpci_cmd_write,
> > +                               PCI_COMMAND, 2, header);
> > +    if ( rc )
> > +    {
> > +        dprintk(XENLOG_ERR,
> > +                "%04x:%02x:%02x.%u: failed to add handler register %#x: 
> > %d\n",
> > +                seg, bus, slot, func, PCI_COMMAND, rc);
> > +        return rc;
> > +    }
> > +
> > +    for ( i = 0; i < num_bars; i++ )
> > +    {
> > +        uint8_t reg = PCI_BASE_ADDRESS_0 + i * 4;
> > +        uint32_t val = pci_conf_read32(seg, bus, slot, func, reg);
> > +        uint64_t addr, size;
> > +        unsigned int index;
> > +
> > +        if ( i && bars[i - 1].type == VPCI_BAR_MEM64_LO )
> > +        {
> > +            bars[i].type = VPCI_BAR_MEM64_HI;
> > +            bars[i].unset = bars[i - 1].unset;
> > +            continue;
> 
> Neither here nor below you install a handler for this upper half.

Ugh, good catch.

> > +        }
> > +        else if ( (val & PCI_BASE_ADDRESS_SPACE) == 
> > PCI_BASE_ADDRESS_SPACE_IO )
> > +        {
> > +            bars[i].type = VPCI_BAR_IO;
> > +            continue;
> > +        }
> > +        else if ( (val & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
> 
> Pointless "else" (twice).

Removed.

> > +                  PCI_BASE_ADDRESS_MEM_TYPE_64 )
> > +            bars[i].type = VPCI_BAR_MEM64_LO;
> > +        else
> > +            bars[i].type = VPCI_BAR_MEM;
> > +
> > +        /* Size the BAR and map it. */
> > +        index = i;
> > +        rc = pci_size_bar(seg, bus, slot, func, PCI_BASE_ADDRESS_0, 
> > num_bars,
> > +                          &index, &addr, &size);
> > +        if ( rc )
> > +        {
> > +            dprintk(XENLOG_ERR,
> > +                    "%04x:%02x:%02x.%u: unable to size BAR#%u: %d\n",
> > +                    seg, bus, slot, func, i, rc);
> > +            return rc;
> > +        }
> > +
> > +        if ( size == 0 )
> > +        {
> > +            bars[i].type = VPCI_BAR_EMPTY;
> > +            continue;
> > +        }
> > +
> > +        if ( (bars[i].type == VPCI_BAR_MEM && addr == GENMASK(31, 12)) ||
> > +             addr == GENMASK(63, 26) )
> 
> Where is this 26 coming from?
> 
> Perhaps
> 
>     if ( addr == GENMASK(bars[i].type == VPCI_BAR_MEM ? 31 : 63, 12) )

I'm checking the memory decode bit here instead in order to figure out
if the BAR is not positioned.

> ? Albeit I'm unconvinced GENMASK() is useful to be used here anyway
> (see also below).

Right, regardless of the specific usage above, what would you
recommend regarding the usage of GENMASK?

Julien suggested introducing GENMASK_ULL. Should I go that route, or
introduce something locally for vPCI?

> > +        {
> > +            /* BAR is not positioned. */
> 
> I can't find anything in the standard saying that all-ones upper
> address bits indicate an unassigned BAR. As long as the memory
> decode bit is off, all BARs are to be considered unassigned afaik.
> Furthermore you can't possibly read e.g. 0xfffff000 from a
> 32-bit BAR covering more than 4k.

OK, so I've now changed this to mark the BAR as unset if the memory
decode bit in the command register is not set.

> > +            bars[i].unset = true;
> > +            ASSERT(is_hardware_domain(pdev->domain));
> > +            ASSERT(!(header->command & PCI_COMMAND_MEMORY));
> 
> You're asserting guest controlled state here (even if it's Dom0).
> 
> > +        }
> > +
> > +        ASSERT(IS_ALIGNED(addr, PAGE_SIZE));
> 
> Urgh (again).

Removed both of the above.

> > --- a/xen/include/xen/vpci.h
> > +++ b/xen/include/xen/vpci.h
> > @@ -50,6 +50,34 @@ int xen_vpci_write(unsigned int seg, unsigned int bus, 
> > unsigned int devfn,
> >  struct vpci {
> >      /* Root pointer for the tree of vPCI handlers. */
> >      struct rb_root handlers;
> > +
> > +#ifdef __XEN__
> > +    /* Hide the rest of the vpci struct from the user-space test harness. 
> > */
> > +    struct vpci_header {
> > +        /* Cached value of the command register. */
> > +        uint16_t command;
> > +        /* Information about the PCI BARs of this device. */
> > +        struct vpci_bar {
> > +            enum {
> > +                VPCI_BAR_EMPTY,
> > +                VPCI_BAR_IO,
> > +                VPCI_BAR_MEM,
> 
> MEM32?

Changed.

> > +                VPCI_BAR_MEM64_LO,
> > +                VPCI_BAR_MEM64_HI,
> > +            } type;
> > +            /* Hardware address. */
> > +            paddr_t paddr;
> > +            /* Guest address where the BAR should be mapped. */
> > +            paddr_t gaddr;
> > +            /* Current guest address where the BAR is mapped. */
> > +            paddr_t mapped_addr;
> 
> Why do you need to track both "should be" and "is" addresses? Also
> I think all three would more naturally be frame numbers.

I think I can use a single field to store the address.

> > +            size_t size;
> 
> Is this enough for e.g. ARM32 (remember this is a common
> header)?

No, I've changed it to uint64_t.

> > +            unsigned int attributes:4;
> 
> ???

Changed this to "bool prefetchable" instead.

> > +            bool sizing;
> > +            bool unset;
> 
> Isn't this redundant with e.g. gaddr (or as per above gfn) being
> INVALID_PADDR (INVALID_GFN)?

Yes, now removed.

> > +        } bars[6];
> 
> What about the ROM and SR-IOV ones?

I've implemented support for the expansion ROM BAR (which I still need
to figure out how to test), but I would like to defer SR-IOV for later
because it involves a non-trivial amount of work, and with this series
one can already boot a PVH Dom0 (minus SR-IOV of course).

Thanks, Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.