
Re: [Xen-devel] [PATCH] Provide support for multiple frame buffers in Xen.


  • To: "Tim (Xen.org)" <tim@xxxxxxx>
  • From: Robert Phillips <robert.phillips@xxxxxxxxxx>
  • Date: Wed, 7 Nov 2012 15:36:17 -0500
  • Accept-language: en-US
  • Cc: "xen-devel@xxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxx>
  • Delivery-date: Wed, 07 Nov 2012 20:37:25 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xen.org>
  • Thread-index: Ac24PKA5ayC8XObNQZ+VpcHo9Jvt8AEuwCHA
  • Thread-topic: [Xen-devel] [PATCH] Provide support for multiple frame buffers in Xen.

Hi Tim,

Thank you for your in-depth review.
Concurrent with this email I will submit a revised patch (version 2) containing 
the changes I mention below.

-- rsp

> -----Original Message-----
> From: Tim Deegan [mailto:tim@xxxxxxx]
> Sent: Thursday, November 01, 2012 10:25 AM
> To: Robert Phillips
> Cc: xen-devel@xxxxxxxxxxxxx; Robert Phillips
> Subject: Re: [Xen-devel] [PATCH] Provide support for multiple frame buffers
> in Xen.
>
> Hi,
>
> At 14:15 -0400 on 16 Oct (1350396902), Robert Phillips wrote:
> > From: Robert Phillips <robert.phillips@xxxxxxxxxxxxxxxxxxx>
> >
> > Support is provided for both shadow and hardware assisted paging (HAP)
> modes.
> > This code bookkeeps the set of video frame buffers (vram),
> > detects when the guest has modified any of those buffers and, upon
> request,
> > returns a bitmap of the modified pages.
> >
> > This lets other software components re-paint the portions of the monitor
> (or monitors) that have changed.
> > Each monitor has a frame buffer of some size at some position in guest
> physical memory.
> > The set of frame buffers being tracked can change over time as monitors
> are plugged and unplugged.
>
> Having read through this in detail, it's looking very plausible. :)
> A few style nits:

[ Done

>  - please use the Xen spacings around 'if ( foo )' and 'for ( x; y; z )';
>  - there's a bit of trailing whitespace in the new file, and a few
>    places where indentation seems to have gone a bit wrong;
>  - please make sure the whole thing is linewrapped to <80 characters; and
>  - there's no need for braces around single-line blocks.

] Done

>
> More substantive comments:
>  - I think the dirty_vram.c and dirty_vram.h files belong under mm/
>    rather than under hvm/.  The ``#include "../mm/mm-locks.h"'' is
>    an indicator that this is really MM code.

I have moved dirty_vram.c under mm/
I have moved dirty_vram.h under include/asm-x86 .  In that location it is 
available to modules like hvm.c

>  - Please use xzalloc() rather than xmalloc() + memset(0).  It avoids
>    the sizes of alloc and memset getting out of sync.

Done

>  - The i386 build is dead, so you can drop some #ifdef __i386__ sections.

Done

>  - There really ought to be some limit on how many PTEs you're willing
>    to track.  Otherwise a large guest can consume lots and lots of Xen's
>    memory by making lots of PTEs that point to framebuffers.  That
>    might also lead to performance problems, e.g. in the unshadow
>    function that walks over all those liked lists.

Done, but see my note below under the comment starting "This needs to signal
some sort of error".

>    Also, I think that the memory for the paddr_links ought to come from
>    the shadow pool (i.e. using domain->arch.paging.alloc_page())
>    rather than soaking up otherwise free memory.

Done

>
> A few other detailed comments below...
>
> > +/* Free a paddr_link struct, given address of its predecessor in linked 
> > list
> */
> > +dv_paddr_link_t *
> > +free_paddr_link(struct domain *d,
> > +                dv_paddr_link_t **ppl,
> > +                dv_paddr_link_t *pl)
> > +{
> > +    dv_dirty_vram_t *dirty_vram = d->arch.hvm_domain.dirty_vram;
> > +    dv_paddr_link_t *npl; /* next pl */
> > +
> > +    ASSERT( paging_locked_by_me(d) );
> > +    /* extension mapping? */
> > +    if (ppl) /* yes. free it */
> > +    {
> > +        pl = (*ppl);
>
> This assignment seems like it should always be a noop.  Would it be
> correct to replace it with ASSERT(pl == *ppl)?

Yes,  Done.

>
> > +        (*ppl) = npl = pl->pl_next;
> > +    }
> > +    else  /* main table */
> > +    {
> > +        /* move 2nd mapping to main table.
> > +         * and free 2nd mapping */
> > +        dv_paddr_link_t * spl;
> > +        spl = pl->pl_next;
> > +        if (spl == NULL)
> > +        {
> > +            pl->sl1ma = INVALID_PADDR;
> > +            return pl;
> > +        }
> > +        pl->sl1ma = spl->sl1ma;
> > +        pl->pl_next = spl->pl_next;
> > +        npl = pl; /* reprocess main table entry again */
> > +        pl = spl;
>
> OK, that took a lot of staring at to be sure it's right. :)

I added some comments.  Reading code shouldn't be so taxing.

> I'd be
> inclined to just put all paddr_links in the linked list (and have an
> array of pointers rather than an array of paddr_link_ts).  Is it worth
> having the extra complexity here, and at the callers, to avoid a single
> memory read?

Almost all frame buffer pages have a single mapping so the algorithm's common 
case is met by constructing pl_tab as an array of paddr_links.
That is, the table's paddr_links will rarely point to a chain of extension 
paddr_links.
The code encapsulates the complexity in a single function.  Sorry it required 
so much staring.
Even if all paddr_links were stored in a linked list, the caller would still
have to walk that list, so it would be just as complex.
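
For reference, here is roughly how the structures fit together.  This is a
sketch only: the toy typedefs for paddr_t / INVALID_PADDR are just so the
snippet stands alone, and the real definitions live in dirty_vram.h.

    typedef unsigned long paddr_t;             /* illustration only */
    #define INVALID_PADDR (~(paddr_t)0)        /* illustration only */

    /* One mapping of a frame buffer gfn: the machine address of the
     * shadow L1 entry that maps it, plus a link to further mappings. */
    typedef struct dv_paddr_link {
        paddr_t sl1ma;                         /* INVALID_PADDR if unused */
        struct dv_paddr_link *pl_next;         /* extension chain, usually NULL */
    } dv_paddr_link_t;

    /* One tracked frame buffer.  pl_tab has one entry per pfn in the range,
     * so the common single-mapping case needs no extra allocation; only a
     * page with multiple mappings grows an extension chain. */
    typedef struct dv_range {
        unsigned long begin_pfn, end_pfn;
        dv_paddr_link_t *pl_tab;               /* (end_pfn - begin_pfn) entries */
        int nr_mappings;                       /* total mappings in the range */
        int mappings_hwm;                      /* longest chain seen so far */
    } dv_range_t;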

>
> > +    }
> > +    pl->sl1ma = INVALID_PADDR;
> > +    pl->pl_next = dirty_vram->pl_free;
> > +    dirty_vram->pl_free = pl;
> > +    return npl;
> > +}
> > +
> > +
> > +/* dirty_vram_range_update()
> > + * This is called whenever a level 1 page table entry is modified.
> > + * If the L1PTE is being cleared, the function removes any paddr_links
> > + * that refer to it.
> > + * If the L1PTE is being set to a frame buffer page, a paddr_link is
> > + * created for that page's entry in pl_tab.
> > + * Returns 1 iff entry found and set or cleared.
> > + */
> > +int dirty_vram_range_update(struct domain *d,
> > +                            unsigned long gfn,
> > +                            paddr_t sl1ma,
> > +                            int set)
> > +{
> > +    int effective = 0;
> > +    dv_range_t *range;
> > +
> > +    ASSERT(paging_locked_by_me(d));
> > +    range = dirty_vram_range_find_gfn(d, gfn);
> > +    if ( range )
> > +    {
>
> I think this would be more readable as 'if ( !range ) return 0' here
> rather than indenting most of the function.

Done.

>
> > +        unsigned long i = gfn - range->begin_pfn;
> > +        dv_paddr_link_t *pl = &range->pl_tab[ i ];
> > +        dv_paddr_link_t **ppl = NULL;
> > +        int len = 0;
> > +
> > +        /* find matching entry (pl), if any, and its predecessor
> > +         * in linked list (ppl) */
> > +        while (pl != NULL)
> > +        {
> > +            if (pl->sl1ma == sl1ma || pl->sl1ma == INVALID_PADDR )
> > +                break;
> > +            ppl = &pl->pl_next;
> > +            pl = *ppl;
> > +            len++;
> > +        }
> > +
> > +        if (set)
> > +        {
> > +            /* Did we find sl1ma in either the main table or the linked 
> > list? */
> > +            if (pl == NULL) /* no, so we'll need to alloc a link */
> > +            {
> > +                ASSERT(ppl != NULL);
> > +                /* alloc link and append it to list */
> > +                (*ppl) = pl = alloc_paddr_link(d);
> > +                if (pl == NULL)
> > +                    goto out;
>
> This needs to signal some sort of error.  Otherwise, if we can't add
> this sl1e to the list we'll just silently fail to track it.

I don't know what sort of error we can signal.
The immediate symptom would be that areas of the monitor would not be refreshed.
But, since we're running out of memory, that might be the least of the guest's
woes.

The updated patch actually makes the symptom more likely (though still very 
unlikely) by
putting an arbitrary bound on the length of paddr_link chains.
It handles a rogue process, one that has an arbitrarily large number of 
mappings for
frame buffer pages, by simply not recording the excessive mappings.
If that results in unrefreshed blocks on the monitor, so be it.
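
Concretely, the bound amounts to something like this in
dirty_vram_range_update() (a fragment-style sketch against the hunk quoted
above; the limit's name and value here are placeholders, not what the v2
patch uses):

    #define DV_PADDR_LINK_CHAIN_LIMIT 64       /* placeholder value */

        if ( set )
        {
            /* Did we find sl1ma in the main table or the linked list? */
            if ( pl == NULL )                  /* no: would need a new link */
            {
                ASSERT(ppl != NULL);
                if ( len > DV_PADDR_LINK_CHAIN_LIMIT )
                    goto out;                  /* chain too long: skip it */
                (*ppl) = pl = alloc_paddr_link(d);
                if ( pl == NULL )
                    goto out;
            }
            /* ... rest of the 'set' path as in the quoted hunk ... */
        }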

>
> > +            }
> > +            if ( pl->sl1ma != sl1ma )
> > +            {
>
> ASSERT(pl->sl1ma == INVALID_PADDR) ?

Yes, Done.

>
> > +                pl->sl1ma = sl1ma;
> > +                range->nr_mappings++;
> > +            }
> > +            effective = 1;
> > +            if (len > range->mappings_hwm)
> > +            {
> > +                range->mappings_hwm = len;
> > +#if DEBUG_update_vram_mapping
> > +                gdprintk(XENLOG_DEBUG,
> > +                         "[%lx] set      sl1ma:%lx hwm:%d mappings:%d
> freepages:%d\n",
> > +                         gfn, sl1ma,
> > +                         range->mappings_hwm,
> > +                         range->nr_mappings,
> > +                         d->arch.paging.shadow.free_pages);
> > +#endif
> > +            }
> > +        }
> > +        else /* clear */
> > +        {
> > +            if (pl && pl->sl1ma == sl1ma )
> > +            {
> > +#if DEBUG_update_vram_mapping
> > +                gdprintk(XENLOG_DEBUG,
> > +                         "[%lx] clear    sl1ma:%lx mappings:%d\n",
> > +                         gfn, sl1ma,
> > +                         range->nr_mappings-1);
> > +#endif
> > +                free_paddr_link(d, ppl, pl);
> > +                if ( --range->nr_mappings == 0 )
> > +                {
> > +                    dirty_vram_range_free(d, range);
>
> What's this for?  If the guest unmaps the framebuffer and remaps it (or
> if the shadow PTs of the mappings are temporarily discarded) this will
> stop us from tracking the new mappings until the toolstack asks for the
> bitmap (and then it will be expensive to go and find the mappings).
>

I don't see this happening.  If the guest unmaps the framebuffer, the shadow
code lazily recovers the shadow pages, so tracking will continue until it 
decides a page
is no longer a shadow page.  That is when this code is invoked.

It tears down the mappings to that page, and if some range ends up with no
mappings then the range is useless.
Vram ranges are generated willy-nilly as needed.  This is the only mechanism
for cleaning them up.

> > +                }
> > +                effective = 1;
> > +            }
> > +        }
> > +    }
> > + out:
> > +    return effective;
> > +}
>
> > +/* shadow_track_dirty_vram()
> > + * This is the API called by the guest to determine which pages in the
> range
> > + * from [begin_pfn:begin_pfn+nr) have been dirtied since the last call.
> > + * It creates the domain's dv_dirty_vram on demand.
> > + * It creates ranges on demand when some [begin_pfn:nr) is first
> encountered.
> > + * To collect the dirty bitmask it calls shadow_scan_dirty_flags().
> > + * It copies the dirty bitmask into guest storage.
> > + */
> > +int shadow_track_dirty_vram(struct domain *d,
> > +                            unsigned long begin_pfn,
> > +                            unsigned long nr,
> > +                            XEN_GUEST_HANDLE_64(uint8) guest_dirty_bitmap)
> > +{
> > +    int rc = 0;
> > +    unsigned long end_pfn = begin_pfn + nr;
> > +    int flush_tlb = 0;
> > +    dv_range_t *range;
> > +    struct p2m_domain *p2m = p2m_get_hostp2m(d);
> > +
> > +    if (end_pfn < begin_pfn
> > +            || begin_pfn > p2m->max_mapped_pfn
> > +            || end_pfn >= p2m->max_mapped_pfn)
>
> I know you just copied this from the old definition but the limits seem
> wrong here -- I think it should be:
>
>     if ( end_pfn < begin_pfn || end_pfn > p2m->max_mapped_pfn + 1 )

You're right.  Done.  And commented because it's pretty obscure.

>
>
> > +/* hap_clean_vram_tracking_range()
> > + * For all the pages in the range specified by [begin_pfn,nr),
> > + * note in the dirty bitmap any page that has been marked as read-write,
> > + * which signifies that the page has been dirtied, and reset the page
> > + * to ram_logdirty.
> > + */
> > +void hap_clean_vram_tracking_range(struct domain *d,
> > +                                   unsigned long begin_pfn,
> > +                                   unsigned long nr,
> > +                                   uint8_t *dirty_bitmap)
> > +{
> > +    int i;
> > +    unsigned long pfn;
> > +    dv_dirty_vram_t *dirty_vram = d->arch.hvm_domain.dirty_vram;
> > +    dv_range_t *range;
> > +
> > +    ASSERT(p2m_locked_by_me(p2m_get_hostp2m(d)));
> > +    ASSERT(paging_locked_by_me(d));
> > +
> > +    if ( !dirty_vram )
> > +    {
> > +        gdprintk(XENLOG_DEBUG, "Should only be called while tracking dirty
> vram.\n");
> > +        return;
> > +    }
> > +
> > +    range = dirty_vram_range_find(d, begin_pfn, nr);
> > +    if (!range)
> > +        return;
>
> Oughtn't we to return all 1s in the bitmap here?  If the range isn't
> currently being tracked we should conservatively assume it's all dirty,
> right?

The callers should ensure that the range exists.  This is just a
conservative test.

Anyway, it raises the question of whether a new range should be considered all
dirty or all clean.  Intuitively "all dirty" seems the right answer, but in
practice it works fine as written since the guest is busy updating the
whole frame buffer.

>
> > +
> > +    /* set l1e entries of P2M table to be read-only. */
> > +    /* On first write, it page faults, its entry is changed to read-write,
> > +     * its bit in the dirty bitmap is set, and on retry the write 
> > succeeds. */
> > +    for (i = 0, pfn = range->begin_pfn; pfn < range->end_pfn; i++, pfn++)
> > +    {
> > +        p2m_type_t pt;
> > +        pt = p2m_change_type(d, pfn, p2m_ram_rw, p2m_ram_logdirty);
> > +        if (pt == p2m_ram_rw)
> > +            dirty_bitmap[i >> 3] |= (1 << (i & 7));
> > +    }
> > +    flush_tlb_mask(d->domain_dirty_cpumask);
> > +}
> > +
> > +static void hap_vram_tracking_init(struct domain *d)
> > +{
> > +    paging_log_dirty_init(d, hap_enable_vram_tracking,
> > +                          hap_disable_vram_tracking,
> > +                          NULL);
> > +}
> > +
> > +/* hap_track_dirty_vram()
> > + * Create the domain's dv_dirty_vram struct on demand.
> > + * Create a dirty vram range on demand when some
> [begin_pfn:begin_pfn+nr] is first encountered.
> > + * Collect the guest_dirty bitmask, a bit mask of the dirties vram pages, 
> > by
> > + * calling paging_log_dirty_range().
> > + */
> > +int hap_track_dirty_vram(struct domain *d,
> > +                         unsigned long begin_pfn,
> > +                         unsigned long nr,
> > +                         XEN_GUEST_HANDLE_64(uint8) guest_dirty_bitmap)
> > +{
> > +    long rc = 0;
> > +    dv_dirty_vram_t *dirty_vram;
> > +    int restart_log_dirty = 0;
> > +
> > +    paging_lock(d);
> > +    dirty_vram = d->arch.hvm_domain.dirty_vram;
> > +    if ( nr )
> > +    {
> > +        dv_range_t *range = NULL;
> > +        int size = (nr + BITS_PER_LONG - 1) / BITS_PER_LONG;
> > +        unsigned long dirty_bitmap[size];
>
> All the users of this array cast to (uint8_t *) -- just declare it as
> uint8_t * instead?

Yes, done.  The original code seemed overly fond of using LONGs ...

>
> > +
> > +        /* Already tracking dirty vram? */
> > +        if ( paging_mode_log_dirty(d) && dirty_vram ) /* yes */
> > +        {
> > +            /* Handle the addition of another range */
> > +            range = dirty_vram_range_find(d, begin_pfn, nr);
> > +            if ( !range )
> > +            {
> > +                rc = -ENOMEM;
> > +                if ( !(range = dirty_vram_range_alloc(d, begin_pfn, nr)) )
> > +                    goto param_fail;
> > +                restart_log_dirty = 1;
> > +            }
> > +        }
> > +        /* Just starting to track dirty vram? */
> > +        else if ( !paging_mode_log_dirty(d) && !dirty_vram ) /* yes */
> > +        {
> > +            rc = -ENOMEM;
> > +            if ( !(dirty_vram = dirty_vram_alloc(d)) )
> > +                goto param_fail;
> > +
> > +            if ( !(range = dirty_vram_range_find_or_alloc(d, begin_pfn, 
> > nr)) )
> > +                goto param_fail;
> > +
> > +            restart_log_dirty = 1;
> > +            /* Initialize callbacks for vram tracking */
> > +            hap_vram_tracking_init(d);
> > +        }
> > +        else
> > +        {
> > +            /* Test for invalid combination */
> > +            if ( !paging_mode_log_dirty(d) && dirty_vram )
> > +                rc = -EINVAL;
> > +            else /* logging dirty of all memory, not tracking dirty vram */
> > +                rc = -ENODATA;
> > +            goto param_fail;
> > +        }
> > +
> > +        if (restart_log_dirty)
> > +        {
> > +            /* disable then enable log dirty */
>
> Why disable and re-enable?  The call to paging_log_dirty_range() below
> will reset the p2m entries of the range you care about, so I think all
> you need to do is enable it in the 'just starting' case above.

Done.

>
> And, since you know you're in HAP mode, not in log-dirty mode, and
> already have the paging lock, you can just set
> d->arch.paging.mode |= PG_log_dirty there rather than jumping through
> the paging_log_dirty_enable() path and messing with locks.

No, paging_log_dirty_enable() goes through several layers of functions and
ends up calling hap_enable_vram_tracking(), which does quite a bit of stuff.
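
For context, the enable path as I understand it (a rough trace; the callback
at the end is the one installed by hap_vram_tracking_init() in the hunk
quoted earlier):

    /*
     * paging_log_dirty_enable(d)
     *   -> d->arch.paging.log_dirty.enable_log_dirty(d)
     *        == hap_enable_vram_tracking(d), installed by
     *           hap_vram_tracking_init() via paging_log_dirty_init(),
     *           which does the real work of setting up vram tracking.
     */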

>
> > +            paging_unlock(d);
> > +            if (paging_mode_log_dirty(d))
> > +                paging_log_dirty_disable(d);
> > +
> > +            rc = paging_log_dirty_enable(d);
> > +            paging_lock(d);
> > +            if (rc != 0)
> > +                goto param_fail;
> > +        }
> > +
> > +        paging_unlock(d);
> > +        memset(dirty_bitmap, 0x00, size * BYTES_PER_LONG);
> > +   paging_log_dirty_range(d, begin_pfn, nr, (uint8_t*)dirty_bitmap);
> > +        rc = -EFAULT;
> > +        if ( copy_to_guest(guest_dirty_bitmap,
> > +                           (uint8_t*)dirty_bitmap,
> > +                           size * BYTES_PER_LONG) == 0 )
> > +        {
> > +            rc = 0;
> > +        }
> > +    }
> > +    else
> > +    {
> > +        /* If zero pages specified while already tracking dirty vram
> > +         * then stop tracking */
> > +        if ( paging_mode_log_dirty(d) && dirty_vram ) {
> > +            paging_unlock(d);
> > +            rc = paging_log_dirty_disable(d);
> > +            paging_lock(d);
> > +            dirty_vram_free(d);
>
> This is different from the shadow case -- there, IIUC, you just ignore
> requests where nr == 0; here, you tear down all vram tracking.
> Can you choose one of those, and document it in the public header?

Done.  I changed the shadow code to tear down if nr == 0.
(The HAP code seemed to expect that behavior and the shadow code didn't
seem to care.)
And updated the public header.

>
> > +        } else /* benign no-op */
> > +        {
> > +            rc = 0;
> > +        }
> > +        paging_unlock(d);
> > +    }
> > +
> > +    return rc;
>
>
> > +/* paging_mark_dirty_hap()
> > + * Make a hap page writeable and mark it as dirty.
> > + * This done atomically under the p2m and paging locks to avoid leaving
> > + * a window where the page might be modified without being marked as
> dirty.
> > + */
>
> I'm perplexed by this -- AFAICT it's either not necessary (because all
> log-dirty read/clean ops are done with the domain paused) or not sufficient
> (because although the bitmap and the PTE are updated under the p2m lock,
> the actual dirtying of the page happens at some other time).  Can you
> spell out for me exactly what this is protecting against?

The comment overstated the problem, so I've toned it down
from "without being marked" to "without being counted".

paging_mark_dirty_hap() is in the page fault path and has two steps:

(1) It calls p2m_change_type() to re-mark some pfn as writeable (i.e. 
p2m_ram_rw),
which is a no-op if the pfn is already writeable.
This must be done under the p2m_lock.

(2) If the pfn was previously read-only (i.e. p2m_ram_logdirty) then it bumps
two dirty counts and marks the page as dirty.
This must be done under the paging lock.

As an invariant, the dirty counts should be precisely the number of pages
made writeable.

With this patch, step (2) is also done under the p2m_lock.  This avoids having
a window between steps (1) and (2), in which hap_track_dirty_vram() might
get in and would break the invariant by clearing the range's dirty count.

It may be that this invariant is not particularly useful, that things will just 
work out.
But I do wonder if, without it, we'll have situations where the dirty counts 
will indicate
dirty pages when there are none, or no dirty pages when there are some.
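
In outline, paging_mark_dirty_hap() is shaped roughly like this (a sketch
only; the helpers that bump the range's and the domain's dirty counts are
elided, and the lock/helper names are approximate):

    /* Called from the HAP page-fault path when a tracked pfn is written. */
    void paging_mark_dirty_hap(struct domain *d,
                               unsigned long pfn, unsigned long mfn)
    {
        struct p2m_domain *p2m = p2m_get_hostp2m(d);
        p2m_type_t pt;

        p2m_lock(p2m);
        /* (1) make the pfn writeable; a no-op if it already is */
        pt = p2m_change_type(d, pfn, p2m_ram_logdirty, p2m_ram_rw);
        if ( pt == p2m_ram_logdirty )   /* (2) first write since last scan */
        {
            paging_lock(d);
            /* bump the range's and the domain's dirty counts here ... */
            paging_mark_dirty(d, mfn);  /* ... and mark the page as dirty */
            paging_unlock(d);
        }
        p2m_unlock(p2m);
    }

Both (1) and (2) happen under the p2m lock, so hap_track_dirty_vram() cannot
slip in between them and reset the range's count.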

>
> > +typedef int (*hash_pfn_callback_t)(struct vcpu *v,
> > +                                   mfn_t smfn,
> > +                                   unsigned long begin_pfn,
> > +                                   unsigned long end_pfn,
> > +                                   int *removed);
> > +
> > +static int hash_pfn_foreach(struct vcpu *v,
> > +                            unsigned int callback_mask,
> > +                            hash_pfn_callback_t callbacks[],
> > +                            unsigned long begin_pfn,
> > +                            unsigned long end_pfn)
> > +/* Walk the hash table looking at the types of the entries and
> > + * calling the appropriate callback function for each entry.
> > + * The mask determines which shadow types we call back for, and the
> array
> > + * of callbacks tells us which function to call.
> > + * Any callback may return non-zero to let us skip the rest of the scan.
> > + *
> > + * WARNING: Callbacks MUST NOT add or remove hash entries unless
> they
> > + * then return non-zero to terminate the scan. */
>
> This code duplication is a bit much.  I think you should recast the
> existing hash_foreach() function to take a pointer as its fourth
> argument instead of an MFN, and then make the existing callers just cast
> their MFN argument as a pointer.   The you can use the same function,
> passing a pointer to a struct { begin, end, removed }.
>
> Please make the changes to hash_foreach() in a separate patch from the
> dirty_vram stuff.

Yes, I'd be happy to submit a patch as you suggest after we're done with this 
one.
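
For the record, my reading of the suggestion is roughly this (illustrative
only; the names below are hypothetical):

    /* Callbacks take an opaque pointer instead of a second mfn ... */
    typedef int (*hash_callback_t)(struct vcpu *v, mfn_t smfn, void *arg);

    /* ... so existing callers pass their mfn cast to a pointer, while the
     * dirty-vram walker passes a small context struct like this one. */
    struct sh_vram_walk_ctl {
        unsigned long begin_pfn;
        unsigned long end_pfn;
        int removed;
    };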

>
> > @@ -1211,12 +1164,14 @@ static int shadow_set_l1e(struct vcpu *v,
> >                  shadow_l1e_remove_flags(new_sl1e, _PAGE_RW);
> >                  /* fall through */
> >              case 0:
> > -                shadow_vram_get_l1e(new_sl1e, sl1e, sl1mfn, d);
> > +                shadow_vram_fix_l1e(old_sl1e, new_sl1e, sl1e, sl1mfn, d);
> >                  break;
> >              }
> >          }
> >      }
> >
> > +    shadow_vram_fix_l1e(old_sl1e, new_sl1e, sl1e, sl1mfn, d);
>
> Why is this being called twice here?

Stupidity.  I've eliminated the first of the two.

>
> Cheers,
>
> Tim.

Thanks again.
-- rsp

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

