
Re: [Xen-devel] [RFC/PATCH] Improve speed of mapping guest memory into Dom0



On 14/11/12 13:08, David Vrabel wrote:
Mats,

Your patch has been white-space damaged and will not apply.  You should
use git send-email which will do the right thing.  I've also CC'd Konrad
who is the maintainer for the Xen parts of the kernel.
Thanks for the advice. Will use git send-email for V2 of this patch.
On 14/11/12 11:13, Mats Petersson wrote:
[a long, rambling commit message?]
The text as-is isn't really suitable for a commit message (too much
irrelevant stuff) and there is no suitable subject line.

I have also found that the munmap() call used to unmap the guest memory
from Dom0 is about 35% slower in 3.7 kernel than in the 2.6 kernel (3.8M
cycles vs 2.8M cycles).
This performance reduction only occurs with 32-bit guests, as Xen then
traps-and-emulates both halves of the PTE write.

I think this could be made quicker by using a
direct write of zero rather than the compare exchange operation that is
currently used [which traps into Xen, performs the compare & exchange] -
This is something I noticed but never got around to producing a patch.
How about this (uncompiled!) patch?

--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1146,8 +1146,16 @@ again:
                                     page->index > details->last_index))
                                        continue;
                        }
-                       ptent = ptep_get_and_clear_full(mm, addr, pte,
-                                                       tlb->fullmm);
+                       /*
+                        * No need for the expensive atomic get and
+                        * clear for anonymous mappings as the dirty
+                        * and young bits are not used.
+                        */
+                       if (PageAnon(page))
+                               pte_clear(mm, addr, pte);
+                       else
+                               ptent = ptep_get_and_clear_full(mm, addr, pte,
+                                                               tlb->fullmm);
                        tlb_remove_tlb_entry(tlb, pte, addr);
                        if (unlikely(!page))
                                continue;
I came up with something very similar, but found that page is sometimes NULL, which causes problems. if (page && PageAnon(page)) will probably work... Trying that now.
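For reference, here is the same hunk with the NULL check folded in (untested; only the condition differs from the version above):

                        /*
                         * No need for the expensive atomic get and
                         * clear for anonymous mappings as the dirty
                         * and young bits are not used.  Note that
                         * page can be NULL here, so test it first.
                         */
                        if (page && PageAnon(page))
                                pte_clear(mm, addr, pte);
                        else
                                ptent = ptep_get_and_clear_full(mm, addr, pte,
                                                                tlb->fullmm);
                        tlb_remove_tlb_entry(tlb, pte, addr);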

Now for the patch itself.  On the whole, the approach looks good and the
real-word performance improvements are nice.  Specific comments inline.

--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -2542,3 +2561,77 @@ out:
      return err;
  }
  EXPORT_SYMBOL_GPL(xen_remap_domain_mfn_range);
+
+/* like xen_remap_domain_mfn_range, but does a list of mfn's, rather
+ * than the, for xen, quite useless, consecutive pages.
+ */
/*
  * Like xen_remap_domain_mfn_range(), but more efficiently handles MFNs
  * that are not contiguous (which is common for a domain's memory).
  */
Will update.
+int xen_remap_domain_mfn_list(struct vm_area_struct *vma,
+                  unsigned long addr,
+                  unsigned long *mfn, int nr,
+                  int *err_ptr,
+                  pgprot_t prot, unsigned domid)
xen_remap_domain_mfn_array() ?  It's not taking a list.
Yes, better name. Although perhaps following Ian Campbell's suggestion and changing xen_remap_domain_mfn_range to "do the right thing" is even better. I will have a quick look at how difficult that may be; I suspect it is simpler than I first thought, as there is only one call to this function in privcmd.c.

The arm/xen/enlighten.c will probably also need fixing, I suppose...
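For illustration, the array-style interface discussed above might end up looking something like this (the name follows David's suggestion; the exact parameters are an assumption, not part of this patch):

/*
 * Hypothetical prototype only: map the 'nr' frames given in the 'mfn'
 * array into 'vma' starting at 'addr', storing a per-frame error code
 * in 'err_ptr'.
 */
int xen_remap_domain_mfn_array(struct vm_area_struct *vma,
                               unsigned long addr,
                               unsigned long *mfn, int nr,
                               int *err_ptr,
                               pgprot_t prot, unsigned domid);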

+{
+    struct remap_list_data rmd;
+    struct mmu_update mmu_update[REMAP_BATCH_SIZE];
This is surprisingly large (256 bytes) but I note that the existing
xen_remap_domain_mfn_range() does the same thing so I guess it's ok.
I would actually like to increase it, but that would probably require allocation. One HV call per 32 pages still seems quite frequent.
+    int batch;
+    int done;
+    unsigned long range;
+    int err = 0;
+
+    if (xen_feature(XENFEAT_auto_translated_physmap))
+        return -EINVAL;
+
+    prot = __pgprot(pgprot_val(prot) | _PAGE_IOMAP);
+
+    BUG_ON(!((vma->vm_flags & (VM_PFNMAP | VM_IO)) == (VM_PFNMAP | VM_IO)));
+
+    rmd.mfn = mfn;
+    rmd.prot = prot;
+
+    while (nr) {
+        batch = min(REMAP_BATCH_SIZE, nr);
+        range = (unsigned long)batch << PAGE_SHIFT;
+
+        rmd.mmu_update = mmu_update;
+        err = apply_to_page_range(vma->vm_mm, addr, range,
+                      remap_area_mfn_list_pte_fn, &rmd);
+        if (err)
+        {
+            printk("xen_remap_domain_mfn_list: apply_to_range: err=%d\n", err);
Stray printk?
Yes.
+            goto out;
+        }
+
+        err = HYPERVISOR_mmu_update(mmu_update, batch, &done, domid);
+        if (err < 0)
+        {
+            int i;
+            /* TODO: We should remove this printk later */
+            printk("xen_remap_domain_mfn_list: mmu_update: err=%d, done=%d, batch=%d\n", err, done, batch);
Yes, you should...

+            err_ptr[done] = err;
+
+            /* now do the remaining part of this batch */
+            for(i = done+1; i < batch; i++)
+            {
+                int tmp_err = HYPERVISOR_mmu_update(&mmu_update[i], 1, NULL, domid);
+                if (tmp_err < 0)
+                {
+                    err_ptr[i] = tmp_err;
+                }
+            }
There's no need to fall back to doing it page-by-page here. You can
still batch the remainder.
Yes, I agree.
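Something along these lines would keep the remainder batched; a rough, untested sketch of the whole failure path (replacing both the err_ptr[done] assignment and the per-page loop above), reusing the variables already in the function:

            int next = done;

            /* record the entry that failed, then skip over it */
            err_ptr[next++] = err;

            /*
             * Sketch only: retry the rest of the batch in as few
             * hypercalls as possible, recording the error for each
             * failing entry, rather than one update per call.
             */
            while (next < batch) {
                int sub_done = 0;
                int sub_err = HYPERVISOR_mmu_update(&mmu_update[next],
                                                    batch - next,
                                                    &sub_done, domid);
                if (sub_err >= 0)
                    break;          /* the remainder went through */
                err_ptr[next + sub_done] = sub_err;
                next += sub_done + 1;
            }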
+
+            goto out;
+        }
+
+        nr -= batch;
+        addr += range;
+        err_ptr += batch;
+    }
+
+    err = 0;
+out:
+
+    xen_flush_tlb_all();
Probably not that important anymore since we would now do far fewer TLB
flushes, but this TLB flush is only needed if the PTEs being updated
were already present -- if they were all clear then the TLB flush can be
omitted entirely.
The 2.6 kernel actually had a check BUG_ON(!pte_none(pte)) [or something like that], but the 3.x kernel seems to have "lost" that. Either way, one flush per, say, 1024 pages isn't measurable in my benchmarks.
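If it ever becomes worth doing, the pte_fn callback could simply record whether it replaced a live entry and the flush be made conditional on that. A rough sketch, where any_present is a hypothetical extra field in struct remap_list_data, not something in this patch:

    /* In remap_area_mfn_list_pte_fn(), before installing the new PTE: */
    if (!pte_none(*ptep))
        rmd->any_present = true;        /* hypothetical field */

    /* ... and at the end of xen_remap_domain_mfn_list(): */
    if (rmd.any_present)
        xen_flush_tlb_all();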

+
+    return err;
+}
+EXPORT_SYMBOL_GPL(xen_remap_domain_mfn_list);
diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
index 8adb9cc..b39a7b7 100644
--- a/drivers/xen/privcmd.c
+++ b/drivers/xen/privcmd.c
@@ -151,6 +151,41 @@ static int traverse_pages(unsigned nelem, size_t size,
      return ret;
  }

+/*
+ * Similar to traverse_pages, but use each page as a "block" of
+ * data to be processed as one unit.
+ */
+static int traverse_pages_block(unsigned nelem, size_t size,
+                struct list_head *pos,
+                int (*fn)(void *data, int nr, void *state),
+                void *state)
+{
+    void *pagedata;
+    unsigned pageidx;
+    int ret = 0;
+
+    BUG_ON(size > PAGE_SIZE);
+
+    pageidx = PAGE_SIZE;
+    pagedata = NULL;    /* hush, gcc */
What was gcc upset about?  I don't see anything it could get confused about.
This comes from the original traverse_pages code, which does this. I will see if it compiles without warnings if I remove the initialization.
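For what it's worth, once the loop hands fn a whole page of elements at a time, pagedata is always assigned before use, so the initialisation should indeed be removable. One possible shape of the rest of the function (a sketch of the idea, not the patch as posted):

static int traverse_pages_block(unsigned nelem, size_t size,
                struct list_head *pos,
                int (*fn)(void *data, int nr, void *state),
                void *state)
{
    void *pagedata;
    int ret = 0;

    BUG_ON(size > PAGE_SIZE);

    while (nelem) {
        int nr = PAGE_SIZE / size;
        struct page *page;

        if (nr > nelem)
            nr = nelem;
        pos = pos->next;
        page = list_entry(pos, struct page, lru);
        pagedata = page_address(page);    /* assigned every iteration */
        ret = (*fn)(pagedata, nr, state);
        if (ret)
            break;
        nelem -= nr;
    }

    return ret;
}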

@@ -260,17 +295,17 @@ struct mmap_batch_state {
      xen_pfn_t __user *user_mfn;
  };

-static int mmap_batch_fn(void *data, void *state)
+static int mmap_batch_fn(void *data, int nr, void *state)
  {
      xen_pfn_t *mfnp = data;
+
      struct mmap_batch_state *st = state;
      int ret;

-    ret = xen_remap_domain_mfn_range(st->vma, st->va & PAGE_MASK, *mfnp, 1,
-                     st->vma->vm_page_prot, st->domain);
+    BUG_ON(nr < 0);
Is this BUG_ON() useful?
I think very bad things will happen when nr is less than zero, but I guess they will be pretty obvious... so perhaps we can just let it die from the consequences. The remap function does "while (nr)", min(REMAP_BATCH_SIZE, nr) and various other things that will definitely not work as expected if nr is negative. I thought it much easier to detect when something has gone wrong if you assert early, though.
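If the BUG_ON() feels too heavy-handed, a warn-and-bail would catch the same programming error without taking the machine down (a sketch, not part of the patch):

    /* Sketch: refuse a bogus count instead of BUG_ON(nr < 0). */
    if (WARN_ON_ONCE(nr < 0))
        return -EINVAL;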

--
Mats

David




