Re: [Xen-devel] [PATCH RFC] Persistent grant maps for xen blk drivers
On Tue, Oct 23, 2012 at 06:07:36PM +0200, Roger Pau Monné wrote: > On 22/10/12 15:47, Konrad Rzeszutek Wilk wrote: > > On Thu, Oct 18, 2012 at 01:22:01PM +0200, Roger Pau Monne wrote: > >> This patch implements persistent grants for the xen-blk{front,back} > >> mechanism. The effect of this change is to reduce the number of unmap > >> operations performed, since they cause a (costly) TLB shootdown. This > >> allows the I/O performance to scale better when a large number of VMs > >> are performing I/O. > >> > >> Previously, the blkfront driver was supplied a bvec[] from the request > >> queue. This was granted to dom0; dom0 performed the I/O and wrote > >> directly into the grant-mapped memory and unmapped it; blkfront then > >> removed foreign access for that grant. The cost of unmapping scales > >> badly with the number of CPUs in Dom0. An experiment showed that when > >> Dom0 has 24 VCPUs, and guests are performing parallel I/O to a > >> ramdisk, the IPIs from performing unmaps are a bottleneck at 5 guests > >> (at which point 650,000 IOPS are being performed in total). If more > >> than 5 guests are used, the performance declines. By 10 guests, only > >> 400,000 IOPS are being performed. > >> > >> This patch improves performance by only unmapping when the connection > >> between blkfront and back is broken. > >> > >> On startup blkfront notifies blkback that it is using persistent > >> grants, and blkback will do the same. If blkback is not capable of > >> persistent mapping, blkfront will still use the same grants, since this > >> is compatible with the previous protocol, and simplifies the code in > >> blkfront. > >> > >> To perform a read, in persistent mode, blkfront uses a separate pool > >> of pages that it maps to dom0. When a request comes in, blkfront > >> transmutes the request so that blkback will write into one of these > >> free pages. Blkback keeps note of which grefs it has already > >> mapped. When a new ring request comes to blkback, it looks to see if > >> it has already mapped that page. If so, it will not map it again. If > >> the page hasn't been previously mapped, it is mapped now, and a record > >> is kept of this mapping. Blkback proceeds as usual. When blkfront is > >> notified that blkback has completed a request, it memcpy's from the > >> shared memory into the bvec supplied. A record that the {gref, page} > >> tuple is mapped, and not in flight, is kept. > >> > >> Writes are similar, except that the memcpy is performed from the > >> supplied bvecs into the shared pages, before the request is put onto > >> the ring. > >> > >> Blkback stores a mapping of grefs=>{page mapped to by gref} in > >> a red-black tree. As the grefs are not known a priori, and provide no > >> guarantees on their ordering, we have to perform a search > >> through this tree to find the page, for every gref we receive. This > >> operation takes O(log n) time in the worst case. > > > > Might want to mention how blkfront stores it as well. It looks > > to be using 'llist' instead of 'list'? Any particular reason? > > Since we are just pushing and popping grant references, I went for what I > think is the simplest one, a singly linked list (list is a doubly linked > list). Oliver in the previous version was using something similar, but > custom made. I think it's best to use the data structures provided by > the kernel itself.
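For illustration, the push/pop pattern described here looks like this in miniature — a sketch using the kernel's <linux/llist.h> primitives, not code from the patch; struct demo_grant and the helper names are hypothetical stand-ins for blkfront's struct grant and its open-coded list handling:

	#include <linux/llist.h>
	#include <xen/grant_table.h>

	/* Illustrative stand-in for blkfront's struct grant. */
	struct demo_grant {
		grant_ref_t gref;
		unsigned long pfn;
		struct llist_node node;
	};

	static LLIST_HEAD(free_grants);

	/* Returning a grant to the pool is a single atomic push. */
	static void demo_put_grant(struct demo_grant *g)
	{
		llist_add(&g->node, &free_grants);
	}

	/*
	 * Taking a grant pops from the same end; returns NULL when the
	 * pool is empty. llist_del_first() requires a single consumer,
	 * which blkfront satisfies because requests are queued under the
	 * per-device io_lock.
	 */
	static struct demo_grant *demo_get_grant(void)
	{
		struct llist_node *n = llist_del_first(&free_grants);

		return n ? llist_entry(n, struct demo_grant, node) : NULL;
	}

A doubly linked list_head would work too, but it carries a prev pointer per node and provides ordering operations that a pure LIFO free pool never uses, so the singly linked llist is the smaller fit.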
> > > > >> > >> The maximum number of grants that blkback will persistently map is > >> currently set to RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, to > >> prevent a malicious guest from attempting a DoS, by supplying fresh > >> grefs, causing the Dom0 kernel to map excessively. If a guest > >> is using persistent grants and exceeds the maximum number of grants to > >> map persistently, the newly passed grefs will be mapped and unmapped. > >> Using this approach, we can have requests that mix persistent and > >> non-persistent grants, and we need to handle them correctly. > >> This allows us to set the maximum number of persistent grants to a > >> lower value than RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, although > >> doing so will lead to unpredictable performance. > >> > >> In writing this patch, the question arises as to whether the additional > >> cost of performing memcpys in the guest (to/from the pool of granted > >> pages) outweighs the gains of not performing TLB shootdowns. The answer > >> to that question is `no'. There appears to be very little, if any, > >> additional cost to the guest of using persistent grants. There is > >> perhaps a small saving, from the reduced number of hypercalls > >> performed in granting, and ending foreign access. > >> > >> Signed-off-by: Oliver Chick <oliver.chick@xxxxxxxxxx> > >> Signed-off-by: Roger Pau Monne <roger.pau@xxxxxxxxxx> > >> Cc: <konrad.wilk@xxxxxxxxxx> > >> Cc: <linux-kernel@xxxxxxxxxxxxxxx> > >> --- > >> Benchmarks showing the impact of this patch on blk performance can be > >> found at: > >> > >> http://xenbits.xensource.com/people/royger/persistent_grants/ > >> --- > >> drivers/block/xen-blkback/blkback.c | 279 > >> +++++++++++++++++++++++++++++++--- > >> drivers/block/xen-blkback/common.h | 17 ++ > >> drivers/block/xen-blkback/xenbus.c | 16 ++- > >> drivers/block/xen-blkfront.c | 183 ++++++++++++++++++++---- > >> 4 files changed, 442 insertions(+), 53 deletions(-) > >> > >> diff --git a/drivers/block/xen-blkback/blkback.c > >> b/drivers/block/xen-blkback/blkback.c > >> index c6decb9..2b982b2 100644 > >> --- a/drivers/block/xen-blkback/blkback.c > >> +++ b/drivers/block/xen-blkback/blkback.c > >> @@ -78,6 +78,7 @@ struct pending_req { > >> unsigned short operation; > >> int status; > >> struct list_head free_list; > >> + unsigned int unmap_seg[BLKIF_MAX_SEGMENTS_PER_REQUEST]; > >> }; > >> > >> #define BLKBACK_INVALID_HANDLE (~0) > >> @@ -98,6 +99,30 @@ struct xen_blkbk { > >> static struct xen_blkbk *blkbk; > >> > >> /* > >> + * Maximum number of grant pages that can be mapped in blkback. > >> + * BLKIF_MAX_SEGMENTS_PER_REQUEST * RING_SIZE is the maximum number of > >> + * pages that blkback will persistently map. > >> + */ > >> +static inline unsigned int max_mapped_grant_pages(enum blkif_protocol > >> protocol) > >> +{ > >> + switch (protocol) { > >> + case BLKIF_PROTOCOL_NATIVE: > >> + return __CONST_RING_SIZE(blkif, PAGE_SIZE) * > >> + BLKIF_MAX_SEGMENTS_PER_REQUEST; > >> + case BLKIF_PROTOCOL_X86_32: > >> + return __CONST_RING_SIZE(blkif_x86_32, PAGE_SIZE) * > >> + BLKIF_MAX_SEGMENTS_PER_REQUEST; > >> + case BLKIF_PROTOCOL_X86_64: > >> + return __CONST_RING_SIZE(blkif_x86_64, PAGE_SIZE) * > >> + BLKIF_MAX_SEGMENTS_PER_REQUEST; > > > > Could you include in the comments what the size (bytes) you expect it to be? > > If that would require you re-doing some tests - then don't worry - but > > I figured you have some notes scribbled away that have the exact values > > down.
> > As far as I know and remember (I've checked the ring size in the past), > all ring types have a size of 32, BLKIF_MAX_SEGMENTS_PER_REQUEST is > always 11, and sizeof(struct persistent_gnt) is 48, so that's 32 * 11 * > 48 = 16896 bytes. I will add a comment with this calculation. > > > > >> + default: > >> + BUG(); > >> + } > >> + return 0; > >> +} > >> + > >> + > >> +/* > >> * Little helpful macro to figure out the index and virtual address of the > >> * pending_pages[..]. For each 'pending_req' we have have up to > >> * BLKIF_MAX_SEGMENTS_PER_REQUEST (11) pages. The seg would be from 0 > >> through > >> @@ -128,6 +153,57 @@ static int dispatch_rw_block_io(struct xen_blkif > >> *blkif, > >> static void make_response(struct xen_blkif *blkif, u64 id, > >> unsigned short op, int st); > >> > >> +#define foreach_grant(pos, rbtree, node) \ > >> + for ((pos) = container_of(rb_first((rbtree)), typeof(*(pos)), node); > >> \ > >> + &(pos)->node != NULL; \ > >> + (pos) = container_of(rb_next(&(pos)->node), typeof(*(pos)), > >> node)) > >> + > >> + > >> +static void add_persistent_gnt(struct rb_root *root, > >> + struct persistent_gnt *persistent_gnt) > >> +{ > >> + struct rb_node **new = &(root->rb_node), *parent = NULL; > >> + struct persistent_gnt *this; > >> + > >> + /* Figure out where to put new node */ > >> + while (*new) { > >> + this = container_of(*new, struct persistent_gnt, node); > >> + > >> + parent = *new; > >> + if (persistent_gnt->gnt < this->gnt) > >> + new = &((*new)->rb_left); > >> + else if (persistent_gnt->gnt > this->gnt) > >> + new = &((*new)->rb_right); > >> + else { > >> + pr_alert(DRV_PFX " trying to add a gref that's > >> already in the tree\n"); > >> + BUG(); > >> + } > >> + } > >> + > >> + /* Add new node and rebalance tree. */ > >> + rb_link_node(&(persistent_gnt->node), parent, new); > >> + rb_insert_color(&(persistent_gnt->node), root); > >> +} > >> + > >> +static struct persistent_gnt *get_persistent_gnt(struct rb_root *root, > >> + grant_ref_t gref) > >> +{ > >> + struct persistent_gnt *data; > >> + struct rb_node *node = root->rb_node; > >> + > >> + while (node) { > >> + data = container_of(node, struct persistent_gnt, node); > >> + > >> + if (gref < data->gnt) > >> + node = node->rb_left; > >> + else if (gref > data->gnt) > >> + node = node->rb_right; > >> + else > >> + return data; > >> + } > >> + return NULL; > >> +} > >> + > >> /* > >> * Retrieve from the 'pending_reqs' a free pending_req structure to be > >> used. > >> */ > >> @@ -274,6 +350,11 @@ int xen_blkif_schedule(void *arg) > >> { > >> struct xen_blkif *blkif = arg; > >> struct xen_vbd *vbd = &blkif->vbd; > >> + struct gnttab_unmap_grant_ref unmap[BLKIF_MAX_SEGMENTS_PER_REQUEST]; > >> + struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST]; > >> + struct persistent_gnt *persistent_gnt; > >> + int ret = 0; > >> + int segs_to_unmap = 0; > >> > >> xen_blkif_get(blkif); > >> > >> @@ -301,6 +382,32 @@ int xen_blkif_schedule(void *arg) > >> print_stats(blkif); > >> } > >> > >> + /* Free all persistent grant pages */ > >> + foreach_grant(persistent_gnt, &blkif->persistent_gnts, node) { > > > > Just for sanity - you did check this with blkfronts that did not have > > persistent grants enabled, right? > > Yes, it doesn't crash, but looking at foreach_grant it seems like it > should. I've added a check before trying to iterate over the tree. 
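The "works by accident" part is visible in the macro itself: on an empty tree rb_first() returns NULL, container_of() then subtracts the offset of the node member, and the &(pos)->node != NULL test adds the same offset straight back, so the loop terminates only through that round trip of pointer arithmetic (which is formally undefined behaviour). A sketch of the kind of guard described above, assuming it is a plain emptiness test around the iteration:

	/* Nothing to tear down if no grants were ever persistently
	 * mapped (e.g. the frontend never negotiated the feature). */
	if (!RB_EMPTY_ROOT(&blkif->persistent_gnts)) {
		foreach_grant(persistent_gnt, &blkif->persistent_gnts, node) {
			/* ... unmap and free each persistent_gnt, batching
			 * up to BLKIF_MAX_SEGMENTS_PER_REQUEST unmaps per
			 * gnttab_unmap_refs() call, as in the hunk below ... */
		}
	}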
> > > > >> + BUG_ON(persistent_gnt->handle == BLKBACK_INVALID_HANDLE); > >> + gnttab_set_unmap_op(&unmap[segs_to_unmap], > >> + (unsigned long) pfn_to_kaddr(page_to_pfn( > >> + persistent_gnt->page)), > >> + GNTMAP_host_map, > >> + persistent_gnt->handle); > >> + > >> + pages[segs_to_unmap] = persistent_gnt->page; > >> + rb_erase(&persistent_gnt->node, &blkif->persistent_gnts); > >> + kfree(persistent_gnt); > >> + blkif->persistent_gnt_c--; > >> + > >> + if (++segs_to_unmap == BLKIF_MAX_SEGMENTS_PER_REQUEST || > >> + !rb_next(&persistent_gnt->node)) { > >> + ret = gnttab_unmap_refs(unmap, NULL, pages, > >> + segs_to_unmap); > >> + BUG_ON(ret); > >> + segs_to_unmap = 0; > >> + } > >> + } > >> + > >> + BUG_ON(blkif->persistent_gnt_c != 0); > >> + BUG_ON(!RB_EMPTY_ROOT(&blkif->persistent_gnts)); > >> + > >> if (log_stats) > >> print_stats(blkif); > >> > >> @@ -327,6 +434,8 @@ static void xen_blkbk_unmap(struct pending_req *req) > >> int ret; > >> > >> for (i = 0; i < req->nr_pages; i++) { > >> + if (!req->unmap_seg[i]) > >> + continue; > > > > Perhaps there should be a #define for that array.. > > Do you mean something like: > > #define unmap(req, i) req->unmap_seg[i] I was thinking that you just check for req->unmap_seg[i] to have a non-zero value. But since that array is just used as a check to see whether the functionality is enabled (or not), you might want to declare the right values so: #define UNMAP_SG_ON 1 #define UNMAP_SG_OFF 0 or so. > > >> handle = pending_handle(req, i); > >> if (handle == BLKBACK_INVALID_HANDLE) > >> continue; > >> @@ -343,12 +452,26 @@ static void xen_blkbk_unmap(struct pending_req *req) > >> > >> static int xen_blkbk_map(struct blkif_request *req, > >> struct pending_req *pending_req, > >> - struct seg_buf seg[]) > >> + struct seg_buf seg[], > >> + struct page *pages[]) > >> { > >> struct gnttab_map_grant_ref map[BLKIF_MAX_SEGMENTS_PER_REQUEST]; > >> - int i; > >> + struct persistent_gnt > >> *persistent_gnts[BLKIF_MAX_SEGMENTS_PER_REQUEST]; > >> + struct page *pages_to_gnt[BLKIF_MAX_SEGMENTS_PER_REQUEST]; > >> + struct persistent_gnt *persistent_gnt = NULL; > >> + struct xen_blkif *blkif = pending_req->blkif; > >> + phys_addr_t addr = 0; > >> + int i, j; > >> + int new_map; > > > > Just use a bool for this.
> > Done > > >> int nseg = req->u.rw.nr_segments; > >> + int segs_to_map = 0; > >> int ret = 0; > >> + int use_persistent_gnts; > >> + > >> + use_persistent_gnts = (blkif->vbd.feature_gnt_persistent); > >> + > >> + BUG_ON(blkif->persistent_gnt_c > > >> + max_mapped_grant_pages(pending_req->blkif->blk_protocol)); > >> > >> /* > >> * Fill out preq.nr_sects with proper amount of sectors, and setup > >> @@ -358,36 +481,141 @@ static int xen_blkbk_map(struct blkif_request *req, > >> for (i = 0; i < nseg; i++) { > >> uint32_t flags; > >> > >> - flags = GNTMAP_host_map; > >> - if (pending_req->operation != BLKIF_OP_READ) > >> - flags |= GNTMAP_readonly; > >> - gnttab_set_map_op(&map[i], vaddr(pending_req, i), flags, > >> - req->u.rw.seg[i].gref, > >> - pending_req->blkif->domid); > >> + if (use_persistent_gnts) > >> + persistent_gnt = get_persistent_gnt( > >> + &blkif->persistent_gnts, > >> + req->u.rw.seg[i].gref); > >> + > >> + if (persistent_gnt) { > >> + /* > >> + * We are using persistent grants and > >> + * the grant is already mapped > >> + */ > >> + new_map = 0; > >> + } else if (use_persistent_gnts && > >> + blkif->persistent_gnt_c < > >> + max_mapped_grant_pages(blkif->blk_protocol)) { > >> + /* > >> + * We are using persistent grants, the grant is > >> + * not mapped but we have room for it > >> + */ > >> + new_map = 1; > >> + persistent_gnt = kzalloc( > >> + sizeof(struct persistent_gnt), > >> + GFP_KERNEL); > >> + if (!persistent_gnt) > >> + return -ENOMEM; > >> + persistent_gnt->page = alloc_page(GFP_KERNEL); > >> + if (!persistent_gnt->page) { > >> + kfree(persistent_gnt); > >> + return -ENOMEM; > >> + } > >> + persistent_gnt->gnt = req->u.rw.seg[i].gref; > >> + > >> + pages_to_gnt[segs_to_map] = > >> + persistent_gnt->page; > >> + addr = (unsigned long) pfn_to_kaddr( > >> + page_to_pfn(persistent_gnt->page)); > >> + > >> + add_persistent_gnt(&blkif->persistent_gnts, > >> + persistent_gnt); > >> + blkif->persistent_gnt_c++; > >> + pr_debug(DRV_PFX " grant %u added to the tree of > >> persistent grants, using %u/%u\n", > >> + persistent_gnt->gnt, > >> blkif->persistent_gnt_c, > >> + > >> max_mapped_grant_pages(blkif->blk_protocol)); > >> + } else { > >> + /* > >> + * We are either using persistent grants and > >> + * hit the maximum limit of grants mapped, > >> + * or we are not using persistent grants. 
> >> + */ > >> + if (use_persistent_gnts && > >> + !blkif->vbd.overflow_max_grants) { > >> + blkif->vbd.overflow_max_grants = 1; > >> + pr_alert(DRV_PFX " domain %u, device %#x is > >> using maximum number of persistent grants\n", > >> + blkif->domid, blkif->vbd.handle); > >> + } > >> + new_map = 1; > >> + pages[i] = blkbk->pending_page(pending_req, i); > >> + addr = vaddr(pending_req, i); > >> + pages_to_gnt[segs_to_map] = > >> + blkbk->pending_page(pending_req, i); > >> + } > >> + > >> + if (persistent_gnt) { > >> + pages[i] = persistent_gnt->page; > >> + persistent_gnts[i] = persistent_gnt; > >> + } else { > >> + persistent_gnts[i] = NULL; > >> + } > >> + > >> + if (new_map) { > >> + flags = GNTMAP_host_map; > >> + if (!persistent_gnt && > >> + (pending_req->operation != BLKIF_OP_READ)) > >> + flags |= GNTMAP_readonly; > >> + gnttab_set_map_op(&map[segs_to_map++], addr, > >> + flags, req->u.rw.seg[i].gref, > >> + blkif->domid); > >> + } > >> } > >> > >> - ret = gnttab_map_refs(map, NULL, &blkbk->pending_page(pending_req, > >> 0), nseg); > >> - BUG_ON(ret); > >> + if (segs_to_map) { > >> + ret = gnttab_map_refs(map, NULL, pages_to_gnt, segs_to_map); > >> + BUG_ON(ret); > >> + } > >> > >> /* > >> * Now swizzle the MFN in our domain with the MFN from the other > >> domain > >> * so that when we access vaddr(pending_req,i) it has the contents of > >> * the page from the other domain. > >> */ > >> - for (i = 0; i < nseg; i++) { > >> - if (unlikely(map[i].status != 0)) { > >> - pr_debug(DRV_PFX "invalid buffer -- could not remap > >> it\n"); > >> - map[i].handle = BLKBACK_INVALID_HANDLE; > >> - ret |= 1; > >> + for (i = 0, j = 0; i < nseg; i++) { > >> + if (!persistent_gnts[i] || !persistent_gnts[i]->handle) { > >> + /* This is a newly mapped grant */ > >> + BUG_ON(j >= segs_to_map); > >> + if (unlikely(map[j].status != 0)) { > > > > I am not seeing j being incremented anywhere? Should it? > > Yes, it should be incremented, but not here. See the comment below to > see what I've changed. > > > > >> + pr_debug(DRV_PFX "invalid buffer -- could > >> not remap it\n"); > >> + map[j].handle = BLKBACK_INVALID_HANDLE; > >> + ret |= 1; > >> + if (persistent_gnts[i]) { > >> + rb_erase(&persistent_gnts[i]->node, > >> + &blkif->persistent_gnts); > >> + blkif->persistent_gnt_c--; > >> + kfree(persistent_gnts[i]); > >> + persistent_gnts[i] = NULL; > >> + } > >> + } > >> + } > >> + if (persistent_gnts[i]) { > >> + if (!persistent_gnts[i]->handle) { > >> + /* > >> + * If this is a new persistent grant > >> + * save the handler > >> + */ > >> + persistent_gnts[i]->handle = map[j].handle; > >> + persistent_gnts[i]->dev_bus_addr = > >> + map[j++].dev_bus_addr; > >> + } > >> + pending_handle(pending_req, i) = > >> + persistent_gnts[i]->handle; > >> + pending_req->unmap_seg[i] = 0; > > > > Could we have a #define for that? > > Sure. > > >> + > >> + if (ret) > >> + continue; > > This should be > > if (ret) { > j++; > continue; > } <nods> > > >> + > >> + seg[i].buf = persistent_gnts[i]->dev_bus_addr | > >> + (req->u.rw.seg[i].first_sect << 9); > >> + } else { > >> + pending_handle(pending_req, i) = map[j].handle; > >> + pending_req->unmap_seg[i] = 1; > > > > And here as well? > > Done. 
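Folding both of the points above into one place — named values for unmap_seg[] and the missing increment of j — the post-map fixup loop would come out roughly as below. This is a sketch, not the final patch: UNMAP_SG_ON/UNMAP_SG_OFF are the names proposed above, and the extra j++ sits on the only path where a map[] slot was consumed for the segment but the normal-flow map[j++] is skipped:

	#define UNMAP_SG_OFF 0	/* persistent grant: keep it mapped */
	#define UNMAP_SG_ON  1	/* freshly mapped: unmap on completion */

	for (i = 0, j = 0; i < nseg; i++) {
		/* ... map[j].status check for newly mapped grants ... */
		if (persistent_gnts[i]) {
			if (!persistent_gnts[i]->handle) {
				/* New persistent grant: save the handle;
				 * this consumes map[] slot j. */
				persistent_gnts[i]->handle = map[j].handle;
				persistent_gnts[i]->dev_bus_addr =
					map[j++].dev_bus_addr;
			}
			pending_handle(pending_req, i) =
				persistent_gnts[i]->handle;
			pending_req->unmap_seg[i] = UNMAP_SG_OFF;
			if (ret)
				continue;
			seg[i].buf = persistent_gnts[i]->dev_bus_addr |
				(req->u.rw.seg[i].first_sect << 9);
		} else {
			pending_handle(pending_req, i) = map[j].handle;
			pending_req->unmap_seg[i] = UNMAP_SG_ON;
			if (ret) {
				j++;	/* slot j was consumed even though
					 * this segment is being failed */
				continue;
			}
			seg[i].buf = map[j++].dev_bus_addr |
				(req->u.rw.seg[i].first_sect << 9);
		}
	}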
> > >> + > >> + if (ret) > >> + continue; > >> + > >> + seg[i].buf = map[j++].dev_bus_addr | > >> + (req->u.rw.seg[i].first_sect << 9); > >> } > >> - > >> - pending_handle(pending_req, i) = map[i].handle; > >> - > >> - if (ret) > >> - continue; > >> - > >> - seg[i].buf = map[i].dev_bus_addr | > >> - (req->u.rw.seg[i].first_sect << 9); > >> } > >> return ret; > >> } > >> @@ -590,6 +818,7 @@ static int dispatch_rw_block_io(struct xen_blkif > >> *blkif, > >> int operation; > >> struct blk_plug plug; > >> bool drain = false; > >> + struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST]; > >> > >> switch (req->operation) { > >> case BLKIF_OP_READ: > >> @@ -676,7 +905,7 @@ static int dispatch_rw_block_io(struct xen_blkif > >> *blkif, > >> * the hypercall to unmap the grants - that is all done in > >> * xen_blkbk_unmap. > >> */ > >> - if (xen_blkbk_map(req, pending_req, seg)) > >> + if (xen_blkbk_map(req, pending_req, seg, pages)) > >> goto fail_flush; > >> > >> /* > >> @@ -688,7 +917,7 @@ static int dispatch_rw_block_io(struct xen_blkif > >> *blkif, > >> for (i = 0; i < nseg; i++) { > >> while ((bio == NULL) || > >> (bio_add_page(bio, > >> - blkbk->pending_page(pending_req, i), > >> + pages[i], > >> seg[i].nsec << 9, > >> seg[i].buf & ~PAGE_MASK) == 0)) { > >> > >> diff --git a/drivers/block/xen-blkback/common.h > >> b/drivers/block/xen-blkback/common.h > >> index 9ad3b5e..6c08ee9 100644 > >> --- a/drivers/block/xen-blkback/common.h > >> +++ b/drivers/block/xen-blkback/common.h > >> @@ -34,6 +34,7 @@ > >> #include <linux/vmalloc.h> > >> #include <linux/wait.h> > >> #include <linux/io.h> > >> +#include <linux/rbtree.h> > >> #include <asm/setup.h> > >> #include <asm/pgalloc.h> > >> #include <asm/hypervisor.h> > >> @@ -160,10 +161,22 @@ struct xen_vbd { > >> sector_t size; > >> bool flush_support; > >> bool discard_secure; > >> + > >> + unsigned int feature_gnt_persistent:1; > >> + unsigned int overflow_max_grants:1; > > > > I think the v3.7-rc1 has this structure changed just a tiny bit, so you > > might want to rebase it on top of that. > > I've done the rebase on top of your linux-next branch, commit > ad502612c173fff239250c9fe9bdfaaef70b9901. Thx > > > > >> }; > >> > >> struct backend_info; > >> > >> + > >> +struct persistent_gnt { > >> + struct page *page; > >> + grant_ref_t gnt; > >> + grant_handle_t handle; > >> + uint64_t dev_bus_addr; > >> + struct rb_node node; > >> +}; > >> + > >> struct xen_blkif { > >> /* Unique identifier for this interface. */ > >> domid_t domid; > >> @@ -190,6 +203,10 @@ struct xen_blkif { > >> struct task_struct *xenblkd; > >> unsigned int waiting_reqs; > >> > >> + /* frontend feature information */ > > > > Huh? 
> > Changed it to: > > /* tree to store persistent grants */ > > >> + struct rb_root persistent_gnts; > >> + unsigned int persistent_gnt_c; > >> + > >> /* statistics */ > >> unsigned long st_print; > >> int st_rd_req; > >> diff --git a/drivers/block/xen-blkback/xenbus.c > >> b/drivers/block/xen-blkback/xenbus.c > >> index 4f66171..9f88b4e 100644 > >> --- a/drivers/block/xen-blkback/xenbus.c > >> +++ b/drivers/block/xen-blkback/xenbus.c > >> @@ -118,6 +118,7 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid) > >> atomic_set(&blkif->drain, 0); > >> blkif->st_print = jiffies; > >> init_waitqueue_head(&blkif->waiting_to_free); > >> + blkif->persistent_gnts.rb_node = NULL; > >> > >> return blkif; > >> } > >> @@ -721,6 +722,7 @@ static int connect_ring(struct backend_info *be) > >> struct xenbus_device *dev = be->dev; > >> unsigned long ring_ref; > >> unsigned int evtchn; > >> + unsigned int pers_grants; > >> char protocol[64] = ""; > >> int err; > >> > >> @@ -750,8 +752,18 @@ static int connect_ring(struct backend_info *be) > >> xenbus_dev_fatal(dev, err, "unknown fe protocol %s", > >> protocol); > >> return -1; > >> } > >> - pr_info(DRV_PFX "ring-ref %ld, event-channel %d, protocol %d (%s)\n", > >> - ring_ref, evtchn, be->blkif->blk_protocol, protocol); > >> + err = xenbus_gather(XBT_NIL, dev->otherend, > >> + "feature-persistent-grants", "%u", > >> + &pers_grants, NULL); > >> + if (err) > >> + pers_grants = 0; > >> + > >> + be->blkif->vbd.feature_gnt_persistent = pers_grants; > >> + be->blkif->vbd.overflow_max_grants = 0; > >> + > >> + pr_info(DRV_PFX "ring-ref %ld, event-channel %d, protocol %d (%s) > >> persistent %d\n", > >> + ring_ref, evtchn, be->blkif->blk_protocol, protocol, > >> + pers_grants); > > > > Can you make that a string? So it is > > pers_grants ? "persistent grants" : "" > > > > instead of a zero or one value pls? > > Yes, done. > > >> > >> /* Map the shared frame, irq etc. */ > >> err = xen_blkif_map(be->blkif, ring_ref, evtchn); > >> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c > >> index 2c2d2e5..206d422 100644 > >> --- a/drivers/block/xen-blkfront.c > >> +++ b/drivers/block/xen-blkfront.c > >> @@ -44,6 +44,7 @@ > >> #include <linux/mutex.h> > >> #include <linux/scatterlist.h> > >> #include <linux/bitmap.h> > >> +#include <linux/llist.h> > >> > >> #include <xen/xen.h> > >> #include <xen/xenbus.h> > >> @@ -64,10 +65,17 @@ enum blkif_state { > >> BLKIF_STATE_SUSPENDED, > >> }; > >> > >> +struct grant { > >> + grant_ref_t gref; > >> + unsigned long pfn; > >> + struct llist_node node; > >> +}; > >> + > >> struct blk_shadow { > >> struct blkif_request req; > >> struct request *request; > >> unsigned long frame[BLKIF_MAX_SEGMENTS_PER_REQUEST]; > >> + struct grant *grants_used[BLKIF_MAX_SEGMENTS_PER_REQUEST]; > >> }; > >> > >> static DEFINE_MUTEX(blkfront_mutex); > >> @@ -97,6 +105,8 @@ struct blkfront_info > >> struct work_struct work; > >> struct gnttab_free_callback callback; > >> struct blk_shadow shadow[BLK_RING_SIZE]; > >> + struct llist_head persistent_gnts; > >> + unsigned int persistent_gnts_c; > >> unsigned long shadow_free; > >> unsigned int feature_flush; > >> unsigned int flush_op; > >> @@ -287,21 +297,36 @@ static int blkif_queue_request(struct request *req) > >> unsigned long id; > >> unsigned int fsect, lsect; > >> int i, ref; > >> + > >> + /* > >> + * Used to store if we are able to queue the request by just using > >> + * existing persistent grants (0), or if we have to get new grants, > > > > What does the zero mean?
> > Frankly, no idea, I guess it was in Oliver's patch and I failed to spot it. > > >> + * as there are not sufficiently many free. > >> + */ > >> + int new_persistent_gnts; > > > > I think this can be a bool? > > I agree. > > >> grant_ref_t gref_head; > >> + struct page *granted_page; > >> + struct grant *gnt_list_entry = NULL; > >> struct scatterlist *sg; > >> > >> if (unlikely(info->connected != BLKIF_STATE_CONNECTED)) > >> return 1; > >> > >> - if (gnttab_alloc_grant_references( > >> - BLKIF_MAX_SEGMENTS_PER_REQUEST, &gref_head) < 0) { > >> - gnttab_request_free_callback( > >> - &info->callback, > >> - blkif_restart_queue_callback, > >> - info, > >> - BLKIF_MAX_SEGMENTS_PER_REQUEST); > >> - return 1; > >> - } > >> + /* Check if we have enough grants to allocate a request */ > >> + if (info->persistent_gnts_c < BLKIF_MAX_SEGMENTS_PER_REQUEST) { > >> + new_persistent_gnts = 1; > >> + if (gnttab_alloc_grant_references( > >> + BLKIF_MAX_SEGMENTS_PER_REQUEST - info->persistent_gnts_c, > >> + &gref_head) < 0) { > >> + gnttab_request_free_callback( > >> + &info->callback, > >> + blkif_restart_queue_callback, > >> + info, > >> + BLKIF_MAX_SEGMENTS_PER_REQUEST); > >> + return 1; > >> + } > >> + } else > >> + new_persistent_gnts = 0; > >> > >> /* Fill out a communications ring structure. */ > >> ring_req = RING_GET_REQUEST(&info->ring, info->ring.req_prod_pvt); > >> @@ -341,18 +366,73 @@ static int blkif_queue_request(struct request *req) > >> BLKIF_MAX_SEGMENTS_PER_REQUEST); > >> > >> for_each_sg(info->sg, sg, ring_req->u.rw.nr_segments, i) { > >> - buffer_mfn = pfn_to_mfn(page_to_pfn(sg_page(sg))); > >> fsect = sg->offset >> 9; > >> lsect = fsect + (sg->length >> 9) - 1; > >> - /* install a grant reference. */ > >> - ref = gnttab_claim_grant_reference(&gref_head); > >> - BUG_ON(ref == -ENOSPC); > >> > >> - gnttab_grant_foreign_access_ref( > >> - ref, > >> + if (info->persistent_gnts_c) { > >> + BUG_ON(llist_empty(&info->persistent_gnts)); > >> + gnt_list_entry = llist_entry( > >> + > >> llist_del_first(&info->persistent_gnts), > >> + struct grant, node); > >> + > >> + ref = gnt_list_entry->gref; > >> + buffer_mfn = pfn_to_mfn(gnt_list_entry->pfn); > >> + info->persistent_gnts_c--; > >> + } else { > >> + ref = > >> gnttab_claim_grant_reference(&gref_head); > >> + BUG_ON(ref == -ENOSPC); > >> + > >> + gnt_list_entry = > >> + kmalloc(sizeof(struct grant), > >> + GFP_ATOMIC); > >> + if (!gnt_list_entry) > >> + return -ENOMEM; > >> + > >> + granted_page = alloc_page(GFP_ATOMIC); > >> + if (!granted_page) { > >> + kfree(gnt_list_entry); > >> + return -ENOMEM; > >> + } > >> + > >> + gnt_list_entry->pfn = > >> + page_to_pfn(granted_page); > >> + gnt_list_entry->gref = ref; > >> + > >> + buffer_mfn = pfn_to_mfn(page_to_pfn( > >> + > >> granted_page)); > >> + gnttab_grant_foreign_access_ref(ref, > >> info->xbdev->otherend_id, > >> - buffer_mfn, > >> - rq_data_dir(req)); > >> + buffer_mfn, 0); > >> + } > >> + > >> + info->shadow[id].grants_used[i] = gnt_list_entry; > >> + > >> + if (rq_data_dir(req)) { > >> + char *bvec_data; > >> + void *shared_data; > >> + > >> + BUG_ON(sg->offset + sg->length > PAGE_SIZE); > >> + > >> + shared_data = kmap_atomic( > >> + pfn_to_page(gnt_list_entry->pfn)); > >> + bvec_data = kmap_atomic(sg_page(sg)); > >> + > >> + /* > >> + * this does not wipe data stored outside the > >> + * range sg->offset..sg->offset+sg->length. > >> + * Therefore, blkback *could* see data from > >> + * previous requests.
This is OK as long as > >> + * persistent grants are shared with just one > >> + * domain. It may need refactoring if This > > .. this (lowercase it pls) > > Done. > > > > >> + * changes > >> + */ > >> + memcpy(shared_data + sg->offset, > >> + bvec_data + sg->offset, > >> + sg->length); > >> + > >> + kunmap_atomic(bvec_data); > >> + kunmap_atomic(shared_data); > >> + } > >> > >> info->shadow[id].frame[i] = mfn_to_pfn(buffer_mfn); > >> ring_req->u.rw.seg[i] = > >> @@ -368,7 +448,8 @@ static int blkif_queue_request(struct request *req) > >> /* Keep a private copy so we can reissue requests when recovering. */ > >> info->shadow[id].req = *ring_req; > >> > >> - gnttab_free_grant_references(gref_head); > >> + if (new_persistent_gnts) > >> + gnttab_free_grant_references(gref_head); > >> > >> return 0; > >> } > >> @@ -480,7 +561,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, > >> u16 sector_size) > >> static void xlvbd_flush(struct blkfront_info *info) > >> { > >> blk_queue_flush(info->rq, info->feature_flush); > >> - printk(KERN_INFO "blkfront: %s: %s: %s\n", > >> + printk(KERN_INFO "blkfront: %s: %s: %s, using persistent grants\n", > > > > HA! By default, eh? > > Yes, you caught me, there's a paragraph in the commit message that > explains that we are using persistent grants in the frontend > unconditionally, since the protocol is compatible (you can have a > persistent blkfront and a non-persistent blkback). It simplifies the > logic in blkfront. Are you OK with it? It is OK, but you should be checking whether the backend supports it. I don't see it checking the info->feature_persistent_grant to print that. > > >> info->gd->disk_name, > >> info->flush_op == BLKIF_OP_WRITE_BARRIER ? > >> "barrier" : (info->flush_op == BLKIF_OP_FLUSH_DISKCACHE ? > >> @@ -707,6 +788,9 @@ static void blkif_restart_queue(struct work_struct > >> *work) > >> > >> static void blkif_free(struct blkfront_info *info, int suspend) > >> { > >> + struct llist_node *all_gnts; > >> + struct grant *persistent_gnt; > >> + > >> /* Prevent new requests being issued until we fix things up. */ > >> spin_lock_irq(&info->io_lock); > >> info->connected = suspend ? > >> @@ -714,6 +798,17 @@ static void blkif_free(struct blkfront_info *info, > >> int suspend) > >> /* No more blkif_request(). */ > >> if (info->rq) > >> blk_stop_queue(info->rq); > >> + > >> + /* Remove all persistent grants */ > >> + if (info->persistent_gnts_c) { > >> + all_gnts = llist_del_all(&info->persistent_gnts); > >> + llist_for_each_entry(persistent_gnt, all_gnts, node) { > >> + gnttab_end_foreign_access(persistent_gnt->gref, 0, > >> 0UL); > >> + kfree(persistent_gnt); > >> + } > >> + info->persistent_gnts_c = 0; > >> + } > >> + > >> /* No more gnttab callback work. */ > >> gnttab_cancel_free_callback(&info->callback); > >> spin_unlock_irq(&info->io_lock); > >> @@ -734,13 +829,42 @@ static void blkif_free(struct blkfront_info *info, > >> int suspend) > >> > >> } > >> > >> -static void blkif_completion(struct blk_shadow *s) > >> +static void blkif_completion(struct blk_shadow *s, struct blkfront_info > >> *info, > >> + struct blkif_response *bret) > >> { > >> int i; > >> - /* Do not let BLKIF_OP_DISCARD as nr_segment is in the same place > >> - * flag. 
*/ > >> - for (i = 0; i < s->req.u.rw.nr_segments; i++) > >> - gnttab_end_foreign_access(s->req.u.rw.seg[i].gref, 0, 0UL); > >> + struct bio_vec *bvec; > >> + struct req_iterator iter; > >> + unsigned long flags; > >> + char *bvec_data; > >> + void *shared_data; > >> + unsigned int offset = 0; > >> + > >> + if (bret->operation == BLKIF_OP_READ) { > >> + /* > >> + * Copy the data received from the backend into the bvec. > >> + * Since bv_len can be different from PAGE_SIZE, we need to > >> + * be sure we are actually copying the data from the right > >> + * shared page. > >> + */ > >> + rq_for_each_segment(bvec, s->request, iter) { > >> + BUG_ON((bvec->bv_offset + bvec->bv_len) > PAGE_SIZE); > >> + i = offset >> PAGE_SHIFT; > > > > Could you also include a comment about the bug you found here, pls? > > There's a comment before the rq_for_each_segment loop that tries to > explain that; do you think the following is better? > > /* > * Copy the data received from the backend into the bvec. > * Since bv_offset can be different than 0, and bv_len different > * than PAGE_SIZE, we have to keep track of the current offset, > * to be sure we are copying the data from the right shared page. Yes. That is good. > */ > > >> + shared_data = kmap_atomic( > >> + pfn_to_page(s->grants_used[i]->pfn)); > >> + bvec_data = bvec_kmap_irq(bvec, &flags); > >> + memcpy(bvec_data, shared_data + bvec->bv_offset, > >> + bvec->bv_len); > >> + bvec_kunmap_irq(bvec_data, &flags); > >> + kunmap_atomic(shared_data); > >> + offset += bvec->bv_len; > >> + } > >> + } > >> + /* Add the persistent grant into the list of free grants */ > >> + for (i = 0; i < s->req.u.rw.nr_segments; i++) { > >> + llist_add(&s->grants_used[i]->node, &info->persistent_gnts); > >> + info->persistent_gnts_c++; > >> + } > >> } > >> > >> static irqreturn_t blkif_interrupt(int irq, void *dev_id) > >> @@ -783,7 +907,7 @@ static irqreturn_t blkif_interrupt(int irq, void > >> *dev_id) > >> req = info->shadow[id].request; > >> > >> if (bret->operation != BLKIF_OP_DISCARD) > >> - blkif_completion(&info->shadow[id]); > >> + blkif_completion(&info->shadow[id], info, bret); > >> > >> if (add_id_to_freelist(info, id)) { > >> WARN(1, "%s: response to %s (id %ld) couldn't be > >> recycled!\n", > >> @@ -942,6 +1066,11 @@ again: > >> message = "writing protocol"; > >> goto abort_transaction; > >> } > >> + err = xenbus_printf(xbt, dev->nodename, > >> + "feature-persistent-grants", "%d", 1); > > > > So it's %u in blkback, but %d in here? Which one should it be? > > %u in both places.
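With the specifier settled, the two ends of the negotiation read cleanly side by side. Condensed from the hunks quoted in this thread, with %u on both sides as agreed (error paths trimmed for brevity):

	/* Frontend (xen-blkfront.c): advertise the feature inside the
	 * connection transaction. */
	err = xenbus_printf(xbt, dev->nodename,
			    "feature-persistent-grants", "%u", 1);
	if (err)
		dev_warn(&dev->dev,
			 "writing persistent grants feature to xenbus");

	/* Backend (xen-blkback/xenbus.c, connect_ring): read the
	 * frontend's advertisement; a missing key simply means an older
	 * frontend, so the feature defaults to off. */
	err = xenbus_gather(XBT_NIL, dev->otherend,
			    "feature-persistent-grants", "%u",
			    &pers_grants, NULL);
	if (err)
		pers_grants = 0;

	be->blkif->vbd.feature_gnt_persistent = pers_grants;
	be->blkif->vbd.overflow_max_grants = 0;

Because either side may be older, each end must cope when the other never writes the key; that is why blkfront keeps using the same wire protocol regardless, and only the backend's mapping behaviour changes.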
> > >> + if (err) > >> + dev_warn(&dev->dev, > >> + "writing persistent grants feature to xenbus"); > >> > >> err = xenbus_transaction_end(xbt, 0); > >> if (err) { > >> @@ -1029,6 +1158,8 @@ static int blkfront_probe(struct xenbus_device *dev, > >> spin_lock_init(&info->io_lock); > >> info->xbdev = dev; > >> info->vdevice = vdevice; > >> + init_llist_head(&info->persistent_gnts); > >> + info->persistent_gnts_c = 0; > >> info->connected = BLKIF_STATE_DISCONNECTED; > >> INIT_WORK(&info->work, blkif_restart_queue); > >> > >> @@ -1093,7 +1224,7 @@ static int blkif_recover(struct blkfront_info *info) > >> req->u.rw.seg[j].gref, > >> info->xbdev->otherend_id, > >> > >> pfn_to_mfn(info->shadow[req->u.rw.id].frame[j]), > >> - > >> rq_data_dir(info->shadow[req->u.rw.id].request)); > >> + 0); > >> } > >> info->shadow[req->u.rw.id].req = *req; > >> > >> -- > >> 1.7.7.5 (Apple Git-26)