
Re: [Xen-devel] [PATCH 3/3] xen: optimize xenbus driver for multiple concurrent xenstore accesses



On 09/01/17 22:17, Boris Ostrovsky wrote:
> On 01/06/2017 10:05 AM, Juergen Gross wrote:
>> Handling of multiple concurrent Xenstore accesses through the xenbus
>> driver, either from the kernel or from user land, is rather lame today:
>> xenbus is capable of having only one access active at a time.
>>
>> Rewrite xenbus to handle multiple requests concurrently by making use
>> of the request id of the Xenstore protocol. This requires the
>> following changes:
>>
>> - Instead of blocking inside xb_read() when trying to read data from
>>   the xenstore ring buffer, do so only in the main loop of
>>   xenbus_thread().
>>
>> - Instead of doing writes to the xenstore ring buffer in the context of
>>   the caller, just queue the request and do the write in the dedicated
>>   xenbus thread.
>>
>> - Instead of just forwarding the request id specified by the caller of
>>   xenbus to xenstore, use a xenbus-internal unique request id. This
>>   will allow multiple outstanding requests.
>>
>> - Modify the locking scheme in order to allow multiple requests to be
>>   active in parallel.
>>
>> - Instead of waiting for the reply to a user's xenstore request after
>>   writing the request to the xenstore ring buffer, return directly to
>>   the caller and do the waiting in the read path.
>>
>> Additionally, signal handling was optimized by avoiding waking up the
>> xenbus thread or sending an event to Xenstore in case the addressed
>> entity is known to be running already.
>>
>> As a result, communication with Xenstore is sped up by a factor of up
>> to 5: depending on the request type (read or write) and the amount of
>> data transferred, the gain was at least 20% (small reads) and went up
>> to a factor of 5 for large writes.
>>
>> Finally, some more rough edges of xenbus have been smoothed:
>>
>> - Handling of memory shortage when reading from the xenstore ring
>>   buffer in the xenbus driver was not optimal: it was busy-looping
>>   and issuing a warning on each iteration.
>>
>> - In case xenstore is running not in dom0 but in a stubdom, we ended
>>   up with two xenbus threads, as the initialization of xenbus in dom0,
>>   which expects a local xenstored, is redone later when connecting to
>>   the xenstore domain. Up to now this was no problem, as locking
>>   prevented the two xenbus threads from interfering with each other,
>>   but it was a waste of kernel resources.
>>
>> - An out-of-memory situation while writing to or reading from the
>>   xenstore ring buffer will no longer lead to a possible loss of
>>   synchronization with xenstore.
>>
>> - The user read and write paths are now interruptible by signals.
>>
>> Signed-off-by: Juergen Gross <jgross@xxxxxxxx>
>> ---
>> I'm aware that the changes are quite large. I thought about sending a
>> version split into multiple patches, but a lot of lines would have been
>> touched by more than one patch. I still have the multiple-patch variant
>> lying around - this patch is split into 11 smaller ones. While all
>> steps of that larger series are operational, some steps are not
>> optimal, as they are even slower than the original version of xenbus.
>>
>> Nevertheless I can send the large series if there are requests for it.
> 
> I will comment only on the xen_comms changes for now since otherwise I
> am afraid it may be difficult to keep track of the conversation.

Okay.

>> diff --git a/drivers/xen/xenbus/xenbus_comms.c 
>> b/drivers/xen/xenbus/xenbus_comms.c
>> index c21ec02..fa054ca 100644
>> --- a/drivers/xen/xenbus/xenbus_comms.c
>> +++ b/drivers/xen/xenbus/xenbus_comms.c
>> @@ -34,6 +34,7 @@
>>  
>>  #include <linux/wait.h>
>>  #include <linux/interrupt.h>
>> +#include <linux/kthread.h>
>>  #include <linux/sched.h>
>>  #include <linux/err.h>
>>  #include <xen/xenbus.h>
>> @@ -42,11 +43,40 @@
>>  #include <xen/page.h>
>>  #include "xenbus.h"
>>  
>> +struct xs_thread_state_write {
>> +    struct xb_req_data *req;
>> +    int idx;
>> +    unsigned int used;
> 
> "written" or "sent"?

I don't mind.

>> +};
>> +
>> +struct xs_thread_state_read {
>> +    struct xsd_sockmsg msg;
>> +    char *body;
>> +    union {
>> +            void *alloc;
>> +            struct xs_watch_event *watch;
>> +    };
>> +    bool in_msg;
>> +    bool in_hdr;
> 
> It may be better to keep track of which state we are in using a bitmap.
> Otherwise it is easy to lose track of one or the other.

Hmm, really? It's rather easy:
in_msg: are we processing any message?
in_hdr: are we processing the message header (in_msg is true)?
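For reference, a minimal sketch of the bitmap alternative (the flag
names here are made up for illustration, not from the patch); the two
bools above encode the same three reachable states:

/* Hypothetical flag-based variant of the two bools: */
#define XS_READ_IN_MSG  (1U << 0)   /* processing a message */
#define XS_READ_IN_HDR  (1U << 1)   /* still reading its header */

struct xs_thread_state_read {
        struct xsd_sockmsg msg;
        /* ... other members as in the patch ... */
        unsigned int state;         /* XS_READ_* bits instead of bools */
        unsigned int used;
};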

>> +    unsigned int used;
> 
> "read" or"received"?

Sure, can change.

>> +};
> 
> Both of these are private to process_msg()/process_writes() so perhaps
> they can be declared in those routines' scopes.

I can do this if you want.

>>              /* Read indexes, then verify. */
>>              cons = intf->req_cons;
>>              prod = intf->req_prod;
>> @@ -115,59 +146,57 @@ int xb_write(const void *data, unsigned len)
>>                      intf->req_cons = intf->req_prod = 0;
>>                      return -EIO;
>>              }
>> -
>> -            dst = get_output_chunk(cons, prod, intf->req, &avail);
>> -            if (avail == 0)
>> -                    continue;
>> -            if (avail > len)
>> -                    avail = len;
>> +            if (!xb_data_to_write())
>> +                    return bytes;
>>  
>>              /* Must write data /after/ reading the consumer index. */
>>              virt_mb();
>>  
>> +            dst = get_output_chunk(cons, prod, intf->req, &avail);
>> +            if (avail == 0)
>> +                    continue;
> 
> Should we continue the loop here or return? We are waiting for the
> reader to get stuff off the ring.

avail == 0 can happen only if the reader has just modified req_cons
between us reading it into cons and testing for free space via
xb_data_to_write(). So the (local) retry should happen only very
rarely, and at most once before any further bytes are written. We
could return, of course, but then the retry would happen anyway
via the main thread loop.
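
For clarity, the ordering in the quoted xb_write() hunk with the race
window annotated (condensed from the code above):

cons = intf->req_cons;      /* snapshot; may go stale right away */
prod = intf->req_prod;
...
if (!xb_data_to_write())    /* fresh read of req_cons: ring full or
                               nothing left queued -> back to caller */
        return bytes;

virt_mb();                  /* read indexes before writing data */

dst = get_output_chunk(cons, prod, intf->req, &avail);
if (avail == 0)             /* reader advanced req_cons between our
                               snapshot and xb_data_to_write() */
        continue;           /* loop to take a fresh snapshot */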

>> +static int process_msg(void)
>> +{
>> +    static struct xs_thread_state_read state;
>> +    struct xb_req_data *req;
>> +    int err;
>> +    unsigned int len;
>> +
>> +    if (!state.in_msg) {
>> +            state.in_msg = true;
>> +            state.in_hdr = true;
>> +            state.used = 0;
>> +
>> +            /*
>> +             * We must disallow save/restore while reading a message.
>> +             * A partial read across s/r leaves us out of sync with
>> +             * xenstored.
>> +             */
>> +            mutex_lock(&xs_response_mutex);
>> +
>> +            if (!xb_data_to_read()) {
>> +                    /* We raced with save/restore: pending data 'gone'. */
>> +                    mutex_unlock(&xs_response_mutex);
>> +                    state.in_msg = false;
>> +                    return 0;
>> +            }
>> +    }
>> +
>> +    if (state.in_hdr) {
>> +            if (state.used != sizeof(state.msg)) {
>> +                    err = xb_read((void *)&state.msg + state.used,
>> +                                  sizeof(state.msg) - state.used);
>> +                    if (err < 0)
>> +                            goto out;
>> +                    state.used += err;
>> +                    if (state.used != sizeof(state.msg))
>> +                            return 0;
> 
> Would it be possible to do locking at the caller? I understand that you
> are trying to hold the lock across multiple invocations of this function
> but it feels somewhat counter-intuitive and bug-prone.

I think that would be difficult.

> If it's not possible then at least please add a comment explaining
> locking algorithm.

Okay. Something like:

/*
 * xs_response_mutex is locked as long as we are processing one
 * message. state.in_msg will be true as long as we are holding the
 * lock in process_msg().
 */

>> +                    if (state.msg.len > XENSTORE_PAYLOAD_MAX) {
>> +                            err = -EINVAL;
>> +                            goto out;
>> +                    }
>> +            }
>> +
>> +            len = state.msg.len + 1;
>> +            if (state.msg.type == XS_WATCH_EVENT)
>> +                    len += sizeof(*state.watch);
>> +
>> +            state.alloc = kmalloc(len, GFP_NOIO | __GFP_HIGH);
> 
> Why can't you kmalloc to state.body only when type!=XS_WATCH_EVENT ?

I need to read the watch data, too.
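
To illustrate why: for XS_WATCH_EVENT the single allocation has to hold
the xs_watch_event structure followed by the message body (layout
inferred from the union and the state.watch->body assignment in the
surrounding hunks):

/*
 * XS_WATCH_EVENT:
 *   state.alloc -> [ struct xs_watch_event | body (msg.len + 1) ]
 *                    ^state.watch            ^state.body
 *
 * all other types:
 *   state.alloc -> [ body (msg.len + 1) ]
 *                    ^state.body (== state.alloc)
 */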

>> +            if (!state.alloc)
>> +                    return -ENOMEM;
>> +
>> +            if (state.msg.type == XS_WATCH_EVENT)
>> +                    state.body = state.watch->body;
>> +            else
>> +                    state.body = state.alloc;
>> +            state.in_hdr = false;
>> +            state.used = 0;
>> +    }

>> +static int process_writes(void)
>> +{
>> +    static struct xs_thread_state_write state;
>> +    void *base;
>> +    unsigned int len;
>> +    int err = 0;
>> +
>> +    if (!xb_data_to_write())
>> +            return 0;
>> +
>> +    mutex_lock(&xb_write_mutex);
>> +
>> +    if (!state.req) {
>> +            state.req = list_first_entry(&xb_write_list,
>> +                                         struct xb_req_data, list);
>> +            state.idx = -1;
>> +            state.used = 0;
>> +    }
>> +
>> +    if (state.req->state == xb_req_state_aborted)
>> +            goto out_err;
>> +
>> +    while (state.idx < state.req->num_vecs) {
>> +            if (state.idx < 0) {
>> +                    base = &state.req->msg;
>> +                    len = sizeof(state.req->msg);
>> +            } else {
>> +                    base = state.req->vec[state.idx].iov_base;
>> +                    len = state.req->vec[state.idx].iov_len;
>> +            }
>> +            err = xb_write(base + state.used, len - state.used);
>> +            if (err < 0)
>> +                    goto out_err;
>> +            state.used += err;
>> +            if (state.used != len)
>> +                    goto out;
>> +
>> +            state.idx++;
>> +            state.used = 0;
>> +    }
>> +
>> +    /*
>> +     * You would expect the following to be racy, but as the response is
>> +     * being read by our thread there is no risk of req being freed
>> +     * under our feet.
>> +     */
> 
> I don't think I understand this (and it's missing a "so" or something
> like that between "thread" and "there"). If this is not racy, why are we
> doing this under xb_write_mutex?

You are right. This was a problem in an intermediate stage of
development, but now the freeing of req is done with xb_write_mutex
held. I'll remove the comment.
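
A sketch of what the reply-matching side presumably looks like (that
hunk is not quoted in this thread, so the details here are assumptions,
not the patch's actual code):

/* Reply side, sketched: the request is looked up, unlinked and - if
 * aborted - freed only while xb_write_mutex is held, so
 * process_writes() can never see req freed under its feet. */
mutex_lock(&xb_write_mutex);
list_for_each_entry(req, &xs_reply_list, list) {
        if (req->msg.req_id == state.msg.req_id) {
                list_del(&req->list);
                if (req->state == xb_req_state_aborted)
                        kfree(req);
                else {
                        req->state = xb_req_state_got_reply;
                        wake_up(&req->wq);
                }
                break;
        }
}
mutex_unlock(&xb_write_mutex);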

>> +    list_del(&state.req->list);
>> +    state.req->state = xb_req_state_wait_reply;
>> +    list_add_tail(&state.req->list, &xs_reply_list);
>> +    state.req = NULL;
>> +
>> + out:
>> +    mutex_unlock(&xb_write_mutex);
>> +
>> +    return 0;
>> +
>> + out_err:
>> +    state.req->msg.type = XS_ERROR;
>> +    state.req->err = err;
> 
> You don't seem to need this for xb_req_state_aborted since you are
> freeing state.req. OTOH, why shouldn't aborted requests generate an
> error reply as well?

They do. Before xb_req_state_aborted is set, a possible error is
taken from req (see xs_wait_for_reply()). In case an early error
was returned (EIO in read_reply()), there is nobody waiting for
(another) response.
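
For illustration, the waiter-side ordering being described (a
hypothetical, simplified sketch; the thread only implies this flow):

/* Waiter, sketched: on a signal, take a possible early error from req
 * before marking it aborted; afterwards the xenbus thread owns req and
 * will kfree() it (under xb_write_mutex) when it next touches it.
 * (A reply racing with the signal is ignored here for brevity.) */
if (wait_event_interruptible(req->wq,
                             req->state == xb_req_state_got_reply)) {
        mutex_lock(&xb_write_mutex);
        err = req->err;                     /* early error, e.g. -EIO */
        req->state = xb_req_state_aborted;  /* thread frees req later */
        mutex_unlock(&xb_write_mutex);
        return err ?: -EINTR;
}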

>> +    list_del(&state.req->list);
>> +    if (state.req->state == xb_req_state_aborted)
>> +            kfree(state.req);
>> +    else {
>> +            state.req->state = xb_req_state_got_reply;
>> +            wake_up(&state.req->wq);
>> +    }
>> +
>> +    mutex_unlock(&xb_write_mutex);
>> +
>> +    state.req = NULL;
>> +
>> +    return err;
>> +}
>> +
>> +static int xb_thread_work(void)
>> +{
>> +    return xb_data_to_read() || xb_data_to_write();
>> +}
>> +
>> +static int xenbus_thread(void *unused)
>> +{
>> +    int err;
>> +
>> +    while (!kthread_should_stop()) {
>> +            if (wait_event_interruptible(xb_waitq, xb_thread_work()))
>> +                    continue;
>> +
>> +            err = process_msg();
>> +            if (err == -ENOMEM)
>> +                    schedule();
>> +            else if (err)
>> +                    pr_warn("error %d while reading message\n", err);
>> +
>> +            err = process_writes();
>> +            if (err)
>> +                    pr_warn("error %d while writing message\n", err);
> 
> Is there a chance that errors are persistent and you then spam the log?

Only in case xenstored is spamming the ring buffer with illegal data.
I believe that is rather improbable, and we are doomed in that case
anyway. OTOH it doesn't hurt to switch to pr_warn_ratelimited().
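
For illustration, the thread loop from the patch with the rate-limited
variant swapped in (pr_warn_ratelimited() is the standard helper from
<linux/printk.h>):

err = process_msg();
if (err == -ENOMEM)
        schedule();
else if (err)
        pr_warn_ratelimited("error %d while reading message\n", err);

err = process_writes();
if (err)
        pr_warn_ratelimited("error %d while writing message\n", err);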

> -boris

Thanks for the comments!


Juergen

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 

