Re: [PATCH v10 06/25] tools/xenstore: add live update command to xenstore-control
On Tue, 2020-12-15 at 17:35 +0100, Juergen Gross wrote:
> [...]
> +static int live_update_start(struct xs_handle *xsh, bool force, unsigned int to)
> +{
> +    int len = 0;
> +    char *buf = NULL, *ret;
> +    time_t time_start;
> +
> +    if (asprintf(&ret, "%u", to) < 0)
> +        return 1;
> +    len = add_to_buf(&buf, "-s", len);
> +    len = add_to_buf(&buf, "-t", len);
> +    len = add_to_buf(&buf, ret, len);
> +    free(ret);
> +    if (force)
> +        len = add_to_buf(&buf, "-F", len);
> +    if (len < 0)
> +        return 1;
> +
> +    for (time_start = time(NULL); time(NULL) - time_start < to;) {
> +        ret = xs_control_command(xsh, "live-update", buf, len);
> +        if (!ret)
> +            goto err;
> +        if (strcmp(ret, "BUSY"))
> +            break;
> +    }

We've discovered issues with this during testing: when a live-update is not
possible (e.g. a guest with an active transaction held open on purpose),
xenstored gets flooded with live-update requests until the timeout is
reached. This is problematic for multiple reasons:

* flooding xenstored reduces its throughput and makes it use 100% CPU.
  Depending on the implementation and configuration it may also flood the
  logs (which isn't fatal, the system still stays alive if you hit ENOSPC on
  /var/log, but it makes it difficult to debug issues if a live update gets
  you to ENOSPC, as you may not see the actual failures from after the live
  update).

* flooding xenstored causes the evtchn to overflow, and AFAICT neither
  xenstored nor oxenstored is capable of coping with that (oxenstored goes
  into an infinite loop when that happens). IIUC this needs to be fixed in
  the kernel, so it doesn't return EFBIG in the first place. As a workaround
  I added a sleep(1) in this loop.

* the timeout is hit on both the client and the server side, but depending
  on rounding it almost always happens on the client side first, which
  prevents us from displaying a more informative error message from the
  server. As a workaround I increased the client-side timeout by 2s to cope
  with rounding and give the server a chance to reply. Oxenstored can reply
  with a list of domains preventing the live-update, for example.

My workarounds looked like this:

@@ -42,6 +43,9 @@ static int live_update_start(struct xs_handle *xsh, bool force, unsigned int to)
         len = add_to_buf(&buf, "-F", len);
     if (len < 0)
         return 1;
+    /* +1 for rounding issues
+     * +1 to give oxenstored a chance to timeout and report back first */
+    to += 2;

     for (time_start = time(NULL); time(NULL) - time_start < to;) {
         ret = xs_control_command(xsh, "live-update", buf, len);

@@ -49,6 +53,12 @@ static int live_update_start(struct xs_handle *xsh, bool force, unsigned int to)
             goto err;
         if (strcmp(ret, "BUSY"))
             break;
+        /* TODO: use task ID for commands, avoid busy loop polling here
+         * oxenstored checks BUSY condition internally on every main loop iteration anyway.
+         * Avoid flooding xenstored with live-update requests.
+         * The flooding can also cause the evtchn to overflow in xenstored which makes
+         * xenstored enter an infinite loop */
+        sleep(1);
     }
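(For reference, with both hunks applied the polling loop ends up looking like
this; the error-handling tail after the loop is omitted, as in the quote
above.)

    /* +1 for rounding issues
     * +1 to give oxenstored a chance to timeout and report back first */
    to += 2;

    for (time_start = time(NULL); time(NULL) - time_start < to;) {
        ret = xs_control_command(xsh, "live-update", buf, len);
        if (!ret)
            goto err;
        if (strcmp(ret, "BUSY"))
            break;
        /* back off so repeated requests don't flood xenstored or overflow
         * the evtchn */
        sleep(1);
    }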
This also required various heuristics in oxenstored to differentiate between
a new live-update command and a query for the status of an already completed
live-update, otherwise I almost always ended up doing two live-updates in a
row (the first one queued up, returned BUSY and completed after a while, and
then another one started from all the repeated live-update requests).

I'd prefer it if there was a more asynchronous protocol available for
live-update:

* the live-update command on its own queues up the live update request and
  returns a generation ID. xenstored then rechecks on its own, during each of
  its main loop iterations, whether the conditions for live-update are now
  met

* when a generation ID is specified with the live-update command it acts as
  a query for the status: a query for a generation ID < the current ID
  returns success, and for a generation ID = the current ID it returns either
  BUSY, or a specific error if known (e.g. timeout reached and the list of
  domains preventing the live update)

* the generation ID gets incremented every time a live update succeeds

This would eliminate the need for a client-side timeout, since each of these
commands would be non-blocking. It would also avoid the busy-poll flood.

Obviously any change here has to be backwards compatible with the already
deployed xenstored and oxenstored, which don't know about generation IDs,
but you can tell them apart based on the reply: if you get back
OK/BUSY/some error it is the old version, if you get back a generation ID it
is the new version.

I ran out of time to implement this before the embargo was up. Should I go
ahead with prototyping a patch for this now, or do you see an alternative way
to make the live-update command more robust?
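If I do, a very rough sketch of what the client side could look like is
below. This is purely illustrative and nothing in it exists today: the
numeric generation-ID reply, the "OK" success string and passing the
generation ID back as the query payload are all assumptions of mine, not
part of the posted series.

/* Hypothetical client side of the generation-ID protocol sketched above.
 * Reply formats ("OK", "BUSY", numeric generation ID) are assumptions. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <xenstore.h>

static int live_update_async(struct xs_handle *xsh, char *buf, int len)
{
    char *ret, query[32];
    unsigned long gen;

    /* Queue up the live update.  A new xenstored would reply with a
     * generation ID; an old one replies with OK/BUSY/an error string. */
    ret = xs_control_command(xsh, "live-update", buf, len);
    if (!ret)
        return 1;
    if (!*ret || strspn(ret, "0123456789") != strlen(ret)) {
        /* Old-style reply: fall back to today's behaviour. */
        fprintf(stderr, "old xenstored, reply: %s\n", ret);
        free(ret);
        return 1;
    }
    gen = strtoul(ret, NULL, 10);
    free(ret);

    /* Poll for completion.  While our request is pending the current
     * generation equals 'gen' and the query returns BUSY (or a specific
     * error, e.g. the list of domains blocking the live update).  Once the
     * live update has happened the generation is larger and the query for
     * 'gen' returns success.  No client-side timeout is needed since each
     * call is non-blocking. */
    for (;;) {
        snprintf(query, sizeof(query), "%lu", gen);
        ret = xs_control_command(xsh, "live-update", query,
                                 strlen(query) + 1);
        if (!ret)
            return 1;
        if (strcmp(ret, "BUSY")) {
            int ok = !strcmp(ret, "OK");   /* "OK" assumed for success */

            printf("live-update: %s\n", ret);
            free(ret);
            return ok ? 0 : 1;
        }
        free(ret);
        sleep(1);
    }
}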
Best regards,
--Edwin