
[Xen-devel] [OSSTEST PATCH 17/33] ms-ownerdaemon: Cope with db restart. Retry recording dead tasks.



In chan-destroy-stuff, instead of accessing the db directly, add the
dead task(s) to a queue, and arrange to look at that queue.

Errors are handled by setting an `after' handler which we cancel if we
are successful.

The after handler first requeues a queue run (so that a further retry
will occur if things are still broken) and then attempts to reconnect
to the database.
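
As a rough illustration of the arm-then-cancel pattern described
above, here is a minimal standalone Tcl sketch.  It is not the
daemon's code: flaky-db-write, queue-dead, flush-pending, flush-retry
and the 50ms interval are hypothetical stand-ins (the patch itself
uses jobdb::transaction and derives the interval from
OwnerDaemonDbRetry).  It assumes that an error raised inside the
handler propagates to the event loop, which leaves the timer armed.

```tcl
# Hypothetical stand-in for the database: the first two writes fail.
set attempts 0
proc flaky-db-write {tasks} {
    global attempts
    incr attempts
    if {$attempts <= 2} { error "db unavailable (attempt $attempts)" }
}

set pending {}
set recorded {}
set retry_ms 50   ;# assumed interval, chosen only for this sketch

proc queue-dead {task} {
    global pending
    lappend pending $task
    after idle flush-pending
}

proc flush-pending {} {
    global pending recorded retry_ms done
    if {![llength $pending]} return

    # Arm the error handler first; it stays armed if the write throws.
    set eafter [after $retry_ms flush-retry]

    flaky-db-write $pending   ;# on error, control leaves the proc here

    # Success: disarm the handler and empty the queue.
    after cancel $eafter
    lappend recorded {*}$pending
    set pending {}
    set done 1
}

proc flush-retry {} {
    # Requeue the run first, so a failure below still leads to a retry.
    after idle flush-pending
    # A real daemon would attempt to reconnect to the database here.
}

# Report errors escaping from event handlers instead of dumping a trace.
proc bgerror {msg} { puts "background error: $msg" }

queue-dead 101
queue-dead 102
vwait done
puts "recorded=$recorded attempts=$attempts"
```

Run under tclsh, the early write attempts fail, the armed timers
fire, and a retried run records both queued tasks.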

I have tested this with a test instance by renaming the `tasks' table
under its feet, and it functions as expected.

DEPLOYMENT NOTE: The owner daemon cannot be restarted without shutting
everything down.  So this update should first be deployed in
Cambridge, probably, to see how it goes.  Also, it is less critical in
the main Xen production test lab because there the db and the owner
daemon are co-hosted on the same VM.

Signed-off-by: Ian Jackson <Ian.Jackson@xxxxxxxxxxxxx>
Acked-by: Ian Campbell <ian.campbell@xxxxxxxxxx>
---
v2: Put back the `unset tasks' which was mistakenly removed.  The
    effect of its lack is to fail to clear out the task list for
    previous uses of the channel (which is named after the fd); this
    is mostly harmless apart from log spam but causes the usual
    case to be something like
       OK created-task 456354 ownd [10.80.227.94]:44852-876
    rather than
       OK created-task 456354 ownd [10.80.227.94]:44852-876
    which some of the clients (rightly) don't expect.
---
 Osstest/Executive.pm |  1 +
 ms-ownerdaemon       | 38 ++++++++++++++++++++++++++++++++++----
 2 files changed, 35 insertions(+), 4 deletions(-)

diff --git a/Osstest/Executive.pm b/Osstest/Executive.pm
index 468031c..0602925 100644
--- a/Osstest/Executive.pm
+++ b/Osstest/Executive.pm
@@ -113,6 +113,7 @@ augmentconfigdefaults(
 augmentconfigdefaults(
     OwnerDaemonHost => $c{ControlDaemonHost},
     QueueDaemonHost => $c{ControlDaemonHost},
+    OwnerDaemonDbRetry => $c{QueueDaemonRetry},
 );
 
 #---------- configuration reader etc. ----------
diff --git a/ms-ownerdaemon b/ms-ownerdaemon
index 3623d19..62ca645 100755
--- a/ms-ownerdaemon
+++ b/ms-ownerdaemon
@@ -22,16 +22,38 @@
 source ./tcl/daemonlib.tcl
 
 
+set dead_tasks {}
+
 proc chan-destroy-stuff {chan} {
+    global dead_tasks
+
     upvar #0 chanawait($chan) await
     catch { unset await }
 
     upvar #0 chantasks($chan) tasks
     if {![info exists tasks]} return
 
+    puts-chan-desc $chan "-- $tasks"
+
+    foreach task $tasks {
+       lappend dead_tasks $task
+    }
+    unset tasks
+    after idle record-dead-tasks
+}
+
+proc record-dead-tasks {} {
+    global c dead_tasks
+
+    if {![llength $dead_tasks]} return
+
+    puts "record-dead-tasks ... $dead_tasks"
+
+    set retry [expr {$c(OwnerDaemonDbRetry) * 1000}]
+    set eafter [after $retry record-dead-tasks-retry]
+
     jobdb::transaction resources {
-        puts-chan-desc $chan "-- $tasks"
-        foreach task $tasks {
+        foreach task $dead_tasks {
             jobdb::db-execute "
                 UPDATE tasks
                    SET live = 'f'
@@ -39,12 +61,20 @@ proc chan-destroy-stuff {chan} {
             "
         }
     }
-    puts-chan-desc $chan "== $tasks"
-    unset tasks
 
+    after cancel $eafter
+    puts "record-dead-tasks OK. $dead_tasks"
+    set dead_tasks {}
     after idle await-endings-notify
 }
 
+proc record-dead-tasks-retry {} {
+    after idle record-dead-tasks
+    puts "** reconnecting/retrying **"
+    catch { jobdb::db-close }
+    jobdb::db-open
+}
+
 proc await-endings-notify {} {
     global chanawait
     foreach chan [array names chanawait] {
-- 
2.1.4

