
Re: [Xen-devel] Issue policing writes from Xen to PV domain memory



>> Let me step through this. The pagefault for runstate occurs and
>> wait_event() gets called and the ring is full.
>>
>> wait_event():
>> do {
>>     if ( mem_event_wait_try_grab(med, &rc) != -EBUSY )      ==> This will return EBUSY
>>         break;
>>     for ( ; ; ) {
>>         prepare_to_wait(&med->wq);
>>         if ( mem_event_wait_try_grab(med, &rc) != -EBUSY )  ==> This will return EBUSY
>>             break;
>>         wait();                                             ==> Write to runstate area occurs...
>>     }
>>     finish_wait(&med->wq);
>> } while (0)
>>
>> ...now this write to runstate will again cause a pagefault and we will end
>> up in wait_event() again. The previous attempt would have called
>> prepare_to_wait() but finish_wait() was not called.
>> finish_wait()->__finish_wait() is where wqv->esp gets reset to NULL. So now:
>>
>> wait_event():
>> do {
>>     if ( mem_event_wait_try_grab(med, &rc) != -EBUSY )      ==> This will return EBUSY
>>         break;
>>     for ( ; ; ) {
>>         prepare_to_wait(&med->wq);                          ==> Assertion 'wqv->esp == 0' fails
>>         if ( mem_event_wait_try_grab(med, &rc) != -EBUSY )
>>             break;
>>         wait();
>>     }
>>     finish_wait(&med->wq);
>> } while (0)
>>
>> Does this make sense?
>
>Of course. But it still doesn't mean you can replace the wait() here
>with something else (e.g. the effectively spin-wait loop you suggested
>earlier). As said earlier on, you need to find a way to avoid the second
>invocation of wait_event(), rather than a way to make it "work".
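
(To restate the failure mode in a self-contained way: the following is a toy
model in plain C -- not Xen code, the names are reused only for illustration --
of why the nested wait_event() trips the assertion. prepare_to_wait() records
the stack pointer in wqv->esp and only finish_wait()/__finish_wait() clears it,
so a second wait_event() before finish_wait() has run hits
ASSERT(wqv->esp == 0).)

#include <assert.h>
#include <stddef.h>

struct waitqueue_vcpu {
    void *esp;                        /* non-NULL while a wait is in progress */
};

static void prepare_to_wait(struct waitqueue_vcpu *wqv)
{
    assert(wqv->esp == NULL);         /* models ASSERT(wqv->esp == 0) */
    wqv->esp = (void *)wqv;           /* stands in for the saved stack pointer */
}

static void finish_wait(struct waitqueue_vcpu *wqv)
{
    wqv->esp = NULL;                  /* models __finish_wait() resetting esp */
}

int main(void)
{
    struct waitqueue_vcpu wqv = { NULL };

    prepare_to_wait(&wqv);            /* first wait_event(): ring is full */
    /* wait() would run here; the write to the runstate area faults again... */
    prepare_to_wait(&wqv);            /* nested wait_event(): assertion fires */
    finish_wait(&wqv);                /* never reached */
    return 0;
}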

I took your advice and tried to stop the second invocation of wait_event() by
using the method suggested by Andres, which is to merge the recursive faults.

<quote>
1. read last unconsumed
2. compare
3. different? write into ring
4. same? atomic_read consumer index to ensure still unconsumed 
5. consumed? write into ring <still a possible race below, but not a tragedy> 
6. not consumed? drop…
<end quote>
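
In ring terms, steps 4-6 come down to atomically re-reading the consumer index
after the compare. Here is a minimal standalone sketch of that check (the names
and the simplified request layout are hypothetical, not the actual mem_event
code; the real version is in the patch below):

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified request layout -- not the real mem_event_request_t. */
typedef struct {
    uint64_t gfn;
    uint32_t vcpu_id;
    bool access_r, access_w, access_x;
} toy_req_t;

/*
 * Steps 1-6 from the quote above, with hypothetical names.  'last' is the
 * last unconsumed request, 'cons_snapshot' is the consumer index read when
 * 'last' was fetched, and 'cons' points at the live consumer index.
 */
static bool is_unconsumed_duplicate(const toy_req_t *req,
                                    const toy_req_t *last,
                                    const volatile uint32_t *cons,
                                    uint32_t cons_snapshot)
{
    /* Steps 1-3: compare the new request against the last unconsumed one. */
    if ( req->gfn != last->gfn || req->vcpu_id != last->vcpu_id ||
         req->access_r != last->access_r || req->access_w != last->access_w ||
         req->access_x != last->access_x )
        return false;                 /* different -> write into the ring */

    /* Steps 4-5: re-read the consumer index; if it moved, the matching
     * entry has been consumed in the meantime, so write into the ring. */
    if ( __atomic_load_n(cons, __ATOMIC_RELAXED) != cons_snapshot )
        return false;

    /* Step 6: identical and still unconsumed -> drop the duplicate. */
    return true;
}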

Here is the gist of the patch. I have only included the pertinent part for now.

diff --git a/xen/arch/x86/mm/mem_event.c b/xen/arch/x86/mm/mem_event.c
index 9b45504..2b697d7 100644
--- a/xen/arch/x86/mm/mem_event.c
+++ b/xen/arch/x86/mm/mem_event.c
@@ -267,6 +267,49 @@ void mem_event_mark_and_pause(struct vcpu *v, struct mem_event_domain *med)
 }
 
 /*
+ * Check if duplicate requests are being placed in the ring. The check is not
+ * race free but the worst that can happen is the caller detecting an already
+ * consumed request as duplicate. This is safe for sh_page_fault() which is the
+ * only caller of this function.
+ */
+bool_t mem_event_check_duplicate(struct mem_event_domain *med,
+                                 mem_event_request_t *req)
+{
+    mem_event_front_ring_t *front_ring;
+    mem_event_request_t *ureq;
+    RING_IDX req_event;
+    bool_t rc = 0;
+
+    mem_event_ring_lock(med);
+
+    /* Due to the reservations, this step must succeed. */
+    front_ring = &med->front_ring;
+
+    /* Index of last unconsumed request */
+    req_event = front_ring->sring->req_event - 1;
+    ureq = RING_GET_REQUEST(front_ring, req_event);
+
+    if ( req->gfn != ureq->gfn )
+        goto out;
+    if ( current->vcpu_id != ureq->vcpu_id )
+        goto out;
+    if ( req->access_r != ureq->access_r )
+        goto out;
+    if ( req->access_w != ureq->access_w )
+        goto out;
+    if ( req->access_x != ureq->access_x )
+        goto out;
+
+    /* Check consumer index has not moved */
+    if ( req_event == read_atomic(&front_ring->sring->req_event) - 1 )
+        rc = 1;
+
+ out:
+    mem_event_ring_unlock(med);
+    return rc;
+}
+
+/*

diff --git a/xen/arch/x86/mm/shadow/multi.c b/xen/arch/x86/mm/shadow/multi.c
index aee061c..235bfa7 100644
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -3065,22 +3064,6 @@ static int sh_page_fault(struct vcpu *v,
                 break;
             }
 
-            /*
-             * Do not police writes to guest memory emanating from the Xen
-             * kernel. Trying to do so will cause the same pagefault to occur
-             * over and over again with an event being sent to the access
-             * listener for each fault. If the access listener's vcpu is not
-             * scheduled during this time, the violation is never resolved and
-             * will eventually end with the host crashing.
-             */
-            if ( (violation && access_w) &&
-                 (regs->eip >= XEN_VIRT_START && regs->eip <= XEN_VIRT_END) )
-            {
-                violation = 0;
-                rc = p2m->set_entry(p2m, gfn_x(gfn), gmfn, PAGE_ORDER_4K,
-                                    p2m_ram_rw, p2m_access_rw);
-            }
-
             if ( violation )
             {
                 paddr_t gpa = (mfn_x(gmfn) << PAGE_SHIFT) +
@@ -3095,7 +3078,16 @@ static int sh_page_fault(struct vcpu *v,
                                   (access_w ? 'w' : '-'),
                                   (access_x ? 'x' : '-'),
                                   mfn_x(gmfn), p2ma);
-                    /* Rights not promoted, vcpu paused, work here is done */
+
+                    if ( !mem_event_check_duplicate(&d->mem_event->access,
+                                                    req_ptr) )
+                        /* Rights not promoted, vcpu paused, work here is done */
+                        goto out_put_gfn;
+
+                    /* Duplicate event detected. Unpause the vcpu and exit */
+                    xfree(req_ptr);
+                    req_ptr = NULL;
+                    vcpu_unpause(current);
                     goto out_put_gfn;
                 }
             }

With this I am able to prevent recursive identical faults on the runstate area
from causing the assert to fire. However, a corner case still exists: the ring
is full, the last unconsumed event is not for the runstate area, the runstate
area is marked not writable, and the guest or Xen writes to it. Since the
duplicate check cannot match anything in that case, I _think_ this would again
cause the assert to fire. Can you give me some guidance as to what to do in
this situation? Should I special-case the runstate area?

PS: I did try running with a hacked-up version where the maximum number of ring
buffer entries (nr_ents) is 1, and I could not make this situation happen, but
I guess it is still theoretically possible. Or am I mistaken?

Thanks,
Aravindh

