Re: [PATCH v4 01/10] evtchn: use per-channel lock where possible
Hi Jan,

On 11/01/2021 10:14, Jan Beulich wrote:
> On 08.01.2021 21:32, Julien Grall wrote:
>> Hi Jan,
>>
>> On 05/01/2021 13:09, Jan Beulich wrote:
>>> Neither evtchn_status() nor domain_dump_evtchn_info() nor
>>> flask_get_peer_sid() need to hold the per-domain lock - they all only
>>> read a single channel's state (at a time, in the dump case).
>>>
>>> Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx>
>>> ---
>>> v4: New.
>>>
>>> --- a/xen/common/event_channel.c
>>> +++ b/xen/common/event_channel.c
>>> @@ -968,15 +968,16 @@ int evtchn_status(evtchn_status_t *statu
>>>      if ( d == NULL )
>>>          return -ESRCH;
>>>
>>> -    spin_lock(&d->event_lock);
>>> -
>>>      if ( !port_is_valid(d, port) )
>>
>> There is one issue that is now becoming more apparent. To be clear, the
>> problem is not in this patch, but I think it is the best place to
>> discuss it as d->event_lock may be part of the solution.
>>
>> After XSA-344, evtchn_destroy() will end up decrementing
>> d->valid_evtchns.
>>
>> Given that evtchn_status() can work on the non-current domain, it would
>> be possible to run it concurrently with evtchn_destroy(). As a
>> consequence, port_is_valid() will be unstable as a valid event channel
>> may turn invalid.
>>
>> AFAICT, we are getting away with it so far, as the memory is not freed
>> until the domain is fully destroyed. However, we re-introduced XSA-338
>> in a different way.
>>
>> To be clear, this is not the fault of this patch. But I don't think it
>> is sane to re-introduce a behavior that led us to an XSA.
>
> I'm getting confused, I'm afraid, from the varying statements above:
> Are you suggesting this patch does re-introduce bad behavior or not?

No. I am pointing out that this is widening the bad behavior (again).

> Yes, the decrementing of ->valid_evtchns has a similar effect, but
> I'm not convinced it gets us into XSA territory again. The problem
> wasn't the reducing of ->max_evtchns as such, but the derived
> assumptions elsewhere in the code. If there were any such again, I
> suppose we'd have reason to issue another XSA.

I don't think it gets us into XSA territory yet.
However, the locking/interaction in the event channel code is quite
complex.

To give a concrete example, below is the current implementation of
free_xen_event_channel():

    if ( !port_is_valid(d, port) )
    {
        /*
         * Make sure ->is_dying is read /after/ ->valid_evtchns, pairing
         * with the spin_barrier() and BUG_ON() in evtchn_destroy().
         */
        smp_rmb();
        BUG_ON(!d->is_dying);
        return;
    }

    evtchn_close(d, port, 0);

It would be fair for a developer to assume that after the check above,
port_is_valid() would keep returning true. However, this is not the
case, because evtchn_destroy() may decrement d->valid_evtchns
concurrently.

I am not aware of any issue so far... But I am not ready to bet that
this is not going to be missed. How about you?

> If there were any such again, I
> suppose we'd have reason to issue another XSA.

The point of my e-mail is to prevent this XSA from happening. I am
pretty sure you want the same.

> Furthermore there are other paths already using port_is_valid()
> without holding the domain's event lock; I've not been able to spot
> a problem with this though, so far.

Right. Most of them are fine because d == current->domain. Therefore,
the domain must be running and evtchn_destroy() couldn't happen
concurrently.

>> I can see two solutions:
>>   1) Use d->event_lock to protect port_is_valid() when
>>      d != current->domain. This would require evtchn_destroy() to grab
>>      the lock when updating d->valid_evtchns.
>>   2) Never decrement d->valid_evtchns and use a different field for
>>      closing ports.
>>
>> I am not a big fan of 1) because this is muddying the already complex
>> locking situation in the event channel code. But I suggested it because
>> I wasn't sure whether you would be happy with 2).
>
> I agree 1) wouldn't be very nice, and you're right in assuming I
> wouldn't like 2) very much. For the moment I'm not (yet) convinced we
> need to do anything at all - as you say yourself, while the result of
> port_is_valid() is potentially unstable when a domain is in the
> process of being cleaned up, the state guarded by such checks remains
> usable in (I think) a race free manner.

It remains usable *today*; the question is how long this will last.

All the recent XSAs in the event channel code taught me that the
locking/interaction is extremely complex. This series is another proof.

We would save ourselves quite a bit of trouble by making
port_is_valid() stable no matter the state of the domain.

I think an extra field (option 2) is quite a good compromise between
space use, maintenance, and speed.

I would be interested to hear from others.

Cheers,

--
Julien Grall
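
To make option 2 more concrete, here is a minimal user-space sketch of the idea. It is a toy model, not Xen code: struct toy_domain, the toy_* helpers and the active_evtchns field are all hypothetical names made up for illustration, and C11 atomics stand in for whatever primitives the hypervisor would actually use. Only valid_evtchns mirrors a real field name.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model, NOT Xen's struct domain. */
struct toy_domain {
    atomic_uint valid_evtchns;   /* may grow, is never decremented */
    atomic_uint active_evtchns;  /* hypothetical field: reset on teardown */
};

static inline void toy_domain_init(struct toy_domain *d, unsigned int nr_ports)
{
    assert(d != NULL);
    atomic_init(&d->valid_evtchns, nr_ports);
    atomic_init(&d->active_evtchns, nr_ports);
}

/* Stable: a port that was valid once stays valid while the domain exists. */
static inline bool toy_port_is_valid(struct toy_domain *d, unsigned int port)
{
    return port < atomic_load_explicit(&d->valid_evtchns,
                                       memory_order_acquire);
}

/* Only paths that care about teardown need to consult the second field. */
static inline bool toy_port_is_active(struct toy_domain *d, unsigned int port)
{
    return port < atomic_load_explicit(&d->active_evtchns,
                                       memory_order_acquire);
}

/* Teardown touches active_evtchns only, so toy_port_is_valid() cannot flip. */
static inline void toy_evtchn_destroy(struct toy_domain *d)
{
    atomic_store_explicit(&d->active_evtchns, 0, memory_order_release);
}
```

With this split, a caller that saw toy_port_is_valid() return true can rely on that answer even if teardown runs concurrently on another CPU. The cost is one extra word per domain, plus an extra check in the few places that need to know whether ports are being closed.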