Xen project Mailing List

[Xen-devel] x86's rwlock scalability and fairness

With RW_LOCK_BIAS being 0x01000000 and with (significantly) more than 128 CPUs in a system, there is a non-zero probability for 129 or more CPUs to simultaneously attempt to acquire an rwlock in write mode, and if sufficiently many (129) of them manage to subtract the bias from the counter subsequent attempts to acquire the lock for reading will succeed independent of whether the lock is already being held in write mode. The probability of this happening is obviously larger in a virtual environment (i.e. when the time window between the execution of two consecutive instructions can be unbounded). Consequently I think RW_LOCK_BIAS needs to be adjusted such that RW_LOCK_BIAS * NR_CPUS * 2 - 1 fits into the lock counter. For 32-bits (with a limit of 512 CPUs) this can be achieved without widening the counter (as even with a bias of 0x00400000 there's still room for 0x2000 nested read acquires on each CPU). For 64-bits, the counter would need to be widened at least for the MAXSMP case - the question is whether it should be widened unconditionally. There's also a fairness concern (again, likely more of a problem when running virtualized): Writers can get delayed unboundedly as long as new readers trickle in, since all writers undo their subtraction of the bias in __write_lock_failed. Simply blocking all future readers can't be done with the nesting semantics Linux uses for rwlocks, so the question is whether another scheme could be created (or had already been proposed - I did find a number of references, like the model sketched out by Linus at http://lkml.indiana.edu/hypermail/linux/kernel/0808.2/2861.html, which admittedly I'm not sure meets the nesting requirement -, but there must be reasons none of them made it into the kernel so far) that would allow nested acquires but disallows new CPUs to acquire the first instance of a lock when there's a waiting writer. One possible scheme would involve per-CPU lists of rwlocks held in read mode - if acquiring fails, the CPU would walk that list and decrement the counter anyway if it finds the lock already present on that list. The first writer to come in would then no longer undo its subtraction of the bias, thus forcing all future readers into the __read_lock_failed path. This, however, would require changes to all users of rwlocks as the linked list elements would need to live on the stack (in the scope of the top-most function common between lock and unlock), which doesn't look very nice. Similarly of concern (for virtualization again, but presumably also for rt) is the lack of interrupt re-enabling in the acquire-failed paths of arch_{read,write}_lock_flags(). Jan _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxxxxxxxx http://lists.xensource.com/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.