Re: [Xen-devel] [PATCHv2 0/3] Implement per-cpu reader-writer locks
On 25/11/15 08:58, Malcolm Crossley wrote:
> On 24/11/15 18:30, George Dunlap wrote:
>> On 24/11/15 18:16, George Dunlap wrote:
>>> On 20/11/15 16:03, Malcolm Crossley wrote:
>>>> This patch series adds per-cpu reader-writer locks as a generic lock
>>>> implementation and then converts the grant table and p2m rwlocks to
>>>> use the percpu rwlocks, in order to improve multi-socket host performance.
>>>>
>>>> CPU profiling has revealed the rwlocks themselves suffer from severe cache
>>>> line bouncing due to the cmpxchg operation used even when taking a read
>>>> lock. Multiqueue paravirtualised I/O results in heavy contention of the
>>>> grant table and p2m read locks of a specific domain and so I/O throughput
>>>> is bottlenecked by the overhead of the cache line bouncing itself.
>>>>
>>>> Per-cpu read locks avoid lock cache line bouncing by using a per-cpu data
>>>> area to record a CPU has taken the read lock. Correctness is enforced for
>>>> the write lock by using a per lock barrier which forces the per-cpu read
>>>> lock to revert to using a standard read lock. The write lock then polls
>>>> all the percpu data area until active readers for the lock have exited.
>>>>
>>>> Removing the cache line bouncing on a multi-socket Haswell-EP system
>>>> dramatically improves performance, with 16 vCPU network IO performance
>>>> going from 15 gb/s to 64 gb/s! The host under test was fully utilising
>>>> all 40 logical CPU's at 64 gb/s, so a bigger logical CPU host may see an
>>>> even better IO improvement.
>>>
>>> Impressive -- thanks for doing this work.
>
> Thanks, I think the key to isolating the problem was using profiling tools.
> The scale of the overhead would not have been clear without them.
>
>>>
>>> One question: Your description here sounds like you've tested with a
>>> single large domain, but what happens with multiple domains?
>>>
>>> It looks like the "per-cpu-rwlock" is shared by *all* locks of a
>>> particular type (e.g., all domains share the per-cpu p2m rwlock).
>>> (Correct me if I'm wrong here.)
>>
>> Sorry, looking in more detail at the code, it seems I am wrong. The
>> fast-path stores which "slow" lock has been grabbed in the per-cpu
>> variable; so the writer only needs to wait for readers that have grabbed
>> the particular lock it's interested in. So the scenarios I outline
>> below shouldn't really be issues.
>>
>> The description of the algorithm in the changelog could do with a bit
>> more detail. :-)
>
> I'll enhance the description to say "per lock local variable" to make it
> clearer that not all readers will be affected.
>
> BTW, I added to the "To" list because I need your ACK for the patch to the
> p2m code.
>
> Do you have any review comments for that patch?

Yes, I realize that, and I'll get to it. :-)

 -George

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel