
Re: [Xen-devel] vmx: VT-d posted-interrupt core logic handling



On 10/03/16 10:18, Jan Beulich wrote:
>>>> On 10.03.16 at 11:05, <kevin.tian@xxxxxxxxx> wrote:
>>>  From: Tian, Kevin
>>> Sent: Thursday, March 10, 2016 5:20 PM
>>>
>>>> From: Jan Beulich [mailto:JBeulich@xxxxxxxx]
>>>> Sent: Thursday, March 10, 2016 5:06 PM
>>>>
>>>>
>>>>> There are many linked-list usages in the Xen hypervisor today, each
>>>>> with a different theoretical maximum length. The closest one to PI
>>>>> might be the usage in tmem (pool->share_list), which is page based
>>>>> and so could grow 'overly large'. Other examples are orders of
>>>>> magnitude lower, e.g. s->ioreq_vcpu_list in the ioreq server (which
>>>>> could be 8K in the above example), and d->arch.hvm_domain.msixtbl_list
>>>>> in MSI-X virtualization (which could be 2^11 per spec). Do we also
>>>>> want to create some artificial scenarios to examine them, since in
>>>>> actual operation lists of thousands of entries may also become a
>>>>> problem?
>>>>>
>>>>> I just want to figure out how best we can solve all related
>>>>> linked-list usages in the current hypervisor.
>>>>
>>>> As you say, those are (perhaps with the exception of tmem, which
>>>> isn't supported anyway due to XSA-15, and which therefore also
>>>> isn't on by default) on the order of a few thousand list elements.
>>>> And as mentioned above, different bounds apply to lists traversed
>>>> in interrupt context vs. those traversed only in "normal" context.
>>>>
>>>
>>> That's a good point. Interrupt context should have more restrictions.
>>
>> Hi, Jan,
>>
>> I'm thinking your earlier idea about evenly distributed list:
>>
>> --
>> Ah, right, I think that limitation was mentioned before, yet I've
>> forgotten about it again. But that only slightly alters the
>> suggestion: Distributing vCPU-s evenly would then require changing
>> their placement on the pCPU in the course of entering blocked
>> state.
>> --
>>
>> Actually, after more thinking, there is no hard requirement that the
>> vcpu block on the pcpu which is configured in the 'NDST' field of
>> that vcpu's PI descriptor. What really matters is that the vcpu is
>> added to the linked list of that very pcpu, so that when a PI
>> notification comes we can always find the vcpu struct from that
>> pcpu's linked list. Of course, one drawback of such placement is the
>> additional IPI incurred in the wake-up path.
>>
>> Then one possible optimized policy within vmx_vcpu_block could be
>> (say VCPU1 is currently blocked on PCPU1):
>>
>> - As long as the number of vcpus in the linked list on PCPU1 is
>> below a threshold (say 16), add VCPU1 to that list and set NDST to
>> PCPU1. Upon a PI notification on PCPU1, the local linked list is
>> searched to find VCPU1, which is then unblocked on PCPU1;
>>
>> - Otherwise, add VCPU1 to PCPU2 based on a simple distribution
>> algorithm (e.g. based on vcpu_id/vm_id). VCPU1 still blocks on
>> PCPU1, but NDST is set to PCPU2. Upon a notification on PCPU2, the
>> local linked list is searched to find VCPU1, and an IPI is then
>> sent to PCPU1 to unblock VCPU1.
> 
> Sounds possible, if the lock handling can be got right. But of
> course there can't be any hard limit like 16, at least not on its
> own (on a system with extremely many mostly idle vCPU-s we'd need
> to allow larger counts - see my earlier explanations in this
> regard).
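
[Editorial note: for concreteness, below is a minimal C sketch of the
policy quoted above. All identifiers here (pi_blocked_list,
pi_block_entry, pi_hash_pcpu(), pi_set_ndst()) are hypothetical
stand-ins rather than names actually in the tree, and 16 is just the
illustrative threshold from the quote.]

#include <xen/list.h>
#include <xen/percpu.h>
#include <xen/sched.h>
#include <xen/spinlock.h>

/* Hypothetical per-pCPU list of vCPU-s blocked awaiting a PI notification. */
struct pi_blocked_list {
    spinlock_t lock;
    struct list_head list;
    unsigned int count;
};
static DEFINE_PER_CPU(struct pi_blocked_list, pi_blocked);

#define PI_LIST_THRESHOLD 16   /* illustrative only, not a hard limit */

/* Called when vCPU 'v' blocks on pCPU 'cpu'. */
static void pi_block_vcpu(struct vcpu *v, unsigned int cpu)
{
    struct pi_blocked_list *pb = &per_cpu(pi_blocked, cpu);
    unsigned int dst = cpu;

    spin_lock(&pb->lock);
    if ( pb->count >= PI_LIST_THRESHOLD )
    {
        /* Local list already "full": queue on another pCPU instead. */
        spin_unlock(&pb->lock);
        dst = pi_hash_pcpu(v);             /* e.g. based on vcpu_id/vm_id */
        pb = &per_cpu(pi_blocked, dst);
        spin_lock(&pb->lock);
    }

    list_add_tail(&v->arch.hvm_vmx.pi_block_entry, &pb->list);
    pb->count++;
    spin_unlock(&pb->lock);

    /*
     * NDST tells the hardware which pCPU to send the notification to.
     * It need not match the pCPU the vCPU blocks on, only the pCPU
     * whose list the vCPU was added to, so the handler there can find
     * it.
     */
    pi_set_ndst(v, dst);
}

The one real subtlety is the lock handling Jan points to: the per-pCPU
list lock is also taken by the notification handler in interrupt
context, so the non-interrupt side would need the _irqsave locking
variants.
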

A lot of the scheduling code uses spin_trylock() to just skip over pcpus
that are busy when doing this sort of load-balancing.  Using a hash to
choose a default and then cycling through pcpus until you find one whose
lock you can grab should be reasonably efficient.
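
[A minimal sketch of that pattern, assuming a hypothetical
pcpu_list_lock() accessor for the per-pCPU list lock; the function
returns with the chosen pCPU's lock held, so the caller can add to
that pCPU's list without a separate pick/lock race:]

static unsigned int pick_pcpu_trylock(const struct vcpu *v)
{
    unsigned int hash = (v->vcpu_id + v->domain->domain_id) % nr_cpu_ids;
    unsigned int start = cpumask_cycle(hash, &cpu_online_map);
    unsigned int cpu = start;

    do {
        if ( spin_trylock(pcpu_list_lock(cpu)) )
            return cpu;                     /* caller holds the lock */
        cpu = cpumask_cycle(cpu, &cpu_online_map);
    } while ( cpu != start );

    spin_lock(pcpu_list_lock(start));       /* all busy: wait for default */
    return start;
}
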

Re "an IPI is sent to PCPU1", all that should be transparent to the PI
code -- it already calls vcpu_unblock(), which will call vcpu_wake(),
which calls the scheduling wake code, which will DTRT.
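
[In other words, a notification handler along these lines - again
using the hypothetical names from the sketch above, with pi_test_on()
standing in for the check of the descriptor's outstanding-notification
bit - never needs to know where the vCPU actually blocks:]

static void pi_notification_handler(void)
{
    struct pi_blocked_list *pb = &this_cpu(pi_blocked);
    struct vcpu *v, *tmp;

    spin_lock(&pb->lock);
    list_for_each_entry_safe ( v, tmp, &pb->list,
                               arch.hvm_vmx.pi_block_entry )
    {
        if ( pi_test_on(v) )   /* notification outstanding for this vCPU? */
        {
            list_del(&v->arch.hvm_vmx.pi_block_entry);
            pb->count--;
            /*
             * vcpu_unblock() -> vcpu_wake() -> scheduler, which sends
             * an IPI to v's own pCPU if the vCPU blocks elsewhere.
             */
            vcpu_unblock(v);
        }
    }
    spin_unlock(&pb->lock);
}
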

FWIW I have much less objection to this sort of solution if it were
confined to the PI arch_block() callback, rather than something that
required changes to the schedulers.

 -George


 

