
Re: [Xen-devel] [RFC/PATCH v2] XENMEM_claim_pages (subop of existing) hypercall



> Hi Jan --
> 
> Thanks for the quick review!
> 
> Mostly fixed for v3 but a couple things:
> 
>>> +     * Claimed memory is considered unavailable unless the request
>>> +     * is made by a domain with sufficient unclaimed pages.
>>> +     */
>>> +    if ( (total_unclaimed_pages + request >
>>> +           total_avail_pages + tmem_freeable_pages()) &&
>>> +          (d == NULL || d->unclaimed_pages < request) )
>>> +        goto not_found;
>> 
>> The treatment of d being NULL certainly needs further thought:
>> Is it really better to fulfill the claim and fail some (perhaps
>> important) _xmalloc()?
> 
> Ideally, allocation in the presence of existing claims should
> behave as if the claiming domains had actually already allocated
> the unclaimed-amount-of-memory.  So I'd argue that enforcing
> the claim should be sacrosanct here.

Well, are we sure that failing an "anonymous" allocation is not going to 
trigger a BUG_ON? That's a lot of code review. If you get this wrong, Xen 
suddenly crashes when allocating domains close to the max. It doesn't today, 
afaict.
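
For anyone skimming the thread, here is a tiny standalone sketch of the rule 
being debated, with made-up numbers -- this is not Xen code, just the 
arithmetic of the check quoted above:

    /* Standalone illustration of the claim-aware availability rule; all
     * names and numbers are invented, only the arithmetic mirrors the hunk. */
    #include <stdbool.h>
    #include <stdio.h>

    struct dom { unsigned long unclaimed_pages; };

    static bool claim_aware_alloc_ok(unsigned long request,
                                     unsigned long total_unclaimed,
                                     unsigned long total_avail,
                                     unsigned long tmem_freeable,
                                     const struct dom *d)
    {
        /* Claimed-but-unallocated memory counts as unavailable unless the
         * requester is drawing down its own claim. */
        if ( total_unclaimed + request > total_avail + tmem_freeable &&
             (d == NULL || d->unclaimed_pages < request) )
            return false;
        return true;
    }

    int main(void)
    {
        unsigned long GiB = 1UL << 18;              /* 4K pages per GiB */
        struct dom a = { .unclaimed_pages = 3 * GiB };

        /* Host has 4 GiB free, domain A holds a 3 GiB claim. */
        printf("%d\n", claim_aware_alloc_ok(2 * GiB, 3 * GiB, 4 * GiB, 0, NULL));
        printf("%d\n", claim_aware_alloc_ok(2 * GiB, 3 * GiB, 4 * GiB, 0, &a));
        return 0;
    }

Run, it prints 0 then 1: the anonymous 2 GiB request is refused because 
honouring it would eat into the outstanding 3 GiB claim, while the claiming 
domain's own 2 GiB request is allowed.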

> 
>> Also, I'm missing a mechanism by which the tools could find out
>> how much unclaimed memory is available, in order to determine
>> (if in use) how much memory needs to be ballooned out of Dom0.
> 
> OK.  I'm not certain if this will be useful on a per-domain
> basis as well but, for completeness, I will also add
> unclaimed_pages into xc_dominfo etc (which causes a bump
> in XEN_DOMCTL_INTERFACE_VERSION).
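
For concreteness, a rough sketch of the toolstack-visible shape this could 
take -- the field and function names here are guesses for illustration, not 
the actual v3 interface:

    /* Purely illustrative -- guessed names for the shape of the interface
     * described above, not the actual v3 patch. */
    #include <stdint.h>

    typedef struct sketch_dominfo {
        uint32_t domid;
        uint64_t nr_pages;          /* pages currently allocated           */
        uint64_t max_pages;         /* allocation ceiling                  */
        uint64_t unclaimed_pages;   /* claimed but not yet allocated pages */
    } sketch_dominfo_t;

    /* Host-wide figure a toolstack would read to decide how much memory to
     * balloon out of dom0 before staking a new claim. */
    uint64_t sketch_total_unclaimed_pages(void);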
> 
>> Similarly, but perhaps of lower priority, there is no integration
>> with the low-mem handling.
> 
> I'd also consider this lower priority as Olaf and Andre
> have argued that the claim mechanism is not needed for
> sharing/paging, so the two mechanisms may not 
> be used together, at least for the foreseeable future.
> So I plan to skip this, unless you change your mind and
> consider it a showstopper for acceptance.

This is a slippery slope. Let's not work out the interactions with existing 
subsystems before adding code to the tree. What could go wrong?

As a data point for everyone, I've found the low-mem virq extremely useful as a 
sync interrupt signaling to our toolstack that it needs to get its act together 
and start rebalancing memory, if it hasn't yet. I don't see how that would not 
be useful to any other toolstack.
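
Roughly, the consumer side looks like this (a minimal sketch against the libxc 
event-channel calls as I recall them; error handling omitted, so double-check 
the signatures):

    /* Minimal sketch: bind the low-memory virq and treat it as the
     * "rebalance now" signal.  Uses the libxc evtchn calls as I recall
     * them; error handling omitted. */
    #include <stdio.h>
    #include <xenctrl.h>

    int main(void)
    {
        xc_evtchn *xce = xc_evtchn_open(NULL, 0);
        int port = xc_evtchn_bind_virq(xce, VIRQ_ENOMEM);

        for ( ;; )
        {
            int p = xc_evtchn_pending(xce);       /* blocks until an event */
            if ( p == port )
            {
                /* Hypervisor heap is running low: balloon dom0 down,
                 * adjust guest targets, etc. */
                fprintf(stderr, "low-mem virq fired, rebalancing\n");
                xc_evtchn_unmask(xce, port);
            }
        }
    }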

Andres

> 
>> Finally, there still are a number of formatting issues.
> 
> Hmmm... I found one I think.  Is there an equivalent to
> checkpatch for hypervisor code?  If you see any formatting
> issues in v3, please call them out explicitly as I am
> sincerely trying to avoid them.
> 
> Thanks,
> Dan
> 
> 
> 
> ------------------------------
> 
> Message: 4
> Date: Thu, 15 Nov 2012 18:23:09 +0000
> From: Mats Petersson <mats.petersson@xxxxxxxxxx>
> To: Tim Deegan <tim@xxxxxxx>
> Cc: "xen-devel@xxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxx>
> Subject: Re: [Xen-devel] [PATCH V2] xen: vmx: Use an INT 2 call to
>       process real NMI's instead of self_nmi() in VMEXIT handler
> Message-ID: <50A5330D.9020204@xxxxxxxxxx>
> Content-Type: text/plain; charset="ISO-8859-1"; format=flowed
> 
> On 15/11/12 17:44, Tim Deegan wrote:
>> At 17:33 +0000 on 15 Nov (1353000782), Mats Petersson wrote:
>>> On 15/11/12 17:15, Tim Deegan wrote:
>>>> At 17:03 +0000 on 15 Nov (1352998993), Mats Petersson wrote:
>>>>>> On an AMD CPU we _don't_ have dedicated stacks for NMI or MCE when we're
>>>>>> running a HVM guest, so the stack issue doesn't apply (but nested NMIs
>>>>>> are still bad).
>>>>>> 
>>>>>> On an Intel CPU, we _do_ use dedicated stacks for NMI and MCE in HVM
>>>>>> guests.  We don't really have to but it saves time in the context switch
>>>>>> not to update the IDT.  Using do_nmi() here means that the first NMI is
>>>>>> handled on the normal stack instead.  It's also consistent with the way
>>>>>> we call do_machine_check() for the MCE case.  But it needs an explicit
>>>>>> IRET after the call to do_nmi() to make sure that NMIs get re-enabled.
>>>>> Both AMD and Intel have an identical setup with regard to stacks and 
>>>>> general "what happens when we take one of these interrupts".
>>>> My reading of svm_ctxt_switch_{to,from} makes me disagree with this.
>>>> AFAICT, on SVM we're not using dedicated stacks at all.
>>> In SVM, the VMRUN returns to whatever stack you had before the VMRUN.
>>> This is not what I'm talking about, however. The stack used for the NMI
>>> and MCE comes from the interrupt descriptor entry for those respective
>>> vectors.
>> This is the code I was referring to:
>> 
>>     /*
>>      * Cannot use ISTs for NMI/#MC/#DF while we are running with the guest TR.
>>      * But this doesn't matter: the IST is only req'd to handle SYSCALL/SYSRET.
>>      */
>>     idt_tables[cpu][TRAP_double_fault].a  &= ~(7UL << 32);
>>     idt_tables[cpu][TRAP_nmi].a           &= ~(7UL << 32);
>>     idt_tables[cpu][TRAP_machine_check].a &= ~(7UL << 32);
>> 
>> Am I misreading it?
> 
> No, you are reading it perfectly right, I'm wrong...
> 
> --
> Mats
>> 
>>> So in conclusion, the do_mce_exception() call probably should be a
>>> __asm__ __volatile__("int $X"), where X is the relevant vector.
>> This handles MCEs that were raised in guest context.  If we've managed
>> to get this far into the exit handler, the hypervisor stack is probably
>> OK. :)
>> 
>> I'd be happy to invoke the MCE handler through the IDT here, just for
>> symmetry with the other cases, but I don't think it makes much
>> difference.
>> 
>> Tim.
>> 
>> 
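
For reference, the approach being discussed (the INT 2 delivery from the patch 
title, and Mats's suggestion of doing the same for #MC) amounts to re-entering 
the handler through the IDT with a software interrupt -- a sketch, not the 
actual patch:

    /* Sketch of the idea, not the actual patch: deliver the event through
     * the IDT with a software interrupt instead of calling
     * do_nmi()/do_machine_check() directly, so the handler returns with
     * IRET -- which, on the NMI path, re-enables NMIs that were blocked
     * when the real NMI caused the VMEXIT. */
    static inline void self_deliver_nmi(void)
    {
        __asm__ __volatile__ ( "int $2" );    /* vector 2  = NMI */
    }

    static inline void self_deliver_mce(void)
    {
        __asm__ __volatile__ ( "int $18" );   /* vector 18 = #MC */
    }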
> 
> 
> 
> 
> ------------------------------
> 
> Message: 5
> Date: Thu, 15 Nov 2012 13:29:28 -0500
> From: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
> To: Ian Campbell <Ian.Campbell@xxxxxxxxxx>
> Cc: "netdev@xxxxxxxxxxxxxxx" <netdev@xxxxxxxxxxxxxxx>,        ANNIE LI
>       <annie.li@xxxxxxxxxx>,  "xen-devel@xxxxxxxxxxxxxxxxxxx"
>       <xen-devel@xxxxxxxxxxxxxxxxxxx>,        Roger Pau Monne
>       <roger.pau@xxxxxxxxxx>
> Subject: Re: [Xen-devel] [PATCH 0/4] Implement persistent grant in
>       xen-netfront/netback
> Message-ID: <20121115182928.GB22320@xxxxxxxxxxxxxxxxxxx>
> Content-Type: text/plain; charset=iso-8859-1
> 
> On Thu, Nov 15, 2012 at 11:15:06AM +0000, Ian Campbell wrote:
>> On Thu, 2012-11-15 at 10:56 +0000, Roger Pau Monne wrote:
>>> On 15/11/12 09:38, ANNIE LI wrote:
>>>> 
>>>> 
>>>>> On 2012-11-15 15:40, Pasi Kärkkäinen wrote:
>>>>> Hello,
>>>>> 
>>>>> On Thu, Nov 15, 2012 at 03:03:07PM +0800, Annie Li wrote:
>>>>>> This patch implements persistent grants for xen-netfront/netback. The
>>>>>> mechanism maintains page pools in netback/netfront; these page pools are
>>>>>> used to save grant pages which are already mapped, avoiding the cost
>>>>>> otherwise wasted on repeated grant operations.
>>>>>> 
>>>>>> The current netback/netfront does map/unmap grant operations frequently
>>>>>> when transmitting/receiving packets, and grant operations cost a lot of
>>>>>> CPU cycles. With this patch, netfront/netback maps grant pages when
>>>>>> needed and then saves them into a page pool for future use. All these
>>>>>> pages will be unmapped when removing/releasing the net device.
>>>>>> 
>>>>>> 
>>>>> Do you have performance numbers available already? With/without
>>>>> persistent grants?
>>>> I have some simple netperf/netserver test results with/without
>>>> persistent grants.
>>>> 
>>>> Following are the results with the persistent grant patch:
>>>> 
>>>> Guests,      Sum,      Avg,      Min,      Max
>>>>      1,  15106.4,  15106.4, 15106.36, 15106.36
>>>>      2,  13052.7,  6526.34,  6261.81,  6790.86
>>>>      3,  12675.1,  6337.53,  6220.24,  6454.83
>>>>      4,    13194,  6596.98,  6274.70,  6919.25
>>>> 
>>>> 
>>>> Following are the results without the persistent grant patch:
>>>> 
>>>> Guests,      Sum,      Avg,      Min,      Max
>>>>      1,  10864.1,  10864.1, 10864.10, 10864.10
>>>>      2,  10898.5,  5449.24,  4862.08,  6036.40
>>>>      3,  10734.5,  5367.26,  5261.43,  5473.08
>>>>      4,    10924,  5461.99,  5314.84,  5609.14
>>> 
>>> In the block case, a performance improvement is seen when using a large
>>> number of guests; could you perform the same benchmark, increasing the
>>> number of guests to 15?
>> 
>> It would also be nice to see some analysis of the numbers which justify
>> why this change is a good one without every reviewer having to evaluate
>> the raw data themselves. In fact this should really be part of the
>> commit message.
> 
> You mean like a nice graph, eh?
> 
> I will run these patches on my 32GB box and see if I can give you
> a nice PDF/jpg.
> 
>> 
>> Ian.
>> 
> 
> 
> 
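For readers of the digest, a rough sketch of the page-pool idea described in 
the commit message above -- all names here are invented for illustration and 
are not the patch's:

    /* Illustrative only -- invented names, not the xen-netfront/netback
     * patch.  The idea: keep grant pages mapped in a pool and reuse them
     * instead of mapping/unmapping a grant for every packet. */
    #include <stddef.h>

    struct persistent_gnt {
        unsigned int gref;    /* grant reference from the other end */
        void *page;           /* page that stays mapped             */
    };

    #define POOL_SIZE 256
    static struct persistent_gnt pool[POOL_SIZE];

    /* Hypothetical helper standing in for the real grant-mapping call. */
    extern void *sketch_map_grant(unsigned int gref);

    /* Look the grant up in the pool; map it only on first use. */
    static struct persistent_gnt *get_persistent_gnt(unsigned int gref)
    {
        size_t i, free_slot = POOL_SIZE;

        for ( i = 0; i < POOL_SIZE; i++ )
        {
            if ( pool[i].page && pool[i].gref == gref )
                return &pool[i];              /* hit: no grant op needed */
            if ( !pool[i].page && free_slot == POOL_SIZE )
                free_slot = i;
        }

        if ( free_slot == POOL_SIZE )
            return NULL;                      /* pool exhausted */

        /* Miss: map the grant once and cache the mapping for reuse. */
        pool[free_slot].gref = gref;
        pool[free_slot].page = sketch_map_grant(gref);
        return &pool[free_slot];
    }
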
> ------------------------------
> 
> Message: 6
> Date: Thu, 15 Nov 2012 18:29:13 +0000
> From: George Dunlap <george.dunlap@xxxxxxxxxxxxx>
> To: Michael Palmeter <michael.palmeter@xxxxxxxxxx>
> Cc: Dario Faggioli <raistlin@xxxxxxxx>,       "xen-devel@xxxxxxxxxxxxx"
>       <xen-devel@xxxxxxxxxxxxx>
> Subject: Re: [Xen-devel] Xen credit scheduler question
> Message-ID: <50A53479.5050901@xxxxxxxxxxxxx>
> Content-Type: text/plain; charset="windows-1252"; Format="flowed"
> 
> On 15/11/12 15:43, Michael Palmeter wrote:
>> 
>> Hi all (and Mr. Dunlap in particular),
>> 
> 
> Haha -- please don't call me "Mr"; I prefer "George", but if you want a 
> title, use "Dr" (since I have a PhD). :-)
> 
>> Example scenario:
>> 
>>  * Server hardware: 2 sockets, 8-cores per socket, 2 hardware threads
>>    per core (total of 32 hardware threads)
>>  * Test VM: a single virtual machine with a single vCPU, weight=256
>>    and cap=100%
>> 
>> In this scenario, from what I understand, I should be able to load the 
>> Test VM with traffic to a maximum of approximately 1/32 of the 
>> aggregate compute capacity of the server.  The total CPU utilization 
>> of the server hardware should be approximately 3.4%, plus the overhead 
>> of dom0 (say 1-2%).  The credits available to any vCPU capped at 100% 
>> should be equal to 1/32 of the aggregate compute available for the 
>> whole server, correct?
>> 
> 
> I think to really be precise, you should say, "1/32nd of the logical cpu 
> time available", where "logical cpu time" simply means, "time processing 
> on one logical CPU".  At the moment, that is all that either the credit1 
> or credit2 scheduler looks at.
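
To put numbers on the example configuration: 2 sockets x 8 cores x 2 threads 
= 32 logical CPUs, so a single vCPU capped at 100 (100% of one logical CPU) 
can consume at most 1/32 of the host's total logical CPU time, i.e. roughly 
3.1% before adding dom0 overhead.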
> 
> As I'm sure you're aware, not all "logical cpu time" is equal.  If one 
> thread of a hyperthread pair is running while the other is idle, it will get 
> significantly higher performance than if the other thread is busy.  How 
> much is highly unpredictable, and depends very much on exactly what 
> units are shared with the other hyperthread, and the workload running on 
> each unit.  But even when both threads are busy, it should (in theory) 
> be rare for both threads to be held to a throughput of only 50%; the whole idea of 
> HT is that threads typically get 70-80% of the full performance of the 
> core (so the overall throughput is increased).
> 
> But of course, while this is particularly extreme in the case of 
> hyperthreads, it's also true on a smaller scale even without that -- 
> cores share caches, NUMA nodes share memory bandwidth, and so on. No 
> attempt is made to compensate VMs for cache misses or extra memory 
> latency due to sharing either. :-)
> 
>> Put simply, is there a way to constrain a VM with 1 vCPU to consume no 
>> more than 0.5 of a physical core (hyper-threaded) on the server 
>> hardware mentioned below? Does the cap help in that respect?
>> 
> 
> You can use "cap" to make the VM in question get 50% of logical vcpu 
> time, which on an idle system will give it 0.5 of the capacity of a 
> physical core (if we don't consider Intel's Turbo Boost technology).  
> But if the system becomes busy, it will get less than 0.5 of the 
> processing capacity of a physical core.
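
For illustration, here is a minimal sketch of setting such a cap from a 
toolstack via libxc -- based on the credit-scheduler call as I remember it, so 
verify the exact signature against your xenctrl.h:

    /* Sketch: weight 256, cap 50% of one logical CPU, via libxc.  Based on
     * the credit-scheduler call as I remember it -- verify the signature
     * against your xenctrl.h before relying on it. */
    #include <xenctrl.h>

    int cap_domain_at_half_a_cpu(uint32_t domid)
    {
        xc_interface *xch = xc_interface_open(NULL, NULL, 0);
        struct xen_domctl_sched_credit sdom = {
            .weight = 256,
            .cap    = 50,   /* percent of one logical CPU; 0 = uncapped */
        };
        int rc = xc_sched_credit_domain_set(xch, domid, &sdom);

        xc_interface_close(xch);
        return rc;
    }

The xl equivalent should be something like "xl sched-credit -d <domain> -w 256 -c 50".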
> 
>> I have been struggling to understand how the scheduler can deal with 
>> the uncertainty that hyperthreading introduces, however.  I know this 
>> is an issue that you are tackling in the credit2 scheduler, but I 
>> would like to know what your thoughts are on this problem (if you are 
>> able to share).  Any insight or assistance you could offer would be 
>> greatly appreciated.
>> 
> 
> At the moment it does not attempt to; the only thing it does is try not 
> to schedule vcpus onto two hyperthreads that share a core if there is an idle 
> core.  But if there are more active vcpus than cores, then some will 
> share; and the ones that share a core with another vcpu will be charged 
> the same as the ones that have the core all to themselves.
> 
> Could you explain why your question is important to you -- i.e., 
> what are you trying to accomplish?  It sounds a bit like you're more 
> concerned with accuracy in reporting, and control of resources, rather 
> than fairness, for instance.
> 
>  -George
> 
> ------------------------------
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxx
> http://lists.xen.org/xen-devel
> 
> 
> End of Xen-devel Digest, Vol 93, Issue 157
> ******************************************


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

