Re: Issue: Networking performance in Xen VM on Arm64
On Fri, Oct 21, 2022 at 02:14:04PM -0700, Stefano Stabellini wrote:

[...]

> > > # domU side
> > > xen/arch/arm/vgic-v2.c:vgic_v2_to_sgi
> > > xen/arch/arm/vgic.c:vgic_to_sgi
> > > xen/arch/arm/vgic.c:vgic_inject_irq
> > > xen/arch/arm/vgic.c:vcpu_kick
> > > xen/arch/arm/gic-v2.c:gicv2_send_SGI
> > >
> > > # dom0 side
> > > xen/arch/arm/gic.c:do_sgi
> > > xen/arch/arm/traps.c:leave_hypervisor_to_guest
> > >
> > > It would be good to understand why sometimes it takes ~10us and some
> > > other times it takes ~540us
> >
> > Some updates on why it takes several hundred us for the Xen backend
> > driver to respond to the interrupt.  The short answer is that the vcpu
> > running the Xen backend driver needs to switch context, even though I
> > have set the options "sched=null vwfi=native" in the Xen command line.
> >
> > Please see the detailed logs below for how things happen.
> >
> > Let's take the timestamp 3842008681 as the start point; it's the time
> > when the Xen frontend driver sends out the notification
> > (xennet_notify_tx_irq).  At the timestamp 3842008885 the Xen
> > hypervisor injects the interrupt (about ~8us from the start point).
> >
> > Then at the timestamp 3842008935 it invokes vcpu_kick() to kick the
> > virtual CPU running the Xen backend driver; you can see VCPU_PROCESSOR
> > is 11 and VCPU_ID is 9 for dom0, and the duration is 10.16us from the
> > start point.
> >
> > The key point is that at this moment the vcpu's is_running is 0, which
> > is different from the case without long latency, where the vcpu's
> > is_running is 1.  IIUC, the Xen hypervisor needs time to restore the
> > vcpu's context, thus we can see that virtual CPU 9 in Dom0 only starts
> > to run at the timestamp 3842016505.
>
> is_running should be always 1 with the NULL scheduler and vwfi=native.
> That is because VMs are never descheduled. Please double-check.
>
> If you are really running with the NULL scheduler, then I would
> investigate why the vCPU has is_running == 0 because it should not
> happen.

A correction for this: it's my bad that I didn't really enable the NULL
scheduler in my code base.  After I enabled the NULL scheduler, the
latency caused by context switching is gone:

8963 pub-338 [002] 217.777652: bprint: xennet_tx_setup_grant: id=60 ref=1340 offset=2 len=1514 TSC: 7892178799
8964 pub-338 [002] 217.777662: bprint: xennet_tx_setup_grant: id=82 ref=1362 offset=2050 len=1006 TSC: 7892179043
8965 ksoftirqd/12-75 [012] 255.466914: bprint: xenvif_tx_build_gops.constprop.0: id=60 ref=1340 offset=2 len=1514 TSC: 7892179731
8966 ksoftirqd/12-75 [012] 255.466915: bprint: xenvif_tx_build_gops.constprop.0: id=82 ref=1362 offset=2050 len=1006 TSC: 7892179761
8967 pub-338 [002] 217.778057: bprint: xennet_tx_setup_grant: id=60 ref=1340 offset=2050 len=1514 TSC: 7892188930
8968 pub-338 [002] 217.778072: bprint: xennet_tx_setup_grant: id=53 ref=1333 offset=2 len=1514 TSC: 7892189293
8969 containerd-2965 [012] 255.467304: bprint: xenvif_tx_build_gops.constprop.0: id=60 ref=1340 offset=2050 len=1514 TSC: 7892189479
8970 containerd-2965 [012] 255.467306: bprint: xenvif_tx_build_gops.constprop.0: id=53 ref=1333 offset=2 len=1514 TSC: 7892189533

So the xennet (Xen net frontend driver) and xenvif (net backend driver)
work in parallel.

Please note, I didn't see a networking performance improvement after
changing to the NULL scheduler.
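For reference on the is_running point above, the logic that makes it
matter is in Xen's vcpu_kick(); a simplified sketch of the code in
xen/arch/arm/vgic.c (trimmed here, so details may differ across Xen
versions) looks roughly like this:

void vcpu_kick(struct vcpu *vcpu)
{
    /* Snapshot before vcpu_unblock() can change the scheduling state. */
    bool running = vcpu->is_running;

    vcpu_unblock(vcpu);

    /*
     * Fast path: the target vCPU is already on a pCPU, so a single SGI
     * to that pCPU is enough for it to notice the injected interrupt.
     * Otherwise the vCPU first has to be woken up and context-switched
     * in, which is where the several-hundred-us latency came from
     * before the NULL scheduler was really enabled.
     */
    if ( running && vcpu != current )
        smp_send_event_check_mask(cpumask_of(vcpu->processor));
}

With sched=null and vwfi=native the backend's vCPU never leaves its
pCPU, so the fast path should always be taken, which matches Stefano's
remark that is_running should always be 1 in that configuration.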
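On the id/ref/offset/len fields in the trace above: they come from the
frontend's per-chunk grant setup.  Below is a rough sketch of what
xennet_tx_setup_grant() does in drivers/net/xen-netfront.c; it is
paraphrased, the helper name and the exact struct fields around the
queue are illustrative, and only the grant-table and ring calls are the
real Linux APIs:

#include <xen/grant_table.h>          /* gnttab_* APIs */
#include <xen/interface/io/netif.h>   /* struct xen_netif_tx_request */

/* Illustrative helper; the real code lives in xennet_tx_setup_grant(). */
static void setup_tx_grant(struct netfront_queue *queue, unsigned long gfn,
                           unsigned int offset, unsigned int len,
                           unsigned int id)
{
    struct xen_netif_tx_request *tx;
    grant_ref_t ref;

    /* Next free request slot in the shared TX ring. */
    tx = RING_GET_REQUEST(&queue->tx, queue->tx.req_prod_pvt++);

    /*
     * Claim a free grant reference and grant the backend domain
     * read-only access to the page holding this chunk of the skb.
     */
    ref = gnttab_claim_grant_reference(&queue->gref_tx_head);
    gnttab_grant_foreign_access_ref(ref, queue->info->xbdev->otherend_id,
                                    gfn, 1 /* read-only */);

    /*
     * These are the id/ref/offset/len values printed in the trace; the
     * backend sees the same id and gref again in xenvif_tx_build_gops()
     * when it consumes this request, which is what lets us match
     * frontend and backend trace records per data block.
     */
    tx->id     = id;
    tx->gref   = ref;
    tx->offset = offset;
    tx->size   = len;
}

Matching the (id, ref) pairs this way across the two trace logs is what
gives the per-block duration figure discussed below.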
Now I will compare the duration for the two directions: one direction is
sending data from xennet to xenvif, and the other is the reverse
direction.  It's very likely the two directions show a significant
difference when sending data through grant tables; you can see in the
log above that it takes 20~30us to transmit a data block (we can use the
id number and the grant table's ref number to match a data block between
the xennet and xenvif drivers).

> Now regarding the results, I can see the timestamp 3842008681 for
> xennet_notify_tx_irq, 3842008885 for vgic_inject_irq, and 3842008935
> for vcpu_kick. Where is the corresponding TSC for the domain receiving
> the notification?
>
> Also for the other case, starting at 3842016505, can you please
> highlight the timestamp for vgic_inject_irq, vcpu_kick, and also the
> one for the domain receiving the notification?
>
> The most interesting timestamps would be the timestamp for vcpu_kick in
> the "notification sending domain" [a], the timestamp for receiving the
> interrupt in Xen on the pCPU for the "notification receiving domain"
> [b], and the timestamp for the "notification receiving domain" getting
> the notification [c].
>
> If really the context switch is the issue, then the interesting latency
> would be between [a] and [b].

Understood.  I agree that I didn't dig into more details; the main
reason is that the Xen dmesg buffer is fragile after adding more logs.
E.g. after I added a log in the function gicv3_send_sgi(), Xen got stuck
during the boot phase, and adding logs in leave_hypervisor_to_guest()
introduces a huge amount of output (so I had to trace only the first 16
CPUs to mitigate the log flood).

I think it would be better to enable xentrace for the profiling at my
side.  If I have any further data, I will share it back.

[...]

> > Another experiment is to append the option "dom0_vcpus_pin" to the
> > Xen command line; this pins the virtual CPUs on physical CPUs and so
> > can avoid context switching.  With this option, I can see the
> > throughput is improved from 128.26 Mbits/s to 180.3 Mbits/s.  But
> > this result is far from the ideal throughput (800+ Mbits/s).
>
> Actually this should not happen: like before, dom0_vcpus_pin does
> nothing with the NULL scheduler and only changes things for credit. Are
> you really sure you are running the tests with sched=null vwfi=native?

You are right.  After I changed to the NULL scheduler, dom0_vcpus_pin
doesn't improve the performance; sorry for the noise and please ignore
it.

Thanks,
Leo