Re: Issue: Networking performance in Xen VM on Arm64
On Mon, 24 Oct 2022, Leo Yan wrote:
> > If you are really running with the NULL scheduler, then I would
> > investigate why the vCPU has is_running == 0 because it should not
> > happen.
>
> Correct for this: it's my bad that I didn't really enable the NULL
> scheduler in my code base. After I enabled the NULL scheduler, the
> latency caused by context switching disappeared.
>
>   8963          pub-338  [002]  217.777652: bprint: xennet_tx_setup_grant: id=60 ref=1340 offset=2 len=1514 TSC: 7892178799
>   8964          pub-338  [002]  217.777662: bprint: xennet_tx_setup_grant: id=82 ref=1362 offset=2050 len=1006 TSC: 7892179043
>   8965   ksoftirqd/12-75 [012]  255.466914: bprint: xenvif_tx_build_gops.constprop.0: id=60 ref=1340 offset=2 len=1514 TSC: 7892179731
>   8966   ksoftirqd/12-75 [012]  255.466915: bprint: xenvif_tx_build_gops.constprop.0: id=82 ref=1362 offset=2050 len=1006 TSC: 7892179761
>   8967          pub-338  [002]  217.778057: bprint: xennet_tx_setup_grant: id=60 ref=1340 offset=2050 len=1514 TSC: 7892188930
>   8968          pub-338  [002]  217.778072: bprint: xennet_tx_setup_grant: id=53 ref=1333 offset=2 len=1514 TSC: 7892189293
>   8969   containerd-2965 [012]  255.467304: bprint: xenvif_tx_build_gops.constprop.0: id=60 ref=1340 offset=2050 len=1514 TSC: 7892189479
>   8970   containerd-2965 [012]  255.467306: bprint: xenvif_tx_build_gops.constprop.0: id=53 ref=1333 offset=2 len=1514 TSC: 7892189533

I am having difficulty following the messages. Are the two points [a]
and [b] as described in the previous email shown here?

> So xennet (the Xen net frontend driver) and xenvif (the net backend
> driver) work in parallel. Please note, I didn't see any networking
> performance improvement after switching to the NULL scheduler.
>
> Now I will compare the durations for the two directions: one direction
> is sending data from xennet to xenvif, and the other is the reverse
> direction.
> It's very likely the two directions differ significantly when sending
> data with grant tables; you can see in the above log that it takes
> 20~30us to transmit a data block (we can use the id number and the
> grant table's ref number to match a data block between the xennet
> driver and the xenvif driver).
>
> > Now regarding the results, I can see the timestamp 3842008681 for
> > xennet_notify_tx_irq, 3842008885 for vgic_inject_irq, and 3842008935
> > for vcpu_kick. Where is the corresponding TSC for the domain
> > receiving the notification?
> >
> > Also for the other case, starting at 3842016505, can you please
> > highlight the timestamp for vgic_inject_irq, vcpu_kick, and also the
> > one for the domain receiving the notification?
> >
> > The most interesting timestamps would be the timestamp for vcpu_kick
> > in the "notification sending domain" [a], the timestamp for
> > receiving the interrupt in Xen on the pCPU for the "notification
> > receiving domain" [b], and the timestamp for the "notification
> > receiving domain" getting the notification [c].
> >
> > If context switch really is the issue, then the interesting latency
> > would be between [a] and [b].
>
> Understood. I agree that I didn't go into more detail; the main reason
> is that the Xen dmesg buffer is fragile after adding more logs. For
> example, after I added a log in the function gicv3_send_sgi(), Xen got
> stuck during the booting phase, and adding logs in
> leave_hypervisor_to_guest() introduced a huge volume of logs (so I
> need to trace only the first 16 CPUs to mitigate the log flood).
>
> I think it would be better to enable xentrace for my profiling on my
> side. If I have any further data, I will share it back.

Looking forward to it. Without more details it is impossible to
identify the source of the problem and fix it.
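As an aside, the (id, ref) matching described above can be scripted rather
than done by eye. The following is a minimal sketch (not from the thread):
it assumes trace lines in the ftrace bprint format shown in the log, and it
assumes a TSC frequency of 25 MHz (on Arm64 the counter frequency comes from
CNTFRQ_EL0 and varies by board, so adjust TSC_HZ accordingly). It pairs each
frontend xennet_tx_setup_grant event with the next backend
xenvif_tx_build_gops event carrying the same id and grant ref, and reports
the TSC delta in microseconds.

```python
import re
from collections import defaultdict, deque

# Assumed TSC frequency in Hz (Arm generic timer, CNTFRQ_EL0).
# 25 MHz is only an example value -- adjust for the actual board.
TSC_HZ = 25_000_000

LINE_RE = re.compile(
    r'(?P<fn>xennet_tx_setup_grant|xenvif_tx_build_gops\S*):'
    r' id=(?P<id>\d+) ref=(?P<ref>\d+).*TSC: (?P<tsc>\d+)')

def grant_latencies(lines):
    """Pair each frontend grant-setup event with the matching backend
    build_gops event (same id and grant ref, in arrival order) and
    return the TSC deltas converted to microseconds."""
    pending = defaultdict(deque)   # (id, ref) -> queued frontend TSCs
    deltas = []
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        key = (m['id'], m['ref'])
        tsc = int(m['tsc'])
        if m['fn'].startswith('xennet'):    # frontend: remember the TSC
            pending[key].append(tsc)
        elif pending[key]:                  # backend: pop the oldest match
            delta = tsc - pending[key].popleft()
            deltas.append(delta * 1e6 / TSC_HZ)
    return deltas

trace = """\
8963 pub-338 [002] 217.777652: bprint: xennet_tx_setup_grant: id=60 ref=1340 offset=2 len=1514 TSC: 7892178799
8965 ksoftirqd/12-75 [012] 255.466914: bprint: xenvif_tx_build_gops.constprop.0: id=60 ref=1340 offset=2 len=1514 TSC: 7892179731
""".splitlines()

# Prints one latency in us (932 TSC ticks under the assumed 25 MHz clock).
print(grant_latencies(trace))
```

Because grant ids and refs are reused across the ring, the per-key queue
matches repeated (id, ref) pairs in order instead of overwriting them.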