Re: Issue: Networking performance in Xen VM on Arm64
On Fri, Oct 21, 2022 at 02:14:04PM -0700, Stefano Stabellini wrote:

[...]

> > > # domU side
> > > xen/arch/arm/vgic-v2.c:vgic_v2_to_sgi
> > > xen/arch/arm/vgic.c:vgic_to_sgi
> > > xen/arch/arm/vgic.c:vgic_inject_irq
> > > xen/arch/arm/vgic.c:vcpu_kick
> > > xen/arch/arm/gic-v2.c:gicv2_send_SGI
> > >
> > > # dom0 side
> > > xen/arch/arm/gic.c:do_sgi
> > > xen/arch/arm/traps.c:leave_hypervisor_to_guest
> > >
> > > It would be good to understand why sometimes it takes ~10us and some
> > > other times it takes ~540us
> >
> > Some updates on why it takes several hundred us for the Xen backend
> > driver to respond to the interrupt.  The short answer is that the vcpu
> > running the Xen backend driver needs to switch context, even though I
> > have set the options "sched=null vwfi=native" in the Xen command line.
> >
> > Please see the detailed logs below for how things happen.
> >
> > Let's take the timestamp 3842008681 as the start point; it's the time
> > when the Xen frontend driver sends out the notification
> > (xennet_notify_tx_irq).  At the timestamp 3842008885 the Xen
> > hypervisor injects the interrupt (about ~8us from the start point).
> >
> > Then at the timestamp 3842008935 it invokes vcpu_kick() to kick the
> > virtual CPU running the Xen backend driver; you can see VCPU_PROCESSOR
> > is 11 and VCPU_ID is 9 for dom0, and the duration is 10.16us from the
> > start point.
> >
> > The key point is that at this moment the vcpu's is_running is 0, which
> > is different from the case without long latency, where the vcpu's
> > is_running is 1.  IIUC, the Xen hypervisor needs time to restore the
> > vcpu's context, thus we can see that virtual CPU 9 in Dom0 only starts
> > to run at the timestamp 3842016505.
>
> is_running should be always 1 with the NULL scheduler and vwfi=native.
> That is because VMs are never descheduled. Please double-check.
>
> If you are really running with the NULL scheduler, then I would
> investigate why the vCPU has is_running == 0 because it should not
> happen.

A correction for this: it's my bad that I didn't really enable the NULL
scheduler in my code base.  After I enabled the NULL scheduler, the
latency caused by context switching is gone:

8963 pub-338 [002] 217.777652: bprint: xennet_tx_setup_grant: id=60 ref=1340 offset=2 len=1514 TSC: 7892178799
8964 pub-338 [002] 217.777662: bprint: xennet_tx_setup_grant: id=82 ref=1362 offset=2050 len=1006 TSC: 7892179043
8965 ksoftirqd/12-75 [012] 255.466914: bprint: xenvif_tx_build_gops.constprop.0: id=60 ref=1340 offset=2 len=1514 TSC: 7892179731
8966 ksoftirqd/12-75 [012] 255.466915: bprint: xenvif_tx_build_gops.constprop.0: id=82 ref=1362 offset=2050 len=1006 TSC: 7892179761
8967 pub-338 [002] 217.778057: bprint: xennet_tx_setup_grant: id=60 ref=1340 offset=2050 len=1514 TSC: 7892188930
8968 pub-338 [002] 217.778072: bprint: xennet_tx_setup_grant: id=53 ref=1333 offset=2 len=1514 TSC: 7892189293
8969 containerd-2965 [012] 255.467304: bprint: xenvif_tx_build_gops.constprop.0: id=60 ref=1340 offset=2050 len=1514 TSC: 7892189479
8970 containerd-2965 [012] 255.467306: bprint: xenvif_tx_build_gops.constprop.0: id=53 ref=1333 offset=2 len=1514 TSC: 7892189533

So the xennet (Xen net frontend driver) and xenvif (net backend driver)
work in parallel.

Please note, I didn't see a networking performance improvement after
changing to the NULL scheduler.
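For reference on the is_running point above, the logic that makes it
matter is in Xen's vcpu_kick(); a simplified sketch of the code in
xen/arch/arm/vgic.c (trimmed here, so details may differ across Xen
versions) looks roughly like this:

void vcpu_kick(struct vcpu *vcpu)
{
    /* Snapshot before vcpu_unblock() can change the scheduling state. */
    bool running = vcpu->is_running;

    vcpu_unblock(vcpu);

    /*
     * Fast path: the target vCPU is already on a pCPU, so a single SGI
     * to that pCPU is enough for it to notice the injected interrupt.
     * Otherwise the vCPU first has to be woken up and context-switched
     * in, which is where the several-hundred-us latency came from
     * before the NULL scheduler was really enabled.
     */
    if ( running && vcpu != current )
        smp_send_event_check_mask(cpumask_of(vcpu->processor));
}

With sched=null and vwfi=native the backend's vCPU never leaves its
pCPU, so the fast path should always be taken, which matches Stefano's
remark that is_running should always be 1 in that configuration.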
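On the id/ref/offset/len fields in the trace above: they come from the
frontend's per-chunk grant setup.  Below is a rough sketch of what
xennet_tx_setup_grant() does in drivers/net/xen-netfront.c; it is
paraphrased, the helper name and the exact struct fields around the
queue are illustrative, and only the grant-table and ring calls are the
real Linux APIs:

#include <xen/grant_table.h>          /* gnttab_* APIs */
#include <xen/interface/io/netif.h>   /* struct xen_netif_tx_request */

/* Illustrative helper; the real code lives in xennet_tx_setup_grant(). */
static void setup_tx_grant(struct netfront_queue *queue, unsigned long gfn,
                           unsigned int offset, unsigned int len,
                           unsigned int id)
{
    struct xen_netif_tx_request *tx;
    grant_ref_t ref;

    /* Next free request slot in the shared TX ring. */
    tx = RING_GET_REQUEST(&queue->tx, queue->tx.req_prod_pvt++);

    /*
     * Claim a free grant reference and grant the backend domain
     * read-only access to the page holding this chunk of the skb.
     */
    ref = gnttab_claim_grant_reference(&queue->gref_tx_head);
    gnttab_grant_foreign_access_ref(ref, queue->info->xbdev->otherend_id,
                                    gfn, 1 /* read-only */);

    /*
     * These are the id/ref/offset/len values printed in the trace; the
     * backend sees the same id and gref again in xenvif_tx_build_gops()
     * when it consumes this request, which is what lets us match
     * frontend and backend trace records per data block.
     */
    tx->id     = id;
    tx->gref   = ref;
    tx->offset = offset;
    tx->size   = len;
}

Matching the (id, ref) pairs this way across the two trace logs is what
gives the per-block duration figure discussed below.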
Now I will compare the duration for the two directions: one direction is
sending data from xennet to xenvif, and the other is the reverse
direction.  It's very likely the two directions show a significant
difference when sending data through grant tables; you can see in the
log above that it takes 20~30us to transmit a data block (we can use the
id number and the grant table's ref number to match a data block between
the xennet and xenvif drivers).

> Now regarding the results, I can see the timestamp 3842008681 for
> xennet_notify_tx_irq, 3842008885 for vgic_inject_irq, and 3842008935
> for vcpu_kick. Where is the corresponding TSC for the domain receiving
> the notification?
>
> Also for the other case, starting at 3842016505, can you please
> highlight the timestamp for vgic_inject_irq, vcpu_kick, and also the
> one for the domain receiving the notification?
>
> The most interesting timestamps would be the timestamp for vcpu_kick in
> the "notification sending domain" [a], the timestamp for receiving the
> interrupt in Xen on the pCPU for the "notification receiving domain"
> [b], and the timestamp for the "notification receiving domain" getting
> the notification [c].
>
> If really the context switch is the issue, then the interesting latency
> would be between [a] and [b].

Understood.  I agree that I didn't dig into more details; the main
reason is that the Xen dmesg buffer is fragile after adding more logs.
E.g. after I added a log in the function gicv3_send_sgi(), Xen got stuck
during the boot phase, and adding logs in leave_hypervisor_to_guest()
introduces a huge amount of output (so I had to trace only the first 16
CPUs to mitigate the log flood).

I think it would be better to enable xentrace for the profiling at my
side.  If I have any further data, I will share it back.

[...]

> > Another experiment is to append the option "dom0_vcpus_pin" to the
> > Xen command line; this pins the virtual CPUs on physical CPUs and so
> > can avoid context switching.  With this option, I can see the
> > throughput is improved from 128.26 Mbits/s to 180.3 Mbits/s.  But
> > this result is far from the ideal throughput (800+ Mbits/s).
>
> Actually this should not happen: like before, dom0_vcpus_pin does
> nothing with the NULL scheduler and only changes things for credit. Are
> you really sure you are running the tests with sched=null vwfi=native?

You are right.  After I changed to the NULL scheduler, dom0_vcpus_pin
doesn't improve the performance; sorry for the noise and please ignore
it.

Thanks,
Leo