
Re: Issue: Networking performance in Xen VM on Arm64



On Mon, 17 Oct 2022, Leo Yan wrote:
> Hi Stefano,
> 
> Sorry for the late response.  Please see my comments below.
> 
> On Tue, Oct 11, 2022 at 02:47:00PM -0700, Stefano Stabellini wrote:
> > On Tue, 11 Oct 2022, Leo Yan wrote:
> > > > > The second question is how to mitigate the long latency when sending
> > > > > data from DomU.  A possible solution is for the Xen network frontend
> > > > > driver to copy the skb into an intermediate (bounce) buffer, just like
> > > > > what is done in the Xen net backend driver with gnttab_batch_copy();
> > > > > this way the frontend driver doesn't need to wait for the backend
> > > > > driver's response and can return directly.
> > > > 
> > > > About this, I am not super familiar with drivers/net/xen-netfront.c but
> > > > I take it you are referring to xennet_tx_buf_gc? Is that the function
> > > > that is causing xennet_start_xmit to wait?
> > > 
> > > No.  We can summarize the whole flow in xen-netfront.c as:
> > > 
> > >   xennet_start_xmit()
> > >              ----------> notify Xen Dom0 to process skb
> > >              <---------  Dom0 copies skb and notify back to DomU
> > >   xennet_tx_buf_gc()
> > >   softirq/NET_TX : __kfree_skb()
> > 
> > Let me premise again that I am not an expert in PV netfront/netback.
> > However, I think the above is only true if DomU and Dom0 are running on
> > the same physical CPU. If you use sched=null as I suggested above,
> > you'll get domU and dom0 running at the same time on different physical
> > CPUs and the workflow doesn't work as described.
> > 
> > It should be:
> > 
> > CPU1: xennet_start_xmit()             ||  CPU2: doing something else
> > CPU1: notify Xen Dom0 to process skb  ||  CPU2: receive notification
> > CPU1: return from xennet_start_xmit() ||  CPU2: Dom0 copies skb
> > CPU1: do something else               ||  CPU2: notify back to DomU
> > CPU1: receive irq, xennet_tx_buf_gc() ||  CPU2: do something else
> 
> Yes, I agree this is the ideal case.  I tried setting the option
> "sched=null", but I can still observe latency in the second step: when CPU1
> notifies Xen Dom0, it takes 500us+ for Dom0 to receive the notification.
> 
> Please see below detailed log:
> 
> DomU log:
> 
> 4989078512              pub-321   [003]   101.150966: bprint: xennet_start_xmit: xennet_start_xmit: TSC: 4989078512
> 4989078573              pub-321   [003]   101.150968: bprint: xennet_tx_setup_grant: id=24 ref=1816 offset=2 len=1514 TSC: 4989078573
> 4989078592              pub-321   [003]   101.150969: bprint: xennet_start_xmit: xennet_notify_tx_irq: TSC: 4989078592
> 
> Dom0 log:
> 
> 4989092169           <idle>-0     [013]   140.121667: bprint: xenvif_tx_interrupt: xenvif_tx_interrupt: TSC: 4989092169
> 4989092331           <idle>-0     [013]   140.121673: bprint: xenvif_tx_build_gops.constprop.0: id=24 ref=1816 offset=2 len=1514 TSC: 4989092331
> 
> We can see that DomU sends the notification at timestamp (raw counter)
> 4989078592 and Dom0 receives the interrupt at timestamp 4989092169.
> Since Dom0 and DomU use the same time counter and the counter frequency
> is 25MHz, we can compute the delta (in microseconds):
> 
>     (4989092169 - 4989078592) / 25000000 * 1000 * 1000
>   = 543us
> 
> This means it takes 543us for Dom0 to receive the notification.  You can
> see that DomU runs on CPU3 and Dom0 runs on CPU13, so there should be no
> contention for CPU resources.  It seems to me the Xen hypervisor likely
> takes a long time to deliver the interrupt; note that it doesn't take
> this long for every skb transfer, sometimes the time to respond to a
> notification is short (about ~10us).

Good find. I think this is worth investigating further. Do you have
vwfi=native in your Xen command line as well?
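
For reference, the relevant fragment of the Xen command line would look
something like this (just an illustrative example; any other options depend
on your platform):

    sched=null vwfi=native

sched=null statically assigns one vCPU per physical CPU so there is no
scheduler-induced delay, and vwfi=native lets the guest idle with wfi/wfe
without trapping into Xen, which usually reduces interrupt delivery latency.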

After that, I would also add a printk with a timestamp in Xen (see the sketch
further below). The event channel notification code path is the following:

# domU side
xen/arch/arm/vgic-v2.c:vgic_v2_to_sgi
xen/arch/arm/vgic.c:vgic_to_sgi
xen/arch/arm/vgic.c:vgic_inject_irq
xen/arch/arm/vgic.c:vcpu_kick
xen/arch/arm/gic-v2.c:gicv2_send_SGI

# dom0 side
xen/arch/arm/gic.c:do_sgi
xen/arch/arm/traps.c:leave_hypervisor_to_guest

It would be good to understand why it sometimes takes ~10us and other times
~540us.
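
Something along these lines would give us Xen-side timestamps to compare (an
untested sketch; the probe point and the variables used are just an example,
assuming 'd' and 'virq' are in scope, e.g. inside vgic_inject_irq()):

    /* Untested sketch: log a timestamp when the event channel's vIRQ is
     * injected into the guest.  NOW() is Xen's monotonic time in ns since
     * boot, so it is not in the same unit as the guests' raw 25MHz counter
     * traces, but it is enough to measure the delay spent inside Xen.
     */
    printk("vgic_inject_irq: d%d virq %u t=%lld ns\n",
           d->domain_id, virq, (long long)NOW());

A similar printk on the dom0 side (e.g. in leave_hypervisor_to_guest) would
show how much of the ~540us is spent before the vCPU is actually resumed.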


> > > > I didn't think that waiting for the backend is actually required. I
> > > > mean, in theory xennet_start_xmit could return without calling
> > > > xennet_tx_buf_gc, it is just an optimization. But I am not sure about
> > > > this.
> > > 
> > > The function xennet_start_xmit() will not wait and returns directly,
> > > but if we review the whole flow we can see the skb is not freed until
> > > the NET_TX softirq.
> > 
> > Is it an issue that the skb is not freed until later? Is that affecting
> > the latency results? It shouldn't, right?
> 
> I did an extra experiment in the Xen net frontend driver: I enabled the
> flag "info->bounce = true" so the frontend driver uses a bounce buffer to
> store the data and releases the skb back to the network core layer
> immediately.
> 
> This boosts throughput significantly: the netperf result improves from
> 107.73 Mbits/s to 300+ Mbits/s.
> 
> > What matters is when dom0 is
> > getting those packets on the physical network interface and that happens
> > before the skb is freed. I am just trying to figure out if we are
> > focusing on the right problem.
> 
> Good point.  I agree that releasing the skb earlier can only benefit
> throughput, but we still cannot resolve the latency issue if Dom0 takes a
> long time to relay packets to the physical network interface.
> 
> > > In this whole flow, DomU and Dom0 need to work together (which
> > > includes two context switches) to process the skb.
> > 
> > There are not necessarily 2 context switches as things should run in
> > parallel.
> > 
> > > Here I mean the optimization is to allow Dom0 and DomU to work in
> > > parallel.  It could be something like below; the key point is that
> > > DomU doesn't need to wait for Dom0's notification.
> > 
> > I think it is already the case that domU doesn't need to wait for dom0's
> > notification?
> 
> Agree.  domU doesn't need to wait for dom0's notification until it runs
> out of the skbs that can be allocated by the network core layer.  This is
> why I can also tweak the core layer's buffer size parameters (see
> /proc/sys/net/core/wmem_default and /proc/sys/net/core/wmem_max).
> 
> > It is true that domU is waiting for dom0's notification to
> > free the skb but that shouldn't affect latency?
> 
> Yeah.  I will focus on the issue elaborated above, that Dom0 takes a long
> time to receive the notification.
> 
> Will keep you posted if I have any new findings.

Thanks, this is very interesting



 

