Re: IRQ latency measurements in hypervisor

Stefano Stabellini writes:

> On Fri, 15 Jan 2021, Volodymyr Babchuk wrote:

[...]

>> >> [   83.873162] rt_eval_tmu e6fc0000.tmu: Mean: 44 (1320 ns) stddev: 8 (240 ns)
>> >> [   94.136632] rt_eval_tmu e6fc0000.tmu: Mean: 44 (1320 ns) stddev: 8 (240 ns)
>> >> [  104.400098] rt_eval_tmu e6fc0000.tmu: Mean: 50 (1500 ns) stddev: 129 (3870 ns)
>> >> [  114.663828] rt_eval_tmu e6fc0000.tmu: Mean: 44 (1320 ns) stddev: 8 (240 ns)
>> >> [  124.927296] rt_eval_tmu e6fc0000.tmu: Mean: 56 (1680 ns) stddev: 183 (5490 ns)
>> >>
>> >> This is baremetal Linux. And here is Dom0:
>> >>
>> >> [  237.431003] rt_eval_tmu e6fc0000.tmu: Mean: 306 (9180 ns) stddev: 25 (750 ns)
>> >> [  247.694506] rt_eval_tmu e6fc0000.tmu: Mean: 302 (9060 ns) stddev: 17 (510 ns)
>> >>
>> >> The driver outputs both the raw timer value (e.g. 44) and the same
>> >> value scaled to nanoseconds (e.g. 1320 ns). As you can see, the
>> >> baremetal setup is much faster. But experiments showed that Linux does
>> >> not provide consistent values, even when running in baremetal mode.
>> >> You can see sporadic spikes in the "stddev" values.
>> >
>> > So baremetal IRQ latency is 1320-1680ns and Linux IRQ latency is
>> > 9060-9180ns. I am not surprised that Linux results are inconsistent,
>> > but I have a couple of observations:
>> >
>> > - 9us is high for Linux
>> >   If the system is idle, the latency should be lower, around 2-3us. I
>> >   imagine you are actually running some sort of interference from dom0?
>> >   Or using RTDS and descheduling vCPUs?
>>
>> 9us was in the idle state. Interestingly enough, I got a latency of 3us
>> while Dom0 was doing some CPU-intensive tasks. So, under load the latency
>> is lower than in the idle state. I didn't investigate this, so I can't
>> tell you what causes this behavior.
>
> Did you use vwfi=native? You should definitely be able to see ~3us
> without interference and with 1vCPU <-> 1pCPU

With vwfi=native things in Dom0 got better, but it is still not 3us:

[   41.563904] rt_eval_tmu e6fc0000.tmu: Mean: 173 (5190 ns) stddev: 15 (450 ns) min: 125 (3750 ns) max: 276102 (8283060 ns)
[   51.820403] rt_eval_tmu e6fc0000.tmu: Mean: 172 (5160 ns) stddev: 8 (240 ns) min: 125 (3750 ns) max: 276102 (8283060 ns)
[   62.076898] rt_eval_tmu e6fc0000.tmu: Mean: 172 (5160 ns) stddev: 9 (270 ns) min: 125 (3750 ns) max: 276102 (8283060 ns)
[   72.333377] rt_eval_tmu e6fc0000.tmu: Mean: 171 (5130 ns) stddev: 9 (270 ns) min: 125 (3750 ns) max: 276102 (8283060 ns)
[   82.589852] rt_eval_tmu e6fc0000.tmu: Mean: 172 (5160 ns) stddev: 10 (300 ns) min: 125 (3750 ns) max: 276102 (8283060 ns)

And DomU (as Julien suggested) is extremely unhappy:

Mean: 282 (8460 ns) stddev: 412 (12360 ns) above thr: 4% [109 (3270 ns) - 2245 (67350 ns)] global [108 (3240 ns) 74261 (2227830 ns)]
Mean: 958 (28740 ns) stddev: 860 (25800 ns) above thr: 50% [120 (3600 ns) - 2253 (67590 ns)] global [108 (3240 ns) 74261 (2227830 ns)]
Mean: 713 (21390 ns) stddev: 533 (15990 ns) above thr: 37% [114 (3420 ns) - 2186 (65580 ns)] global [108 (3240 ns) 74261 (2227830 ns)]
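
An aside to make the scaling explicit: the counter runs at 33.28 MHz
(the Zephyr output below reports the same), so one tick is ~30 ns, and
the driver appears to simply multiply raw tick counts by the rounded
period -- every printed ns value above is an exact multiple of 30. A
minimal sketch, with invented names; this is not the actual rt_eval_tmu
code:

  #include <stdint.h>

  #define TMU_FREQ_HZ 33280000u                    /* "Counter freq is 33280000 Hz" */
  #define TICK_NS     (1000000000u / TMU_FREQ_HZ)  /* integer division -> 30 */

  static inline uint32_t ticks_to_ns(uint32_t ticks)
  {
          /* e.g. 44 ticks -> 1320 ns, as in the dmesg lines above */
          return ticks * TICK_NS;
  }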

>
>> > - the stddev of 3870ns is high for baremetal
>> >   In the baremetal case the stddev should be minimal if the system is
>> >   idle.
>>
>> This is what I expected too. But nevertheless there were spikes. I
>> didn't investigate this either, so I can only speculate: my rootfs is
>> on NFS, so maybe the network driver caused these spikes.
>
> Yeah, maybe it would be best to reduce the sources of possible spikes
> and get rid of NFS.

I agree there, but I am working remotely, so NFS is easier. I'll try
with eMMC later.

>
>> >
>> >
>> >> So my next step was to use a proper RT OS to do the measurements. I
>> >> chose Zephyr. My PR that adds Xen support to Zephyr can be found at
>> >> [1]. Support for RCAR Gen3 is not upstreamed, but is present on my
>> >> GitHub ([2]). At [3] you can find the source code for the application
>> >> that does the latency measurements. It behaves exactly like my Linux
>> >> driver, but provides a bit more information:
>> >>
>> >> *** Booting Zephyr OS build zephyr-v2.4.0-2750-g0f2c858a39fc  ***
>> >> RT Eval app
>> >>
>> >> Counter freq is 33280000 Hz. Period is 30 ns
>> >> Set alarm in 0 sec (332800 ticks)
>> >> Mean: 600 (18000 ns) stddev: 3737 (112110 ns) above thr: 0% [265 (7950 ns) - 66955 (2008650 ns)] global [265 (7950 ns) 66955 (2008650 ns)]
>> >> Mean: 388 (11640 ns) stddev: 2059 (61770 ns) above thr: 0% [266 (7980 ns) - 58830 (1764900 ns)] global [265 (7950 ns) 66955 (2008650 ns)]
>> >> Mean: 358 (10740 ns) stddev: 1796 (53880 ns) above thr: 0% [265 (7950 ns) - 57780 (1733400 ns)] global [265 (7950 ns) 66955 (2008650 ns)]
>>
>> This is Zephyr running as DomU.
>>
>> >> ...
>> >>
>> >> So there you can see: mean time, standard deviation, % of interrupts
>> >> that were processed above the 30us threshold, minimum and maximum
>> >> latency values for the current 10s run, and the global minimum and
>> >> maximum.
>> >>
>> >> Zephyr running as baremetal showed very stable results (this is an
>> >> older build, so no extended statistics there):
>> >>
>> >> ## Starting application at 0x480803C0 ...
>> >> *** Booting Zephyr OS build zephyr-v2.4.0-1137-g5803ee1e8183  ***
>> >> RT Eval app
>> >>
>> >> Counter freq is 33280000 Hz. Period is 30 ns
>> >> Mean: 31 (930 ns) stddev: 0 (0 ns)
>> >> Mean: 31 (930 ns) stddev: 0 (0 ns)
>> >> Mean: 31 (930 ns) stddev: 0 (0 ns)
>> >> Mean: 31 (930 ns) stddev: 0 (0 ns)
>> >> Mean: 31 (930 ns) stddev: 0 (0 ns)
>> >> Mean: 31 (930 ns) stddev: 0 (0 ns)
>>
>> And this is Zephyr running as baremetal.
>>
>> >> ...
>> >>
>> >> As Zephyr provided stable readouts with no jitter, I used it to do all
>> >> subsequent measurements.
>> >
>> > I am a bit confused here. Looking at the numbers above, the stddev is
>> > 112110 ns in the first instance. That is pretty high. Am I looking at
>> > the wrong numbers?
>>
>> I added some clarification above. As for the 112110ns in the very first
>> instance - I always ignored the first instance, assuming that things
>> need to settle after the domain is created.
>>
>> But your comment is actually correct: what exactly should "settle"?
>> The domain is already created. Everything should run smoothly. So, this
>> is worth investigating.
>
> It is fair to ignore the first 2 measurements.
>
> However, the numbers above have high stddev even at the 3rd measurement:
> 53us is high and above the target 30us. I take it you didn't yet apply
> the alloc_heap_pages and the serial workarounds?

Yes, this is the very first try, without workarounds and proper
scheduler configuration. With all fixes it is much better:

Mean: 307 (9210 ns) stddev: 4 (120 ns) above thr: 0% [265 (7950 ns) - 376 (11280 ns)] global [265 (7950 ns) 376 (11280 ns)]
Mean: 306 (9180 ns) stddev: 3 (90 ns) above thr: 0% [263 (7890 ns) - 360 (10800 ns)] global [263 (7890 ns) 376 (11280 ns)]
Mean: 306 (9180 ns) stddev: 3 (90 ns) above thr: 0% [298 (8940 ns) - 373 (11190 ns)] global [263 (7890 ns) 376 (11280 ns)]
Mean: 307 (9210 ns) stddev: 17 (510 ns) above thr: 0% [265 (7950 ns) - 856 (25680 ns)] global [263 (7890 ns) 1251 (37530 ns)]
Mean: 304 (9120 ns) stddev: 7 (210 ns) above thr: 0% [286 (8580 ns) - 486 (14580 ns)] global [263 (7890 ns) 1251 (37530 ns)]
Mean: 307 (9210 ns) stddev: 47 (1410 ns) above thr: 0% [260 (7800 ns) - 1299 (38970 ns)] global [260 (7800 ns) 1299 (38970 ns)]
Mean: 291 (8730 ns) stddev: 41 (1230 ns) above thr: 0% [250 (7500 ns) - 1324 (39720 ns)] global [250 (7500 ns) 1324 (39720 ns)]
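
A sketch of the bookkeeping behind these lines (running mean, stddev,
share of interrupts above the 30 us threshold, per-batch and global
extremes), with invented names -- the actual application is at [3]:

  #include <stdint.h>

  #define TICK_NS   30u
  #define THR_TICKS (30000u / TICK_NS)  /* 30 us threshold, in ticks */

  struct batch {
          uint64_t sum, sum_sq;         /* running sums for mean/stddev */
          uint32_t n, above_thr;
          uint32_t min, max;            /* per-batch extremes */
  };

  static void record(struct batch *b, uint32_t ticks)
  {
          b->sum += ticks;
          b->sum_sq += (uint64_t)ticks * ticks;
          b->n++;
          if (ticks > THR_TICKS)
                  b->above_thr++;
          if (ticks < b->min)
                  b->min = ticks;
          if (ticks > b->max)
                  b->max = ticks;
  }

  /* At the end of each 10 s batch: mean = sum / n,
   * stddev = sqrt(sum_sq / n - mean * mean),
   * "above thr" = 100 * above_thr / n, and the batch min/max are
   * folded into the global extremes before the counters reset. */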

>
>> >> IMPORTANT! All subsequent tests were conducted with only 1 CPU core
>> >> enabled. My goal was to ensure that the system can react to an
>> >> external interrupt in a timely manner even under load.
>> >
>> > All right. FYI I have no frame of reference for 2 vCPUs on 1 pCPU, all
>> > my tests were done with 1vCPU <-> 1pCPU and the null scheduler.
>>
>> As I said, I had no issues with the 1vCPU <-> 1pCPU setup, so I quickly
>> moved on to the cases which had issues.
>>
>> >> Test results and latency sources
>> >>
>> >> As you can see, a baremetal OS provides a latency of ~0.9us without
>> >> any deviations. The same code running as a DomU on an idle system
>> >> shows a mean latency of 12us and a deviation of about 10us. The range
>> >> of latencies in a 10s batch can vary from 8us to 25us. This fits into
>> >> the required 30us threshold, so no big issue there.
>> >>
>> >> But this worsens under certain conditions.
>> >>
>> >> 1. Serial console. The RCAR serial driver (scif) works in synchronous
>> >>    mode, so any heavy serial console output leads to higher latency.
>> >>    Tests show a mean latency of 1000us and a deviation of 1332us. 54%
>> >>    of interrupts are handled outside of the 30us period. Real values
>> >>    may be even higher, because in some cases the timer would complete
>> >>    a full cycle, and instead of, say, 11ms the ISR would read
>> >>    11 % 10 = 1ms of latency. I tried to enable asynchronous mode for
>> >>    the serial console. This made things better, but it led to gaps in
>> >>    the output, so I just turned the serial console off completely.
>> >
>> > That's very interesting. I wonder if other serial drivers would cause
>> > similar issues, e.g. PL011.
>>
>> They should. This behavior is programmed in serial.c. A driver can
>> enable async mode by calling serial_async_transmit(). As far as I can
>> see, only the ns16550 driver does this.
>>
>> Maybe you didn't see problems there because you had more pCPUs enabled
>> and Xen used some other pCPU to do UART operations.
>
> Good to know. A workaround would be to let Dom0/DomU have direct access
> to the UART, not giving any UART to Xen at all.

Yes. Or use SSH. I chose the second way.
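
For reference, the opt-in mentioned above is a single call from the
driver. A purely hypothetical sketch of what it could look like for
scif, mirroring what serial.c and ns16550 are described as doing (the
hook name and placement are assumptions, and the gaps-in-output problem
would still need solving):

  /* Hypothetical sketch only, not a proposed patch: a driver requests
   * asynchronous transmit by calling serial_async_transmit(), which
   * today only ns16550 does. Whether scif can drain its FIFO this way
   * without the output gaps mentioned above is the open question. */
  static void __init scif_uart_init_postirq(struct serial_port *port)
  {
          /* ... existing interrupt setup for the port ... */
          serial_async_transmit(port);
  }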

[...]

--
Volodymyr Babchuk at EPAM