Re: IRQ latency measurements in hypervisor
Hi Andrew,

Andrew Cooper writes:

> On 12/01/2021 23:48, Volodymyr Babchuk wrote:
>> Hello community,
>>
>> Recently I was experimenting with context switching latencies in Xen. I
>> want to share my findings and discuss some issues I have found.
>>
>> My target was to ensure that no more than 30us passes between interrupt
>> generation and handling in a guest VM. The main goal was to determine if
>> Xen is ready for soft RT and to identify the main issues that prevent this.
>>
>> TLDR: I was able to achieve a mean latency of 15us with a tuned RTDS
>> scheduler, a disabled serial console and some hacks to the Xen toolstack.
>> This was done on a setup with 1 PCPU.
>>
>> Test setup.
>>
>> ARMv8 platform. Namely, a Renesas R-Car H3 SoC on a Salvator board.
>>
>> To accurately determine latency, I employed one of the timer counter units
>> (TMUs) available on the SoC. This is a 32-bit timer with auto-reload that
>> can generate an interrupt on underflow. I fed it with a 33.275MHz clock,
>> which gave me a resolution of about 30ns. I programmed the timer to
>> generate an interrupt every 10ms. My ISR then read the current timer value
>> and determined how much time had passed since the last underflow. This
>> gave me the time interval between IRQ generation and ISR invocation. Those
>> values were collected, and every 10 seconds statistics were calculated.
>> Here is an example of output from my Linux driver:
>>
>> [  83.873162] rt_eval_tmu e6fc0000.tmu: Mean: 44 (1320 ns) stddev: 8 (240 ns)
>> [  94.136632] rt_eval_tmu e6fc0000.tmu: Mean: 44 (1320 ns) stddev: 8 (240 ns)
>> [ 104.400098] rt_eval_tmu e6fc0000.tmu: Mean: 50 (1500 ns) stddev: 129 (3870 ns)
>> [ 114.663828] rt_eval_tmu e6fc0000.tmu: Mean: 44 (1320 ns) stddev: 8 (240 ns)
>> [ 124.927296] rt_eval_tmu e6fc0000.tmu: Mean: 56 (1680 ns) stddev: 183 (5490 ns)
>>
>> This is baremetal Linux. And here is Dom0:
>>
>> [ 237.431003] rt_eval_tmu e6fc0000.tmu: Mean: 306 (9180 ns) stddev: 25 (750 ns)
>> [ 247.694506] rt_eval_tmu e6fc0000.tmu: Mean: 302 (9060 ns) stddev: 17 (510 ns)
>>
>> The driver outputs both the raw timer value (e.g. 44) and the same value
>> scaled to nanoseconds (e.g. 1320 ns). As you can see, the baremetal setup
>> is much faster. But experiments showed that Linux does not provide
>> consistent values, even when running in baremetal mode. You can see
>> sporadic spikes in the "std dev" values.
>>
>> So my next step was to use a proper RT OS to do the measurements. I chose
>> Zephyr. My PR that adds Xen support to Zephyr can be found at [1]. Support
>> for R-Car Gen3 is not upstreamed, but is present on my GitHub ([2]). At [3]
>> you can find the source code for the application that does the latency
>> measurements. It behaves exactly like my Linux driver, but provides a bit
>> more information:
>>
>> *** Booting Zephyr OS build zephyr-v2.4.0-2750-g0f2c858a39fc ***
>> RT Eval app
>>
>> Counter freq is 33280000 Hz. Period is 30 ns
>> Set alarm in 0 sec (332800 ticks)
>> Mean: 600 (18000 ns) stddev: 3737 (112110 ns) above thr: 0% [265 (7950 ns) - 66955 (2008650 ns)] global [265 (7950 ns) 66955 (2008650 ns)]
>> Mean: 388 (11640 ns) stddev: 2059 (61770 ns) above thr: 0% [266 (7980 ns) - 58830 (1764900 ns)] global [265 (7950 ns) 66955 (2008650 ns)]
>> Mean: 358 (10740 ns) stddev: 1796 (53880 ns) above thr: 0% [265 (7950 ns) - 57780 (1733400 ns)] global [265 (7950 ns) 66955 (2008650 ns)]
>> ...
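For reference, the measurement logic in the ISR described above boils down
to roughly the following. This is an illustrative sketch, not the actual
rt_eval_tmu driver code; read_timer_count() and record_sample() are
placeholder names:

#include <stdint.h>

#define TICK_NS       30u      /* counter period in ns (33.28 MHz clock)  */
#define RELOAD_VALUE  332800u  /* reload constant: 10 ms worth of ticks   */

/* Platform-specific MMIO read of the current (down-counting) value. */
static uint32_t read_timer_count(void);
/* Statistics collection (mean/stddev/min/max over a 10 s window). */
static void record_sample(uint32_t ticks, uint32_t ns);

static void tmu_isr(void)
{
    /*
     * The timer underflowed, raised the IRQ and reloaded RELOAD_VALUE.
     * Whatever it has counted down since then is the time between IRQ
     * generation and ISR invocation.
     */
    uint32_t now = read_timer_count();
    uint32_t latency_ticks = RELOAD_VALUE - now;

    record_sample(latency_ticks, latency_ticks * TICK_NS);
}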
>>
>> So there you can see: mean time, standard deviation, the % of interrupts
>> that were processed above the 30us threshold, the minimum and maximum
>> latency values for the current 10s run, and the global minimum and maximum.
>>
>> Zephyr running as baremetal showed very stable results (this is an older
>> build, so no extended statistics there):
>>
>> ## Starting application at 0x480803C0 ...
>> *** Booting Zephyr OS build zephyr-v2.4.0-1137-g5803ee1e8183 ***
>> RT Eval app
>>
>> Counter freq is 33280000 Hz. Period is 30 ns
>> Mean: 31 (930 ns) stddev: 0 (0 ns)
>> Mean: 31 (930 ns) stddev: 0 (0 ns)
>> Mean: 31 (930 ns) stddev: 0 (0 ns)
>> Mean: 31 (930 ns) stddev: 0 (0 ns)
>> Mean: 31 (930 ns) stddev: 0 (0 ns)
>> Mean: 31 (930 ns) stddev: 0 (0 ns)
>> ...
>>
>> As Zephyr provided stable readouts with no jitter, I used it to do all
>> subsequent measurements.
>>
>> IMPORTANT! All subsequent tests were conducted with only 1 CPU core
>> enabled. My goal was to ensure that the system can react to an external
>> interrupt in a timely manner even under load.
>>
>> Test results and latency sources
>>
>> As you can see, the baremetal OS provides a latency of ~0.9us without any
>> deviations. The same code running as a DomU on an idle system shows a mean
>> latency of 12us and a deviation of about 10us. The range of latencies in a
>> 10s batch can vary from 8us to 25us. This fits into the required 30us
>> threshold, so there is no big issue there.
>
> This is all very interesting work - thank you for sharing.
>
> However, there is one set of measurements which I think are missing, and
> are needed to describe the full picture.
>
> You've got measurements for a native kernel taking interrupts, with
> excellent results for Zephyr, and good results for Linux.
>
> You've also got measurements for putting Xen into the mix, and the
> results could do with some work.
>
> Adding Xen totally changes the structure of the system. Conceptually
> (if the Xen model were expressed as native), what you've got now is a
> single piece of userspace (pid1, no fork()ing), which is single threaded
> (but doing its own private task switching between "threads"), where some
> threads are making blocking system calls (all Xen hypercalls are fully
> synchronous) while other threads are measuring the latency for a signal
> to hit.

Not quite. I used Zephyr to do most of the measurements, and it ran as a
DomU. So, conceptually it was pid1 (Dom0) with a single thread (vCPU) and
pid2 (Zephyr) with a single thread. Pid1 is doing some blocking calls and
pid2 is just waiting for its signal so it can measure latency. Thus, my
setup more resembles a system with two processes, not one.

> I think it would be incredibly valuable to mock something like that up
> as a 3rd set of measurements, because it allows for a fairer comparison
> of the rough structure of a virtualised system, with and without Xen
> being involved.

Well, I probably can build a similar setup in Linux. This will allow me to
compare Linux and Xen. Of course, it will be interesting, but I'm not sure
that this comparison will provide any new information. I am examining Xen
to see if it can fit into some arbitrary requirements for soft RT. I doubt
that Linux will behave better in this scenario, but even if it does, Linux
and Xen are quite different, so it is not easy to "port" something
scheduling-related from one to another.

> It would also be very valuable to see where the latency is in the
> system.
> Would it be possible to set up a shared page, and have the various moving
> parts write the counter value they see into it, so the final stats can do
> a breakdown?
>
> You probably want counter samples from:
> 1) When Xen has positively identified this single interrupt
> 2) When the interrupt has been marked pending for the guest
> 3) When the guest kernel has positively identified this single interrupt
> 4) When the interrupt handler has run (this is the datapoint you've got,
>    AFAICT).

I am using the tracing infrastructure already available in Xen to do such
measurements. I described in my reply to Julien how I am doing this. Of
course, it covers only the hypervisor part, so it is impossible to cover
points 3 and 4. On the other hand, Zephyr is quite consistent in its own
latency. Basically, it does nothing other than wait for that single
interrupt. I assume it is safe to take the baremetal measurements as the
delay caused by Zephyr itself.

> Maybe for point 2, you also want some metadata as to what context was
> interrupted (EL0..2).
>
> You ought to see reasonably consistent results when interrupting EL0,
> because that should transition immediately to EL1. For interrupting
> EL1, you'll perhaps see variation depending on whether other higher
> priority interrupts are being serviced.

Yes, I am planning to do that sort of measurement. One downside is that
you can use TRACE_XD macros only from C code, so it would not yield the
most accurate result. On the other hand, I am experimenting with hardware
tracing. It is somewhat hard to configure, but I hope I'll be able to find
a tracing configuration that does not cause trace buffer overflows. In
that case, I will get the most accurate timing data possible.

> For interrupting EL2, this is where I'd expect you to see all the
> problems with hypercalls, or potentially other work Xen is doing.
> Fundamentally, Xen can't just drop what it's doing and resume guest
> context for the interrupt to be delivered.
>
> We get around this with the hypercall continuation mechanism, which is
> conceptually similar to voluntary preemption (we periodically check to
> see if we should stop doing what we're doing, unwind state safely, and
> get back out into guest context).
>
> However, even using hypercall continuations, there will be a minimum
> unit of work, necessary to guarantee forward progress, which cannot be
> subdivided further. There is also a cost to having finer-grained
> continuation checks, but hopefully that is confined to the error margins.

I see. Another problem with hypercall continuations is that there is a
single point of preemption. If a hypercall invokes some long-running
function that can't be split into smaller ones, like the already mentioned
alloc_heap_pages(), it is impossible to make it preemptible.

> Alternatively, if we still have problems, something to consider might be
> the task domain concept which has been discussed before. This was an
> idea to try and bring something more like normal tasks to Xen, by having
> a hypervisor context that the control domain could queue work with, whose
> running wouldn't be a synchronous hypercall from the guest's perspective.
>
> Such a task domain would require a mechanism to ensure it got scheduled
> with a high enough priority not to get starved, and would probably
> benefit if we restructured Xen to have fewer irq-off areas (which would
> be a great thing irrespective - the scheduler lock in particular needs
> to become irqs-off).
> In other words, the task domain concept is a lot of work to do, with only
> a general hope of it resulting in an improvement - it shouldn't be jumped
> upon lightly to solve perf problems.

The task domain seems like an extra entity. Like, how is it better than
just native Xen threads? E.g. the Linux kernel does not have a separate
process to hold its own threads.

As I see it, the only reason why a hypercall can't be preempted is the
absence of a place where its context can be stored. Hence all this black
magic of using hypercall arguments as context storage. I'm sure that I am
oversimplifying and that I'm missing some key aspect, but how hard would it
be to just allocate a Xen stack for every vCPU and store its context there?
All Xen code is already made thread-safe because of SMP. So why not just
allow preemption at any point outside of atomic context?
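To illustrate the point about the single preemption point: with
continuations, preemption can only happen at explicit check points,
roughly as in the sketch below (purely illustrative, with made-up names,
not actual Xen code). If the per-item work cannot be subdivided, there is
simply nowhere to put the check.

#include <stdbool.h>

/* Placeholder helpers, purely for illustration. */
void process_one_item(unsigned long i);
bool preemption_wanted(void);
long create_continuation(unsigned long start, unsigned long nr_items);

long do_example_op(unsigned long start, unsigned long nr_items)
{
    unsigned long i;

    for ( i = start; i < nr_items; i++ )
    {
        /*
         * One indivisible unit of work. If this call is itself long
         * running and cannot be split (e.g. a huge allocation), there
         * is no preemption point inside it.
         */
        process_one_item(i);

        /*
         * Between units, check whether we should bail out to guest
         * context. The current position is encoded back into the
         * hypercall arguments, so the call is re-issued by the guest
         * and resumes from item i + 1.
         */
        if ( i + 1 < nr_items && preemption_wanted() )
            return create_continuation(i + 1, nr_items);
    }

    return 0;
}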
--
Volodymyr Babchuk at EPAM