[PATCH 0/1] x86/pv: Split pv_hypercall() in two
Full perf analysis. Time is raw TSC cycles for a xen_version() hypercall,
compared across the change in patch 1, with obvious outliers excluded,
i.e. the idealised best-case improvement.
Some general notes. pv64 is `syscall`, while pv32 is `int $0x82` and
therefore has more overhead to begin with. Consequently, dropping two lfences
is a smaller relative change in the pv32 path.
First, AMD Milan (Zen3):
$ ministat -A milan-hcall-pv64-{before,after}
x milan-hcall-pv64-before
+ milan-hcall-pv64-after
    N           Min           Max        Median           Avg        Stddev
x  98           420           460           440     438.97959     6.6564899
+  98           360           440           380     370.81633      12.57337
Difference at 95.0% confidence
-68.1633 +/- 2.81674
-15.5277% +/- 0.641656%
(Student's t, pooled s = 10.0598)
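For anyone wanting to sanity-check the ministat figures, the summary lines can be recomputed from N, Avg and Stddev alone. The sketch below is not ministat itself; the function name is made up for illustration, and it assumes a large-sample critical value of 1.960, which happens to reproduce the interval reported for Milan pv64.

```python
import math

def pooled_diff(n1, avg1, sd1, n2, avg2, sd2, crit=1.960):
    """Difference of means (after - before) with a pooled-variance interval.

    crit is an assumed large-sample critical value for 95% confidence.
    """
    dof = n1 + n2 - 2
    # Pooled standard deviation across both samples.
    pooled_s = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / dof)
    diff = avg2 - avg1
    # Half-width of the confidence interval on the difference.
    err = crit * pooled_s * math.sqrt(1.0 / n1 + 1.0 / n2)
    return diff, err, pooled_s

# Milan pv64, before (x) and after (+), taken from the table above.
before_avg = 438.97959
diff, err, pooled_s = pooled_diff(98, before_avg, 6.6564899,
                                  98, 370.81633, 12.57337)
pct = 100.0 * diff / before_avg
pct_err = 100.0 * err / before_avg

print(f"{diff:.4f} +/- {err:.5f}")
print(f"{pct:.4f}% +/- {pct_err:.6f}%")
print(f"pooled s = {pooled_s:.4f}")
```

The percentages are relative to the "before" average, which is why the same absolute saving shows up as a smaller percentage on a slower path.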
$ ministat -A milan-hcall-pv32-{before,after}
x milan-hcall-pv32-before
+ milan-hcall-pv32-after
    N           Min           Max        Median           Avg        Stddev
x  98          1900          2100          1980     1984.2857     22.291416
+  96          1740          1960          1760        1767.5     35.688713
Difference at 95.0% confidence
-216.786 +/- 8.35522
-10.9251% +/- 0.421069%
(Student's t, pooled s = 29.6859)
Second, AMD Naples (Zen1):
$ ministat -A naples-hcall-pv64-{before,after}
x naples-hcall-pv64-before
+ naples-hcall-pv64-after
    N           Min           Max        Median           Avg        Stddev
x  97           294           336           315     311.75258     10.207259
+  97           252           273           252     257.41237     9.2328135
Difference at 95.0% confidence
-54.3402 +/- 2.73904
-17.4306% +/- 0.878593%
(Student's t, pooled s = 9.73224)
$ ministat -A naples-hcall-pv32-{before,after}
x naples-hcall-pv32-before
+ naples-hcall-pv32-after
    N           Min           Max        Median           Avg        Stddev
x  98          1260          1470          1260     1276.2857     42.913483
+  95          1218          1470          1239     1250.9368     52.491298
Difference at 95.0% confidence
-25.3489 +/- 13.5082
-1.98614% +/- 1.0584%
(Student's t, pooled s = 47.8673)
Third, Intel Coffeelake-R:
$ ministat -A cflr-hcall-pv64-{before,after}
x cflr-hcall-pv64-before
+ cflr-hcall-pv64-after
    N           Min           Max        Median           Avg        Stddev
x 100           774          1024           792        825.04     73.608563
+  95           734           966           756     787.74737     70.580114
Difference at 95.0% confidence
-37.2926 +/- 20.2602
-4.5201% +/- 2.45567%
(Student's t, pooled s = 72.1494)
$ ministat -A cflr-hcall-pv32-{before,after}
x cflr-hcall-pv32-before
+ cflr-hcall-pv32-after
    N           Min           Max        Median           Avg        Stddev
x 100          2176          3816          2198       2288.84     196.18218
+  99          2180          2434          2198     2232.4646     75.867677
Difference at 95.0% confidence
-56.3754 +/- 41.4084
-2.46305% +/- 1.80914%
(Student's t, pooled s = 149.013)
Fourth, Intel Skylake Server:
$ ministat -A skx-hcall-pv64-{before,after}
x skx-hcall-pv64-before
+ skx-hcall-pv64-after
    N           Min           Max        Median           Avg        Stddev
x  99          5642          5720          5686      5686.303     17.909896
+  98          5520          5544          5540     5536.0816       8.20821
Difference at 95.0% confidence
-150.221 +/- 3.89729
-2.64181% +/- 0.0685382%
(Student's t, pooled s = 13.9542)
$ ministat -A skx-hcall-pv32-{before,after}
x skx-hcall-pv32-before
+ skx-hcall-pv32-after
    N           Min           Max        Median           Avg        Stddev
x  99          9296          9500          9308     9309.3131     20.418402
+  96          9110          9266          9180     9175.2292     27.860358
Difference at 95.0% confidence
-134.084 +/- 6.84111
-1.44032% +/- 0.0734868%
(Student's t, pooled s = 24.3673)
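To view the eight results side by side, the absolute and relative savings can be recomputed from the Avg columns above. This is only a convenience sketch; the dictionary layout and the helper name are illustrative, and the numbers are copied verbatim from the ministat output.

```python
# Avg cycles as (before, after), copied from the ministat output above.
avgs = {
    ("milan", "pv64"): (438.97959, 370.81633),
    ("milan", "pv32"): (1984.2857, 1767.5),
    ("naples", "pv64"): (311.75258, 257.41237),
    ("naples", "pv32"): (1276.2857, 1250.9368),
    ("cflr", "pv64"): (825.04, 787.74737),
    ("cflr", "pv32"): (2288.84, 2232.4646),
    ("skx", "pv64"): (5686.303, 5536.0816),
    ("skx", "pv32"): (9309.3131, 9175.2292),
}

def savings(before, after):
    """Return (cycles saved, percentage saved relative to 'before')."""
    saved = before - after
    return saved, 100.0 * saved / before

for (cpu, path), (before, after) in avgs.items():
    saved, pct = savings(before, after)
    print(f"{cpu:6} {path}  saved {saved:7.2f} cycles  "
          f"({pct:5.2f}% of {before:.0f})")
```

Printing cycles and percentages together makes it easier to separate "the path got cheaper by N cycles" from "the path was expensive to begin with".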
I'm honestly not sure why Naples PV32's improvement is so small, but I've
double checked the numbers. Clearly there's something on the `int $0x82` path
which is radically higher overhead on Naples vs Milan.
For the Intel numbers, both setups are writing to MSR_SPEC_CTRL on entry/exit,
but for Skylake it is the microcode implementation whereas for CFL-R, it is
the hardware implementation. Skylake also has XPTI adding further overhead to
the paths.
Andrew Cooper (1):
x86/pv: Split pv_hypercall() in two
xen/arch/x86/pv/hypercall.c | 24 +++++++++++++++++++-----
xen/arch/x86/pv/traps.c | 11 -----------
2 files changed, 19 insertions(+), 16 deletions(-)
--
2.11.0