
Re: [Xen-devel] [PATCH v3 2/2] x86/Intel: virtualize support for cpuid faulting

On Mon, Oct 24, 2016 at 8:05 AM, Boris Ostrovsky
<boris.ostrovsky@xxxxxxxxxx> wrote:
> On 10/24/2016 12:18 AM, Kyle Huey wrote:
>> The anomalies we see appear to be related to, or at least triggerable
>> by, the performance monitoring interrupt.  The following program runs
>> a loop of roughly 2^25 conditional branches.  It takes one argument,
>> the number of conditional branches to program the PMI to trigger on.
>> The default is 50,000, and if you run the program with that it'll
>> produce the same value every time.  If you drop it to 5000 or so
>> you'll probably see occasional off-by-one discrepancies.  If you drop
>> it to 500 the performance counter values fluctuate wildly.
> Yes, it does change but I also see the difference on baremetal (although
> not as big as it is in an HVM guest):
> ostr@workbase> ./pmu 500
> Period is 500
> Counted 5950003 conditional branches
> ostr@workbase> ./pmu 500
> Period is 500
> Counted 5850003 conditional branches
> ostr@workbase> ./pmu 500
> Period is 500
> Counted 7530107 conditional branches
> ostr@workbase>

Yeah, you're right.  I simplified the testcase too far.  I have
included a better one.  This testcase is stable on bare metal (down to
an interrupt every 10 branches; I didn't try below that) and more
accurately represents what our software actually does.  rr acts as a
ptrace supervisor to the process being recorded, and it seems that
context switching between the supervisor and tracee processes
stabilizes the performance counter values somehow.

>> I'm not yet sure if this is specifically related to the PMI, or if it
>> can be caused by any interrupt and it's only how frequently the
>> interrupts occur that matters.
> I have never used file interface to performance counters, but what are
> we reporting here (in read_counter()) --- total number of events or
> number of events since last sample? It is also curious to me that the
> counter is non-zero after PERF_EVENT_IOC_RESET (but again, I don't have
> any experience with these interfaces).

It should be number of events since the last time the counter was
reset (or overflowed, I guess).  On my machine the counter value is
zero both before and after the PERF_EVENT_IOC_RESET ioctl.
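
To make those counter semantics concrete, here is a minimal sketch (not
taken from rr) of resetting a perf event file descriptor and reading the
count back.  It assumes the event was opened with read_format left at 0,
so that read() returns a single 64-bit count.  The helper name
read_after_reset is made up for illustration.

```c
#include <stdint.h>
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical helper: zero the event count on perf_fd, then read it back.
   Returns 0 and stores the count in *out on success, -1 on any failure
   (in which case *out is left untouched). */
int read_after_reset(int perf_fd, uint64_t *out)
{
  uint64_t val;

  if (ioctl(perf_fd, PERF_EVENT_IOC_RESET, 0) != 0)
    return -1;
  /* Assumption: read_format == 0, so the kernel returns exactly one u64. */
  if (read(perf_fd, &val, sizeof(val)) != (ssize_t)sizeof(val))
    return -1;
  *out = val;
  return 0;
}
```

On a counter whose target has not run since the reset, one would expect
the value stored in *out to be 0.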

> Also, exclude_guest doesn't appear to make any difference, I don't know
> if there are any bits in Intel counters that allow you to distinguish
> guest from host (unlike AMD, where there is a bit for that).

exclude_guest is a Linux-specific thing for excluding KVM guests.
There is no hardware support involved; it's handled entirely in the
perf events infrastructure in the kernel.

- Kyle

#define _GNU_SOURCE 1

#include <assert.h>
#include <fcntl.h>
#include <linux/perf_event.h>
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

static struct perf_event_attr rcb_attr;
static uint64_t period;
static int fd;

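void counter_on(uint64_t ticks)
{
  /* Zero the count, program the sample period, then start counting. */
  int ret = ioctl(fd, PERF_EVENT_IOC_RESET, 0);
  assert(!ret);
  ret = ioctl(fd, PERF_EVENT_IOC_PERIOD, &ticks);
  assert(!ret);
  ret = ioctl(fd, PERF_EVENT_IOC_ENABLE, 1);
  assert(!ret);
}

void counter_off()
{
  /* PERF_EVENT_IOC_ENABLE with a 0 argument would still enable the event. */
  int ret = ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
  assert(!ret);
}

int64_t read_counter()
{
  int64_t val;
  ssize_t nread = read(fd, &val, sizeof(val));
  assert(nread == sizeof(val));
  return val;
}

void do_test()
{
  int i;
  /* volatile and unsigned: keeps the loop from being optimized away and
     avoids an uninitialized read and signed overflow. */
  volatile unsigned dummy = 0;

  for (i = 0; i < (1 << 25); i++) {
    dummy += i % (1 << 10);
    dummy += i % (79 * (1 << 10));
  }
}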

int main(int argc, const char* argv[])
{
  int pid;
  memset(&rcb_attr, 0, sizeof(rcb_attr));
  rcb_attr.size = sizeof(rcb_attr);
  rcb_attr.type = PERF_TYPE_RAW;
  /* Intel retired conditional branches counter, ring 3 only */
  rcb_attr.config = 0x5101c4;
  rcb_attr.exclude_kernel = 1;
  rcb_attr.exclude_guest = 1;
  /* We'll change this later */
  rcb_attr.sample_period = 0xffffffff;

  signal(SIGALRM, SIG_IGN);
  pid = fork();
  if (pid == 0) {
    /* Wait for the parent */
    kill(getpid(), SIGSTOP);
    /* Assumed placement: the archived copy never calls do_test(). */
    do_test();
    return 0;
  }

  /* start the counter */
  fd = syscall(__NR_perf_event_open, &rcb_attr, pid, -1, -1, 0);
  if (fd < 0) {
    printf("Failed to initialize counter\n");
    return -1;
  }

  struct f_owner_ex own;
  own.type = F_OWNER_PID;
  own.pid = pid;
  if (fcntl(fd, F_SETOWN_EX, &own) ||
      fcntl(fd, F_SETFL, O_ASYNC) ||
      fcntl(fd, F_SETSIG, SIGALRM)) {
    printf("Failed to make counter async\n");
    return -1;
  }

  period = 50000;
  if (argc > 1) {
    unsigned long long parsed;
    if (sscanf(argv[1], "%llu", &parsed) == 1) {
      period = parsed;
    }
  }

  printf("Period is %llu\n", (unsigned long long)period);

  /* Assumed placement: arm the counter before the child resumes. */
  counter_on(period);
  ptrace(PTRACE_SEIZE, pid, NULL, 0);
  ptrace(PTRACE_CONT, pid, NULL, SIGCONT);

  int status = 0;
  while (1) {
    waitpid(pid, &status, 0);
    if (WIFEXITED(status)) {
      break;
    }
    if (WIFSIGNALED(status)) {
      /* Assumed: a fatal signal in the child also ends the run. */
      break;
    }
    if (WIFSTOPPED(status)) {
      if (WSTOPSIG(status) == SIGALRM ||
          WSTOPSIG(status) == SIGSTOP) {
        /* Re-inject the signal and keep waiting. */
        ptrace(PTRACE_CONT, pid, NULL, WSTOPSIG(status));
        continue;
      }
    }
    assert(0 && "unhandled ptrace event!");
  }

  /* Assumed placement: stop counting before the final read. */
  counter_off();
  int64_t counts = read_counter();
  printf("Counted %lld conditional branches\n", (long long)counts);

  return 0;
}
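
For readers decoding the raw event value 0x5101c4 used in the program
above, the following sketch splits it into fields using the Intel
IA32_PERFEVTSELx register layout.  The struct and function names are made
up for illustration.  Which of the upper control bits the kernel actually
honors in a PERF_TYPE_RAW config is not covered here.

```c
#include <stdint.h>
#include <stdio.h>

/* Selected fields of an Intel IA32_PERFEVTSELx register. */
struct evtsel {
  unsigned event;   /* bits 0-7:  event select */
  unsigned umask;   /* bits 8-15: unit mask */
  unsigned usr;     /* bit 16: count while in ring 3 */
  unsigned os;      /* bit 17: count while in ring 0 */
  unsigned intr;    /* bit 20: raise an interrupt on overflow */
  unsigned enable;  /* bit 22: counter enabled */
};

/* Hypothetical helper: extract the fields above from a raw config. */
struct evtsel decode_evtsel(uint64_t config)
{
  struct evtsel e;
  e.event  = (unsigned)(config & 0xff);
  e.umask  = (unsigned)((config >> 8) & 0xff);
  e.usr    = (unsigned)((config >> 16) & 1);
  e.os     = (unsigned)((config >> 17) & 1);
  e.intr   = (unsigned)((config >> 20) & 1);
  e.enable = (unsigned)((config >> 22) & 1);
  return e;
}

/* Hypothetical helper: print the decoded fields on one line. */
void print_evtsel(uint64_t config)
{
  struct evtsel e = decode_evtsel(config);
  printf("event=0x%02x umask=0x%02x usr=%u os=%u int=%u en=%u\n",
         e.event, e.umask, e.usr, e.os, e.intr, e.enable);
}
```

Calling print_evtsel(0x5101c4) gives event 0xc4 with umask 0x01 (retired
conditional branches, per the comment in the program), with the USR, INT
and EN bits set and the OS bit clear, which matches "ring 3 only".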
