
Re: [PATCH v5 04/34] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration


  • To: David Woodhouse <dwmw2@xxxxxxxxxxxxx>, x86@xxxxxxxxxx, kvm@xxxxxxxxxxxxxxx, linux-doc@xxxxxxxxxxxxxxx, linux-kernel@xxxxxxxxxxxxxxx, xen-devel@xxxxxxxxxxxxxxxxxxxx, linux-kselftest@xxxxxxxxxxxxxxx
  • From: Dongli Zhang <dongli.zhang@xxxxxxxxxx>
  • Date: Mon, 15 Jun 2026 23:47:39 -0700
  • Cc: Paolo Bonzini <pbonzini@xxxxxxxxxx>, Jonathan Corbet <corbet@xxxxxxx>, Shuah Khan <skhan@xxxxxxxxxxxxxxxxxxx>, Sean Christopherson <seanjc@xxxxxxxxxx>, Thomas Gleixner <tglx@xxxxxxxxxx>, Ingo Molnar <mingo@xxxxxxxxxx>, Borislav Petkov <bp@xxxxxxxxx>, Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>, "H. Peter Anvin" <hpa@xxxxxxxxx>, Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>, Juergen Gross <jgross@xxxxxxxx>, Boris Ostrovsky <boris.ostrovsky@xxxxxxxxxx>, Paul Durrant <paul@xxxxxxx>, Jonathan Cameron <jic23@xxxxxxxxxx>, Marc Zyngier <maz@xxxxxxxxxx>, Sascha Bischoff <Sascha.Bischoff@xxxxxxx>, Jack Allister <jalliste@xxxxxxxxxx>, joe.jin@xxxxxxxxxx, Joey Gouly <joey.gouly@xxxxxxx>
  • Delivery-date: Tue, 16 Jun 2026 06:49:26 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

I tested patches 02, 03, 04, and 26 by customizing QEMU to support kexec live
updates (LUO and KHO), preserving the memfd across kexec.

For my use case, I used KVM_[GS]ET_CLOCK_GUEST instead of the existing
KVM_[GS]ET_CLOCK. I didn't account for the downtime in my QEMU code, although
the host TSC never resets across kexec.

Clock drift was zero, and I did not observe any unnecessary master clock updates
after KVM_SET_CLOCK_GUEST completed.
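
For reference, the restoration arithmetic described in the quoted commit
message can be sketched in userspace C. This is only an illustration under
stated assumptions, not the kernel code: struct pvclock_sketch, scale_delta(),
pvclock_read() and kvmclock_offset_for() are hypothetical names, and the struct
mirrors only the pvclock_vcpu_time_info fields the calculation needs.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative subset of pvclock_vcpu_time_info (not the uapi struct). */
struct pvclock_sketch {
	uint64_t tsc_timestamp;     /* guest TSC at the saved reference point */
	uint64_t system_time;       /* KVM clock in ns at that reference point */
	uint32_t tsc_to_system_mul; /* 32-bit fixed-point multiplier */
	int8_t   tsc_shift;         /* shift applied to the TSC delta first */
};

/* ns = ((delta << shift) * mul) >> 32, shifting right when shift < 0. */
static uint64_t scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
	/* Same range the quoted patch accepts for tsc_shift. */
	assert(shift >= -32 && shift <= 32);

	if (shift < 0)
		delta >>= -shift;
	else
		delta <<= shift;

	/* unsigned __int128 is a GCC/Clang extension, used for the 96-bit product. */
	return (uint64_t)(((unsigned __int128)delta * mul) >> 32);
}

/* KVM clock value that the saved pvclock defines for a given guest TSC. */
static uint64_t pvclock_read(const struct pvclock_sketch *pv, uint64_t guest_tsc)
{
	return pv->system_time +
	       scale_delta(guest_tsc - pv->tsc_timestamp,
			   pv->tsc_to_system_mul, pv->tsc_shift);
}

/*
 * Offset that makes (master_kernel_ns + offset) equal the clock value the
 * saved pvclock intends at the new reference point.
 */
static int64_t kvmclock_offset_for(const struct pvclock_sketch *saved,
				   uint64_t guest_tsc_now,
				   uint64_t master_kernel_ns)
{
	return (int64_t)(pvclock_read(saved, guest_tsc_now) - master_kernel_ns);
}
```

With a 2 GHz guest TSC (mul 0x80000000, shift 0), a saved reference of TSC 1000
at 5000 ns, and a new reference point 4,000,000,000 cycles later, the intended
clock is 2,000,005,000 ns. The offset is whatever makes master_kernel_ns plus
the offset equal to that value.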


Another interesting observation from my experiments is that tsc_khz changes
across kexec. Since the TSC value itself does not reset across kexec, I'm
wondering whether there is any reason to switch to the new tsc_khz value after
the kexec.

I previously sent a QEMU patch that takes advantage of your KVM commit
ffbb61d09fc5 ("KVM: x86: Accept KVM_[GS]ET_TSC_KHZ as a VM ioctl.").

[PATCH 1/1] target/i386/kvm: set VM ioctl KVM_SET_TSC_KHZ to maintain TSC
synchronization
https://lore.kernel.org/qemu-devel/20260210202041.153736-1-dongli.zhang@xxxxxxxxxx


While live migration involves two different machines, kexec is performed on the
same machine. Given that the TSC value itself is preserved across kexec, would
it make sense to reuse the pre-kexec tsc_khz value instead of using the new
tsc_khz after kexec?

I tested this by using LUO to preserve tsc_khz across kexec, and the results
looked good.
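
The frequency question also matters for KVM_SET_CLOCK_GUEST itself: as far as
the quoted patch reads, it derives a frequency from the saved mul/shift pair
and returns -ERANGE when that differs from the current (scaled) host frequency
by more than 1 kHz. A rough userspace port of that check, as a sketch only:
within_1khz() is a hypothetical helper name, and the shift >= 64 guard is an
addition that is not in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Userspace port of hvclock_to_hz() from the quoted patch. */
static uint64_t hvclock_to_hz(uint32_t mul, int8_t shift_in)
{
	uint64_t tm = NSEC_PER_SEC << 32;
	int shift = shift_in;

	assert(mul != 0); /* the patch rejects a zero multiplier before calling */

	/* Shift left so the top bit is set, and compensate afterwards. */
	tm <<= 2;
	shift += 2;

	/* While 'mul' is even, halve it and shift once more after the division. */
	while (!(mul & 1)) {
		shift++;
		mul >>= 1;
	}

	tm /= mul;

	if (shift >= 64) /* guard added in this sketch; avoids an undefined shift */
		return 0;

	return shift > 0 ? tm >> shift : tm << -shift;
}

/* Same +/- 1 kHz window as the patch, written to avoid unsigned underflow. */
static bool within_1khz(uint64_t user_hz, uint64_t curr_hz)
{
	return user_hz + 1000 >= curr_hz && user_hz <= curr_hz + 1000;
}
```

If the tsc_khz measured after kexec lands outside that window, a pvclock saved
before the kexec would be rejected, which is one more argument for carrying the
pre-kexec tsc_khz forward.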

Thank you very much!

Dongli Zhang

On 2026-06-08 7:47 AM, David Woodhouse wrote:
> From: Jack Allister <jalliste@xxxxxxxxxx>
> 
> In the common case (where kvm->arch.use_master_clock is true), the KVM
> clock is defined as a simple arithmetic function of the guest TSC, based
> on a reference point stored in kvm->arch.master_kernel_ns and
> kvm->arch.master_cycle_now.
> 
> The existing KVM_[GS]ET_CLOCK functionality does not allow for this
> relationship to be precisely saved and restored by userspace. All it can
> currently do is set the KVM clock at a given UTC reference time, which
> is necessarily imprecise.
> 
> So on live update, the guest TSC can remain cycle accurate at precisely
> the same offset from the host TSC, but there is no way for userspace to
> restore the KVM clock accurately.
> 
> Even on live migration to a new host, where the accuracy of the guest
> time-keeping is fundamentally limited by the accuracy of wallclock
> synchronization between the source and destination hosts, the clock jump
> experienced by the guest's TSC and its KVM clock should at least be
> *consistent*. Even when the guest TSC suffers a discontinuity, its KVM
> clock should still remain the *same* arithmetic function of the guest
> TSC, and not suffer an *additional* discontinuity.
> 
> To allow for accurate migration of the KVM clock, add per-vCPU ioctls
> which save and restore the actual PV clock info in
> pvclock_vcpu_time_info.
> 
> The restoration in KVM_SET_CLOCK_GUEST works by creating a new reference
> point in time just as kvm_update_masterclock() does, and calculating the
> corresponding guest TSC value. This guest TSC value is then passed
> through the user-provided pvclock structure to generate the *intended*
> KVM clock value at that point in time, and through the *actual* KVM
> clock calculation. Then kvm->arch.kvmclock_offset is adjusted to
> eliminate the difference.
> 
> Where kvm->arch.use_master_clock is false (because the host TSC is
> unreliable, or the guest TSCs are configured strangely), the KVM clock
> is *not* defined as a function of the guest TSC so KVM_GET_CLOCK_GUEST
> returns an error. In this case, as documented, userspace shall use the
> legacy KVM_GET_CLOCK ioctl. The loss of precision is acceptable in this
> case since the clocks are imprecise in this mode anyway.
> 
> On *restoration*, if kvm->arch.use_master_clock is false, an error is
> returned for similar reasons and userspace shall fall back to using
> KVM_SET_CLOCK. This does mean that, as documented, userspace needs to
> use *both* KVM_GET_CLOCK_GUEST and KVM_GET_CLOCK and send both results
> with the migration data (unless the intent is to refuse to resume on a
> host with bad TSC).
> 
> Co-developed-by: David Woodhouse <dwmw@xxxxxxxxxxxx>
> Signed-off-by: David Woodhouse <dwmw@xxxxxxxxxxxx>
> Signed-off-by: Jack Allister <jalliste@xxxxxxxxxx>
> Reviewed-by: Paul Durrant <paul@xxxxxxx>
> Cc: Dongli Zhang <dongli.zhang@xxxxxxxxxx>
> ---
>  Documentation/virt/kvm/api.rst |  37 ++++++++
>  arch/x86/kvm/x86.c             | 164 +++++++++++++++++++++++++++++++++
>  include/uapi/linux/kvm.h       |   3 +
>  3 files changed, 204 insertions(+)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 52bbbb553ce1..2268b4442df6 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6553,6 +6553,43 @@ KVM_S390_KEYOP_SSKE
>    Sets the storage key for the guest address ``guest_addr`` to the key
>    specified in ``key``, returning the previous value in ``key``.
>  
> +4.145 KVM_GET_CLOCK_GUEST
> +----------------------------
> +
> +:Capability: none
> +:Architectures: x86_64
> +:Type: vcpu ioctl
> +:Parameters: struct pvclock_vcpu_time_info (out)
> +:Returns: 0 on success, <0 on error
> +
> +Retrieves the current time information structure used for KVM/PV clocks,
> +in precisely the form advertised to the guest vCPU, which gives parameters
> +for a direct conversion from a guest TSC value to nanoseconds.
> +
> +When the KVM clock is not in "master clock" mode, for example because the
> +host TSC is unreliable or the guest TSCs are oddly configured, the KVM clock
> +is actually defined by the host CLOCK_MONOTONIC_RAW instead of the guest TSC.
> +In this case, the KVM_GET_CLOCK_GUEST ioctl returns -EINVAL.
> +
> +4.146 KVM_SET_CLOCK_GUEST
> +----------------------------
> +
> +:Capability: none
> +:Architectures: x86_64
> +:Type: vcpu ioctl
> +:Parameters: struct pvclock_vcpu_time_info (in)
> +:Returns: 0 on success, <0 on error
> +
> +Sets the KVM clock (for the whole VM) in terms of the vCPU TSC, using the
> +pvclock structure as returned by KVM_GET_CLOCK_GUEST. This allows the precise
> +arithmetic relationship between guest TSC and KVM clock to be preserved by
> +userspace across migration.
> +
> +When the KVM clock is not in "master clock" mode, and the KVM clock is actually
> +defined by the host CLOCK_MONOTONIC_RAW, this ioctl returns -EINVAL. Userspace
> +may choose to set the clock using the less precise KVM_SET_CLOCK ioctl, or may
> +choose to fail, denying migration to a host whose TSC is misbehaving.
> +
>  .. _kvm_run:
>  
>  5. The kvm_run structure
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index d9ef165df6a1..b7e5f6e3dc6c 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6205,6 +6205,162 @@ static int kvm_get_reg_list(struct kvm_vcpu *vcpu,
>       return 0;
>  }
>  
> +#ifdef CONFIG_X86_64
> +static int kvm_vcpu_ioctl_get_clock_guest(struct kvm_vcpu *v, void __user *argp)
> +{
> +     struct pvclock_vcpu_time_info hv_clock = {};
> +     struct kvm_vcpu_arch *vcpu = &v->arch;
> +     struct kvm_arch *ka = &v->kvm->arch;
> +     unsigned int seq;
> +
> +     /*
> +      * If KVM_REQ_CLOCK_UPDATE is already pending, or if the pvclock
> +      * has never been generated at all, call kvm_guest_time_update().
> +      */
> +     if (kvm_check_request(KVM_REQ_CLOCK_UPDATE, v) || !vcpu->hw_tsc_hz) {
> +             int idx = srcu_read_lock(&v->kvm->srcu);
> +             int ret = kvm_guest_time_update(v);
> +
> +             srcu_read_unlock(&v->kvm->srcu, idx);
> +             if (ret)
> +                     return -EINVAL;
> +     }
> +
> +     /*
> +      * Reconstruct the pvclock from the master clock state, matching
> +      * exactly what kvm_guest_time_update() writes to the guest.
> +      */
> +     do {
> +             seq = read_seqcount_begin(&ka->pvclock_sc);
> +
> +             if (!ka->use_master_clock)
> +                     return -EINVAL;
> +
> +             hv_clock.tsc_timestamp = kvm_read_l1_tsc(v, ka->master_cycle_now);
> +             hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;
> +     } while (read_seqcount_retry(&ka->pvclock_sc, seq));
> +
> +     hv_clock.tsc_shift = vcpu->pvclock_tsc_shift;
> +     hv_clock.tsc_to_system_mul = vcpu->pvclock_tsc_mul;
> +     hv_clock.flags = PVCLOCK_TSC_STABLE_BIT;
> +
> +     if (copy_to_user(argp, &hv_clock, sizeof(hv_clock)))
> +             return -EFAULT;
> +
> +     return 0;
> +}
> +
> +/*
> + * Reverse the calculation in the hv_clock definition.
> + *
> + * time_ns = ( (cycles << shift) * mul ) >> 32;
> + * (although shift can be negative, so that's bad C)
> + *
> + * So for a single second,
> + * NSEC_PER_SEC = ( ( FREQ_HZ << shift) * mul ) >> 32
> + * NSEC_PER_SEC << 32 = ( FREQ_HZ << shift ) * mul
> + * ( NSEC_PER_SEC << 32 ) / mul = FREQ_HZ << shift
> + * ( ( NSEC_PER_SEC << 32 ) / mul ) >> shift = FREQ_HZ
> + */
> +static u64 hvclock_to_hz(u32 mul, s8 shift)
> +{
> +     u64 tm = NSEC_PER_SEC << 32;
> +
> +     /* Maximise precision. Shift left until the top bit is set */
> +     tm <<= 2;
> +     shift += 2;
> +
> +     /* While 'mul' is even, increase the shift *after* the division */
> +     while (!(mul & 1)) {
> +             shift++;
> +             mul >>= 1;
> +     }
> +
> +     tm /= mul;
> +
> +     if (shift > 0)
> +             return tm >> shift;
> +     else
> +             return tm << -shift;
> +}
> +
> +static int kvm_vcpu_ioctl_set_clock_guest(struct kvm_vcpu *v, void __user *argp)
> +{
> +     struct pvclock_vcpu_time_info user_hv_clock;
> +     struct kvm *kvm = v->kvm;
> +     struct kvm_arch *ka = &kvm->arch;
> +     u64 curr_tsc_hz, user_tsc_hz;
> +     u64 user_clk_ns;
> +     u64 guest_tsc;
> +     int rc = 0;
> +
> +     if (copy_from_user(&user_hv_clock, argp, sizeof(user_hv_clock)))
> +             return -EFAULT;
> +
> +     if (user_hv_clock.pad0 || user_hv_clock.pad[0] || user_hv_clock.pad[1])
> +             return -EINVAL;
> +
> +     if (!user_hv_clock.tsc_to_system_mul)
> +             return -EINVAL;
> +
> +     if (user_hv_clock.tsc_shift < -32 || user_hv_clock.tsc_shift > 32)
> +             return -EINVAL;
> +
> +     user_tsc_hz = hvclock_to_hz(user_hv_clock.tsc_to_system_mul,
> +                                 user_hv_clock.tsc_shift);
> +
> +     kvm_hv_request_tsc_page_update(kvm);
> +
> +     /*
> +      * kvm_start_pvclock_update() takes tsc_write_lock and opens
> +      * the pvclock seqcount; kvm_end_pvclock_update() closes both.
> +      * All clock state modifications between them are atomic with
> +      * respect to readers in kvm_guest_time_update().
> +      */
> +     kvm_start_pvclock_update(kvm);
> +     pvclock_update_vm_gtod_copy(kvm);
> +
> +     if (!ka->use_master_clock) {
> +             rc = -EINVAL;
> +             goto out;
> +     }
> +
> +     curr_tsc_hz = (u64)get_cpu_tsc_khz() * 1000;
> +     if (unlikely(curr_tsc_hz == 0)) {
> +             rc = -EINVAL;
> +             goto out;
> +     }
> +
> +     if (kvm_caps.has_tsc_control)
> +             curr_tsc_hz = kvm_scale_tsc(curr_tsc_hz,
> +                                         v->arch.l1_tsc_scaling_ratio);
> +
> +     /*
> +      * Allow for a discrepancy of 1 kHz either way between the TSC
> +      * frequency used to generate the user's pvclock and the current
> +      * host's measured frequency, since they may not precisely match.
> +      */
> +     if (user_tsc_hz < curr_tsc_hz - 1000 ||
> +         user_tsc_hz > curr_tsc_hz + 1000) {
> +             rc = -ERANGE;
> +             goto out;
> +     }
> +
> +     /*
> +      * Calculate the guest TSC at the new reference point, and the
> +      * corresponding KVM clock value according to user_hv_clock.
> +      * Adjust kvmclock_offset so both definitions agree.
> +      */
> +     guest_tsc = kvm_read_l1_tsc(v, ka->master_cycle_now);
> +     user_clk_ns = __pvclock_read_cycles(&user_hv_clock, guest_tsc);
> +     ka->kvmclock_offset = user_clk_ns - ka->master_kernel_ns;
> +
> +out:
> +     kvm_end_pvclock_update(kvm);
> +     return rc;
> +}
> +#endif
> +
>  long kvm_arch_vcpu_ioctl(struct file *filp,
>                        unsigned int ioctl, unsigned long arg)
>  {
> @@ -6605,6 +6761,14 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
>               srcu_read_unlock(&vcpu->kvm->srcu, idx);
>               break;
>       }
> +#ifdef CONFIG_X86_64
> +     case KVM_SET_CLOCK_GUEST:
> +             r = kvm_vcpu_ioctl_set_clock_guest(vcpu, argp);
> +             break;
> +     case KVM_GET_CLOCK_GUEST:
> +             r = kvm_vcpu_ioctl_get_clock_guest(vcpu, argp);
> +             break;
> +#endif
>  #ifdef CONFIG_KVM_HYPERV
>       case KVM_GET_SUPPORTED_HV_CPUID:
>               r = kvm_ioctl_get_supported_hv_cpuid(vcpu, argp);
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 6c8afa2047bf..9b50191b859c 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1669,4 +1669,7 @@ struct kvm_pre_fault_memory {
>       __u64 padding[5];
>  };
>  
> +#define KVM_SET_CLOCK_GUEST  _IOW(KVMIO, 0xd6, struct pvclock_vcpu_time_info)
> +#define KVM_GET_CLOCK_GUEST  _IOR(KVMIO, 0xd7, struct pvclock_vcpu_time_info)
> +
>  #endif /* __LINUX_KVM_H */
