[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [Patch] x86/HVM: Fix RTC interrupt modelling



On 10/02/14 16:34, Jan Beulich wrote:
>>>> On 10.02.14 at 12:17, Andrew Cooper <andrew.cooper3@xxxxxxxxxx> wrote:
>> This reverts large amounts of:
>>   9607327abbd3e77bde6cc7b5327f3efd781fc06e
>>     "x86/HVM: properly handle RTC periodic timer even when !RTC_PIE"
>>   620d5dad54008e40798c4a0c4322aef274c36fa3
>>     "x86/HVM: assorted RTC emulation adjustments"
>>
>> and by extentsion:
>>   f3347f520cb4d8aa4566182b013c6758d80cbe88
>>     "x86/HVM: adjust IRQ (de-)assertion"
>>   c2f79c464849e5f796aa9d1d0f26fe356abd1a1a
>>     "x86/HVM: fix processing of RTC REG_B writes"
>>   527824f41f5fac9cba3d4441b2e73d3118d98837
>>     "x86/hvm: Centralize and simplify the RTC IRQ logic."
> So what does "by extension" mean here? Are these being
> reverted?

The logic here was based on the logic being reverted, so the changes
themselves are mostly gone.

>
>> The current code has a pathological case, tickled by the access pattern of
>> Windows 2003 Server SP2.  Occasonally on boot (which I presume is during a
>> time calibration against the RTC Periodic Timer), Windows gets stuck in an
>> infinite loop reading RTC REG_C.  This affects 32 and 64 bit guests.
>>
>> In the pathological case, the VM state looks like this:
>>   * RTC: 64Hz period, periodic interrupts enabled
>>   * RTC_IRQ in IOAPIC as vector 0xd1, edge triggered, not pending
>>   * vector 0xd1 set in LAPIC IRR and ISR, TPR at 0xd0
>>   * Reads from REG_C return 'RTC_PF | RTC_IRQF'
>>
>> With an intstrumented Xen, dumping the periodic timers with a guest in this
>> state shows a single timer with pt->irq_issued=1 and pt->pending_intr_nr=2.
>>
>> Windows is presumably waiting for reads of REG_C to drop to 0, and reading
>> REG_C clears the value each time in the emulated RTC.  However:
>>
>>   * {svm,vmx}_intr_assist() calls pt_update_irq() unconditionally.
>>   * pt_update_irq() always finds the RTC as earliest_pt.
>>   * rtc_periodic_interrupt() unconditionally sets RTC_PF in no_ack mode.  It
>>     returns true, indicating that pt_update_irq() should really inject the
>>     interrupt.
>>   * pt_update_irq() decides that it doesn't need to fake up part of
>>     pt_intr_post() because this is a real interrupt.
>>   * {svm,vmx}_intr_assist() can't inject the interrupt as it is already
>>     pending, so exits early without calling pt_intr_post().
>>
>> The underlying problem here comes because the AF and UF bits of RTC 
>> interrupt
>> state is modelled by the RTC code, but the PF is modelled by the pt code.  
>> The
>> root cause of windows infinite loop is that RTC_PF is being re-set on 
>> vmentry
>> before the interrupt logic has worked out that it can't actually inject an 
>> RTC
>> interrupt, causing Windows to erroniously read (RTC_PF|RTC_IRQF) when it
>> should be reading 0.
> So you're undoing a whole lot of changes done with the goal of
> getting the overall emulation closer to what real hardware does,
> just to paper over an issue elsewhere in the code? Not really an
> approach I'm in favor of.

I fail to see how the current code is closer to what hardware does than
this proposed patch.

>
>>   * The results from XenRT suggest that the new emulation is better than the
>>     old.
> "Better" in the sense of the limited set of uses of the virtual hardware
> by whatever selection of guest OSes is being run there. But very
> likely not "better" in the sense on matching up with how the respective
> hardware specification would require it to behave.
>
>> --- a/xen/arch/x86/hvm/rtc.c
>> +++ b/xen/arch/x86/hvm/rtc.c
>> @@ -59,48 +59,78 @@ static void rtc_set_time(RTCState *s);
>>  static inline int from_bcd(RTCState *s, int a);
>>  static inline int convert_hour(RTCState *s, int hour);
>>  
>> -static void rtc_update_irq(RTCState *s)
>> +/*
>> + * Send an edge on the RTC ISA IRQ line.  The RTC spec states that it should
>> + * be a line level interrupt, but the PIIX3 states that it must be edge
>> + * triggered.  We model the RTC using edge semantics.
>> + */
>> +static void rtc_toggle_irq(RTCState *s)
>>  {
>> +    hvm_isa_irq_deassert(vrtc_domain(s), RTC_IRQ);
>> +    hvm_isa_irq_assert(vrtc_domain(s), RTC_IRQ);
>> +}
>> +
>> +static void rtc_update_regb(RTCState *s, uint8_t new_b)
>> +{
>> +    uint8_t new_c = s->hw.cmos_data[RTC_REG_C] & ~RTC_IRQF;
>> +
>>      ASSERT(spin_is_locked(&s->lock));
>>  
>> -    if ( rtc_mode_is(s, strict) && (s->hw.cmos_data[RTC_REG_C] & RTC_IRQF) )
>> -        return;
>> +    if ( new_b & new_c & (RTC_PF | RTC_AF | RTC_UF) )
>> +        new_c |= RTC_IRQF;
> Without going back to reading the spec, iirc RTC_IRQF is a sticky
> bit cleared _only_ be REG_C reads, i.e. you shouldn't clear it
> earlier in the function and then conditionally set it here.

The main problem is that the MC146818 states that the interrupt is line
level, while the PIIX3 mandates that it is edge triggered, and our ACPI
tables define it to be edge triggered.

The datasheet states "The IRQF bit in Register C is a '1' whenever the
¬IRQ pin is being driven low".  With edge semantics where the line never
actually stays asserted, this would degrade to being sticky until read.

However, with edge semantics, we need to send new interrupts when
enabling bits in REG_B even if IRQF is already outstanding, which is why
the logic starts by masking it back out.  Furthermore, in no-ack mode,
we expect IRQF to be set (as the guest isn't reading REG_C), but still
wanting interrupts.

Perhaps IRQF should be or'd back together when writing the state back.

>
>>  
>> -    /* IRQ is raised if any source is both raised & enabled */
>> -    if ( !(s->hw.cmos_data[RTC_REG_B] &
>> -           s->hw.cmos_data[RTC_REG_C] &
>> -           (RTC_PF | RTC_AF | RTC_UF)) )
>> -        return;
>> +    s->hw.cmos_data[RTC_REG_B] = new_b;
>> +    s->hw.cmos_data[RTC_REG_C] = new_c;
>>  
>> -    s->hw.cmos_data[RTC_REG_C] |= RTC_IRQF;
>> -    if ( rtc_mode_is(s, no_ack) )
>> -        hvm_isa_irq_deassert(vrtc_domain(s), RTC_IRQ);
>> -    hvm_isa_irq_assert(vrtc_domain(s), RTC_IRQ);
>> +    if ( new_c & RTC_IRQF )
>> +        rtc_toggle_irq(s);
> Which then implies that the condition here would also need to
> consider the old state of the flag.
>
>> +static void rtc_irq_event(RTCState *s, uint8_t event)
>> +{
>> +    uint8_t b = s->hw.cmos_data[RTC_REG_B];
>> +    uint8_t old_c = s->hw.cmos_data[RTC_REG_C];
>> +    uint8_t new_c = old_c & ~RTC_IRQF;
>> +
>> +    ASSERT(spin_is_locked(&s->lock));
>> +
>> +    if ( b & new_c & (RTC_PF | RTC_AF | RTC_UF) )
>> +        new_c |= RTC_IRQF;
> Same comment as above.
>
>> @@ -647,9 +674,6 @@ static uint32_t rtc_ioport_read(RTCState *s, uint32_t 
>> addr)
>>      case RTC_REG_C:
>>          ret = s->hw.cmos_data[s->hw.cmos_index];
>>          s->hw.cmos_data[RTC_REG_C] = 0x00;
>> -        if ( (ret & RTC_IRQF) && !rtc_mode_is(s, no_ack) )
>> -            hvm_isa_irq_deassert(d, RTC_IRQ);
> Why? With RTC_IRQF going from 1 to 0, the interrupt line should
> get de-asserted.
>
>> -        rtc_update_irq(s);
> So given the problem description, this would seem to be the most
> important part at a first glance. But looking more closely, I'm getting
> the impression that the call to rtc_update_irq() had no effect at all
> here anyway: The function would always bail on the second if() due
> to REG_C having got cleared a few lines up.
>
>> @@ -270,47 +263,12 @@ int pt_update_irq(struct vcpu *v)
>>      earliest_pt->irq_issued = 1;
>>      irq = earliest_pt->irq;
>>      is_lapic = (earliest_pt->source == PTSRC_lapic);
>> -    pt_priv = earliest_pt->priv;
>>  
>>      spin_unlock(&v->arch.hvm_vcpu.tm_lock);
>>  
>>      if ( is_lapic )
>> -        vlapic_set_irq(vcpu_vlapic(v), irq, 0);
>> -    else if ( irq == RTC_IRQ && pt_priv )
>>      {
>> -        if ( !rtc_periodic_interrupt(pt_priv) )
>> -            irq = -1;
>> -
>> -        pt_lock(earliest_pt);
>> -
>> -        if ( irq < 0 && earliest_pt->pending_intr_nr )
>> -        {
>> -            /*
>> -             * RTC periodic timer runs without the corresponding interrupt
>> -             * being enabled - need to mimic enough of pt_intr_post() to 
>> keep
>> -             * things going.
>> -             */
>> -            earliest_pt->pending_intr_nr = 0;
>> -            earliest_pt->irq_issued = 0;
>> -            set_timer(&earliest_pt->timer, earliest_pt->scheduled);
>> -        }
>> -        else if ( irq >= 0 && pt_irq_masked(earliest_pt) )
>> -        {
>> -            if ( earliest_pt->on_list )
>> -            {
>> -                /* suspend timer emulation */
>> -                list_del(&earliest_pt->list);
>> -                earliest_pt->on_list = 0;
>> -            }
>> -            irq = -1;
>> -        }
>> -
>> -        /* Avoid dropping the lock if we can. */
>> -        if ( irq < 0 && v == earliest_pt->vcpu )
>> -            goto rescan_locked;
>> -        pt_unlock(earliest_pt);
>> -        if ( irq < 0 )
>> -            goto rescan;
>> +        vlapic_set_irq(vcpu_vlapic(v), irq, 0);
> If you didn't put this single function call in braces, the patch would
> become more clear, as it would then be exactly the "else if()"
> branch that got removed by it.

Right - This part of the patch was produced with reversion alone, but
dropping the braces would make it clearer.

>
> As a round-up: I'm not going to veto this, but I'm also not going to
> be putting my name under it, nor am I going to make another
> attempt to clean up the RTC emulation if this is to go in unchanged.
> I'm personally getting the impression that the root cause of the
> observed problem is still being left in place (and perhaps still not
> being fully understood), and hence this whole change goes in the
> wrong direction, _even_ if it makes the problem it is aiming at
> fixing indeed appear to go away.
>
> Jan

I am not sure I follow you.  The root case of the infinite loop is
because Xen was erroniously setting REG_C.PF, because pt_update_irq()
was trying to pre-guess what the interrupt injection logic would do,
(and getting it wrong).

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.