[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] "vmalloc sync one failure" and host lockup



>>> On 11.03.13 at 18:54, Ian Jackson <Ian.Jackson@xxxxxxxxxxxxx> wrote:
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -206,8 +206,18 @@ static inline pmd_t *vmalloc_sync_one(pgd_t *pgd, 
> unsigned long address)
>  
>       if (!pmd_present(*pmd))
>               set_pmd(pmd, *pmd_k);
> -     else
> +     else {
> +             printk(KERN_CRIT "vmalloc sync one failure for %#lx\n", 
> address);
> +             printk(KERN_CRIT "  pgd %p = %#010llx\n", pgd, pgd_val(*pgd));
> +             printk(KERN_CRIT "pgd_k %p = %#010llx\n", pgd_k, 
> pgd_val(*pgd_k));
> +             printk(KERN_CRIT "  pud %p = %#010llx\n", pud, pud_val(*pud));
> +             printk(KERN_CRIT "pud_k %p = %#010llx\n", pud_k, 
> pud_val(*pud_k));
> +             printk(KERN_CRIT "  pmd %p = %#010llx\n", pmd, pmd_val(*pmd));
> +             printk(KERN_CRIT "pmd_k %p = %#010llx\n", pmd_k, 
> pmd_val(*pmd_k));
> +             printk(KERN_CRIT "pmd page   %p\n", pmd_page(*pmd));
> +             printk(KERN_CRIT "pmd_k page %p\n", pmd_page(*pmd_k));

All of these printk()-s should be made conditional upon the BUG_ON()
condition below (so perhaps the else above should become an "else
if"). Otherwise, the one possible real failure will be lost in the noise
of all the pointless printk()-s.

>               BUG_ON(pmd_page(*pmd) != pmd_page(*pmd_k));
> +     }
>  
>       return pmd_k;
>  }
> @@ -229,15 +239,15 @@ void vmalloc_sync_all(void)
>               spin_lock_irqsave(&pgd_lock, flags);
>               list_for_each_entry(page, &pgd_list, lru) {
>                       spinlock_t *pgt_lock;
> -                     int ret;
> +                     pmd_t *pmd;

This change in particular I find very strange: For one the code in
3.9-rc2 here is

                spin_lock(&pgd_lock);
                list_for_each_entry(page, &pgd_list, lru) {
                        spinlock_t *pgt_lock;
                        pmd_t *ret;

(i.e. the patch wouldn't even apply). And the type change clearly
ought to make the compiler choke on ...

>  
>                       pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
>  
>                       spin_lock(pgt_lock);
> -                     ret = vmalloc_sync_one(page_address(page), address);
> +                     pmd = vmalloc_sync_one(page_address(page), address);

... this assignment (as the patch doesn't change
vmalloc_sync_one()'s return type).

>                       spin_unlock(pgt_lock);
>  
> -                     if (!ret)
> +                     if (!pmd)
>                               break;
>               }
>               spin_unlock_irqrestore(&pgd_lock, flags);
> 
> With this patch it still fails, at least some of the time.  I no
> longer see a crash, however.  Instead, the host seems to lock up.
> 
> Here is an example where the host locks up during xm restore:
> 
> 2013-03-09 01:52:27 Z executing ssh ... root@xxxxxxxxxxxx xm restore  image
> 2013-03-09 01:55:47 Z command timed out [200]: ssh -o 
> StrictHostKeyChecking=no 
> -o BatchMode=yes -o ConnectTimeout=100 -o ServerAliveInterval=100 -o 
> PasswordAuthentication=no -o ChallengeResponseAuthentication=no -o 
> UserKnownHostsFile=tmp/t.known_hosts_17183.test-amd64-i386-pv 
> root@xxxxxxxxxxxx 
> xm restore  image
> status (timed out) at Osstest/TestSupport.pm line 375.
> 
> After this the host is unresponsive until power cycled.
> On the serial console:
> 
> Mar  8 17:29:52.243032 [  182.877329] vmalloc sync one failure for 0xf7800000
> Mar  8 17:29:52.243069 [  182.877342]   pgd d6df4018 = 0x1b613001
> Mar  8 17:29:52.251048 [  182.877353] pgd_k c1667018 = 0x01812027
> Mar  8 17:29:52.251080 [  182.877364]   pud d6df4018 = 0x1b613001
> Mar  8 17:29:52.251115 [  182.877375] pud_k c1667018 = 0x01812027
> Mar  8 17:29:52.263034 [  182.877386]   pmd db613de0 = 0x1fc13067
> Mar  8 17:29:52.263067 [  182.877396] pmd_k c1812de0 = 0x1fc13067
> Mar  8 17:29:52.263102 [  182.877407] pmd page   c5fd4260
> Mar  8 17:29:52.271050 [  182.877417] pmd_k page c5fd4260
> Mar  8 17:29:52.271081 [  182.877429] vmalloc sync one failure for 0xf7800000
> Mar  8 17:29:52.283035 [  182.877441]   pgd df489018 = 0x135fe001
> Mar  8 17:29:52.283069 [  182.877452] pgd_k c1667018 = 0x01812027
> Mar  8 17:29:52.283103 [  182.877462]   pud df489018 = 0x135fe001
> Mar  8 17:29:52.291047 [  182.877473] pud_k c1667018 = 0x01812027
> Mar  8 17:29:52.291080 [  182.877484]   pmd d35fede0 = 0x1fc13067
> Mar  8 17:29:52.303031 [  182.877494] pmd_k c1812de0 = 0x1fc13067
> Mar  8 17:29:52.303064 [  182.877505] pmd page   c5fd4260
> Mar  8 17:29:52.303093 [  182.877515] pmd_k page c5fd4260
> (lots more; note to self: flight 17183)

Two cases that simply would have gone unnoticed without the
printing, so I wonder whether it's the amount of printing that
causes the lockup.

Looking further I see that the kernel really is an old 2.6.32 based
one, perhaps from Jeremy's mostly obsolete tree. I'm not sure how
much value testing with that kernel has, considering the amount of
bug fixes that went into newer code.

But then at least the second half of the patch above makes sense.
Yet it makes obvious that the code is at least missing the equivalent
of commit a79e53d (which could explain a deadlock, but not the
crash).

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.