Re: [Xen-devel] frequently ballooning results in qemu exit

At 05:54 +0000 on 15 Mar (1363326854), Hanweidong wrote:
> > > I'm also curious about this. There is a window between memory balloon out
> > > and QEMU invalidate mapcache.
> > 
> > That by itself is OK; I don't think we need to provide any meaningful
> > semantics if the guest is accessing memory that it's ballooned out.
> > 
> > The question is where the SIGBUS comes from: either qemu has a mapping
> > of the old memory, in which case it can write to it safely, or it
> > doesn't, in which case it shouldn't try.
> The error always happened at memcpy in if (is_write) branch in
> address_space_rw.

Sure, but _why_?  Why does this access cause SIGBUS?  Presumably there's
some part of the mapcache code that thinks it has a mapping there when
it doesn't.

> We found that, after the last xen_invalidate_map_cache, the mapcache entry 
> related to the failed address was mapped:
>       ==xen_map_cache== phys_addr=7a3c1ec0 size=0 lock=0
>       ==xen_remap_bucket== begin size=1048576 ,address_index=7a3
>       ==xen_remap_bucket== end 
> entry->paddr_index=7a3,entry->vaddr_base=2a2d9000,size=1048576,address_index=7a3

OK, so that's 0x2a2d9000 -- 0x2a3d8fff.

>       ==address_space_rw== ptr=2a39aec0
>       ==xen_map_cache== phys_addr=7a3c1ec4 size=0 lock=0
>       ==xen_map_cache==first return 2a2d9000+c1ec4=2a39aec4
>       ==address_space_rw== ptr=2a39aec4
>       ==xen_map_cache== phys_addr=7a3c1ec8 size=0 lock=0
>       ==xen_map_cache==first return 2a2d9000+c1ec8=2a39aec8
>       ==address_space_rw== ptr=2a39aec8
>       ==xen_map_cache== phys_addr=7a3c1ecc size=0 lock=0
>       ==xen_map_cache==first return 2a2d9000+c1ecc=2a39aecc
>       ==address_space_rw== ptr=2a39aecc

These are all to page 0x2a39a___.

>       ==xen_map_cache== phys_addr=7a16c108 size=0 lock=0
>       ==xen_map_cache== return 2a407000+6c108=2a473108
>       ==xen_map_cache== phys_addr=7a16c10c size=0 lock=0
>       ==xen_map_cache==first return 2a407000+6c10c=2a47310c
>       ==xen_map_cache== phys_addr=7a16c110 size=0 lock=0
>       ==xen_map_cache==first return 2a407000+6c110=2a473110
>       ==xen_map_cache== phys_addr=7a395000 size=0 lock=0
>       ==xen_map_cache== return 2a2d9000+95000=2a36e000
>       ==address_space_rw== ptr=2a36e000

And this is to page 0x2a36e___, a different page in the same bucket.
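
To spell out the arithmetic I'm doing on your log, here is a rough sketch.
It assumes 1MB buckets (the size=1048576 you printed) and 4k pages; the
helper names are made up for illustration and are not QEMU's.

```c
#include <stdint.h>

/* Assumptions taken from the log, not from the QEMU source:
 * buckets are 1 MiB (size=1048576) and host pages are 4 KiB. */
#define SKETCH_BUCKET_SHIFT 20
#define SKETCH_BUCKET_SIZE  (UINT64_C(1) << SKETCH_BUCKET_SHIFT)
#define SKETCH_PAGE_SHIFT   12

/* Bucket number, as printed in "address_index=". */
static inline uint64_t bucket_index(uint64_t phys_addr)
{
    return phys_addr >> SKETCH_BUCKET_SHIFT;
}

/* Byte offset of phys_addr inside its bucket. */
static inline uint64_t bucket_offset(uint64_t phys_addr)
{
    return phys_addr & (SKETCH_BUCKET_SIZE - 1);
}

/* Host pointer the mapcache hands back: vaddr_base plus the offset. */
static inline uint64_t host_ptr(uint64_t vaddr_base, uint64_t phys_addr)
{
    return vaddr_base + bucket_offset(phys_addr);
}

/* Host page number of a pointer (the part I write before "___"). */
static inline uint64_t host_page(uint64_t ptr)
{
    return ptr >> SKETCH_PAGE_SHIFT;
}

/* Last byte covered by a bucket mapped at vaddr_base. */
static inline uint64_t bucket_last_byte(uint64_t vaddr_base)
{
    return vaddr_base + SKETCH_BUCKET_SIZE - 1;
}
```

With vaddr_base=2a2d9000, phys_addr 7a3c1ec4 and 7a395000 both land in
bucket 7a3, on host pages 2a39a and 2a36e respectively.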

>       here, the SIGBUS error occurred.

So that page isn't mapped.  Which means:
- it was never mapped (and the mapcache code didn't handle the error
  correctly at map time); or
- it was never mapped (and the mapcache hasn't checked its own records
  before using the map); or
- it was mapped (and something unmapped it in the meantime).

Why not add some tests in xen_remap_bucket to check that all the pages
that qemu records as mapped are actually there?
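
Something along these lines, as a rough sketch only.  first_bad_page() is
a made-up helper, not existing QEMU code.  I'm assuming the caller has a
per-page error array from the map call (which I believe the bulk foreign
mapping call fills in) and a per-page record of what the mapcache thinks
is mapped; both arrays here are stand-ins for whatever the real code keeps.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, for illustration only.
 *   valid[i] != 0 : the mapcache's records say page i of the bucket is mapped
 *   err[i]   != 0 : the map call reported a failure for page i
 * Returns the index of the first page that is recorded as mapped even
 * though mapping it failed, or -1 if records and results agree.
 */
static inline long first_bad_page(const unsigned char *valid,
                                  const int *err, size_t nb_pfn)
{
    assert(valid != NULL && err != NULL);

    for (size_t i = 0; i < nb_pfn; i++) {
        if (valid[i] && err[i] != 0) {
            return (long)i;
        }
    }
    return -1;
}
```

Called right after the bucket is mapped, a non-negative result would point
at the first case above.  If it returns -1 there and the access still
faults later, that suggests something unmapped the page in the meantime.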


