
Re: [Xen-devel] xenstored crashes with SIGSEGV



On Mon, 2014-12-15 at 23:29 +0100, Philipp Hahn wrote:
> Hello Ian,
> 
> On 15.12.2014 18:45, Ian Campbell wrote:
> > On Mon, 2014-12-15 at 14:50 +0000, Ian Campbell wrote:
> >> On Mon, 2014-12-15 at 15:19 +0100, Philipp Hahn wrote:
> >>> I just noticed something strange:
> >>>
> >>>> #3  0x000000000040a684 in tdb_open (name=0xff00000000 <Address
> >>>> 0xff00000000 out of bounds>, hash_size=0,
> >>>>     tdb_flags=4254928, open_flags=-1, mode=3119127560) at tdb.c:1773
> ...
> > I'm reasonably convinced now that this is just a weird artefact of
> > running gdb on an optimised binary, probably a shortcoming in the debug
> > info leading to gdb getting confused.
> > 
> > Unfortunately this also calls into doubt the parameter to talloc_free,
> > perhaps in that context 0xff0000000 is a similar artefact.
> > 
> > Please can you print the entire contents of tdb in the second frame
> > ("print *tdb" ought to do it). I'm curious whether it is all sane or
> > not.
> 
> (gdb) print *tdb
> $1 = {name = 0x0, map_ptr = 0x0, fd = 47, map_size = 65280, read_only =
> 16711680,
>   locked = 0xff0000000000,

So it really does seem to be 0xff0000000000 in memory.

> flags = 0,
> travlocks = {
>     next = 0xff0000, off = 0, hash = 65280}, next = 0xff0000,
>   device = 280375465082880, inode = 16711680, log_fn = 0x4093b0
> <null_log_fn>,
>   hash_fn = 0x4092f0 <default_tdb_hash>, open_flags = 2}

And here we can see tdb->flags == 0 and tdb->open_flags == 2, contrary
to the nonsense values the stack trace says we were called with. Since 0
and 2 are sensible and correspond to what the caller passes, I think the
stack trace is just confused.

> (gdb) info registers
> rax            0x0      0
> rbx            0x16bff70        23854960
> rcx            0xffffffffffffffff       -1
> rdx            0x40ecd0 4254928
> rsi            0x0      0
> rdi            0xff0000000000   280375465082880

And here it is in the registers.

> rbp            0x7fcaed6c96a8   0x7fcaed6c96a8
> rsp            0x7fff9dc86330   0x7fff9dc86330
> r8             0x7fcaece54c08   140509534571528
> r9             0xff00000000000000       -72057594037927936
> r10            0x7fcaed08c14c   140509536895308
> r11            0x246    582
> r12            0xd      13
> r13            0xff0000000000   280375465082880

And again.

> r14            0x4093b0 4232112
> r15            0x167d620        23582240
> rip            0x4075c4 0x4075c4 <talloc_chunk_from_ptr+4>

This must be the address of the faulting instruction.

> eflags         0x10206  [ PF IF RF ]
> cs             0x33     51
> ss             0x2b     43
> ds             0x0      0
> es             0x0      0
> fs             0x0      0
> gs             0x0      0
> fctrl          0x0      0
> fstat          0x0      0
> ftag           0x0      0
> fiseg          0x0      0
> fioff          0x0      0
> foseg          0x0      0
> fooff          0x0      0
> fop            0x0      0
> mxcsr          0x0      [ ]
> 
> (gdb) disassemble
> Dump of assembler code for function talloc_chunk_from_ptr:
> 0x00000000004075c0 <talloc_chunk_from_ptr+0>:   sub    $0x8,%rsp
> 0x00000000004075c4 <talloc_chunk_from_ptr+4>:   mov    -0x8(%rdi),%edx

This is the line corresponding to %rip above, which is doing a read via
%rdi, which is 0xff0000000000.

It's reading tc->flags. It's been optimised, tc = pp - SIZE, so it is
loading *(pp-SIZE+offsetof(flags)), which is pp-8 (flags is the last
field in the struct).

So %rdi contains pp, which is the ptr given as the argument to the
function, so ptr was bogus.
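
To make the offset arithmetic concrete, here is a minimal sketch. The
struct below is a hypothetical stand-in for the talloc chunk header, not
copied from talloc.c: it only assumes a header of 0x50 bytes on x86_64
with flags as the last field, which is what the lea -0x50(%rdi) and the
mov -0x8(%rdi) in the disassembly suggest. The names chunk_hdr,
hdr_from_ptr and flags_offset_from_ptr are made up for illustration.

```c
#include <stddef.h>

/* Hypothetical stand-in for the chunk header (assumption: nine
 * pointer-sized members followed by a 32-bit flags word, which gives
 * sizeof == 0x50 and offsetof(flags) == 0x48 on x86_64). */
struct chunk_hdr {
    void *other_fields[9];  /* placeholder for the real members */
    unsigned int flags;     /* last member, padded up to 0x50 */
};

#define HDR_SIZE sizeof(struct chunk_hdr)

/* tc = pp - SIZE: step back from the user pointer to the header. */
static struct chunk_hdr *hdr_from_ptr(void *ptr)
{
    return (struct chunk_hdr *)((char *)ptr - HDR_SIZE);
}

/* Offset of flags relative to the user pointer:
 * -SIZE + offsetof(flags), i.e. 0x48 - 0x50 = -8 with this layout. */
static ptrdiff_t flags_offset_from_ptr(void)
{
    return (ptrdiff_t)offsetof(struct chunk_hdr, flags) - (ptrdiff_t)HDR_SIZE;
}
```

With ptr == 0xff0000000000 the flags load touches 0xfeffffffff8, which
is presumably not mapped, hence the SIGSEGV on the very first load.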

So it seems we really do have tdb->locked containing 0xff0000000000.

This is only allocated in one place which is:
        tdb->locked = talloc_zero_array(tdb, struct tdb_lock_type,
                                        tdb->header.hash_size+1);
midway through tdb_open_ex. It might be worth inserting a check+log for
this returning 0xff, 0xff00, 0xff0000 ... 0xff0000000000 etc.
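
A check along those lines could look roughly like the sketch below. It
is only an illustration: is_single_ff_byte and check_locked_ptr are
hypothetical helper names, and fprintf is a stand-in for whatever
logging xenstored/tdb would really use at that point.

```c
#include <stdint.h>
#include <stdio.h>

/* Returns 1 if v consists of exactly one 0xff byte with all other
 * bytes zero (0xff, 0xff00, 0xff0000, ...), else 0. */
static int is_single_ff_byte(uintptr_t v)
{
    unsigned int shift;

    for (shift = 0; shift < sizeof(v) * 8; shift += 8)
        if (v == ((uintptr_t)0xff << shift))
            return 1;
    return 0;
}

/* Logs and returns 1 if the pointer matches the suspicious pattern. */
static int check_locked_ptr(const void *p)
{
    if (is_single_ff_byte((uintptr_t)p)) {
        fprintf(stderr, "suspicious tdb->locked value %p\n", p);
        return 1;
    }
    return 0;
}
```

Calling check_locked_ptr(tdb->locked) right after the allocation, and
perhaps again just before the talloc_free, would show whether the
allocator returned the bad value or whether it was overwritten later.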

> 0x00000000004075c7 <talloc_chunk_from_ptr+7>:   lea    -0x50(%rdi),%rax

This is actually calculating tc, ready for return upon success.

> 0x00000000004075cb <talloc_chunk_from_ptr+11>:  mov    %edx,%ecx
> 0x00000000004075cd <talloc_chunk_from_ptr+13>:  and    $0xfffffffffffffff0,%ecx
> 0x00000000004075d0 <talloc_chunk_from_ptr+16>:  cmp    $0xe814ec70,%ecx
> 0x00000000004075d6 <talloc_chunk_from_ptr+22>:  jne    0x4075e2 <talloc_chunk_from_ptr+34>

(tc->flags & ~0xF) != TALLOC_MAGIC

> 0x00000000004075d8 <talloc_chunk_from_ptr+24>:  and    $0x1,%edx
> 0x00000000004075db <talloc_chunk_from_ptr+27>:  jne    0x4075e2 <talloc_chunk_from_ptr+34>

tc->flags & TALLOC_FLAG_FREE

> 0x00000000004075dd <talloc_chunk_from_ptr+29>:  add    $0x8,%rsp
> 0x00000000004075e1 <talloc_chunk_from_ptr+33>:  retq

Success, return.

> 0x00000000004075e2 <talloc_chunk_from_ptr+34>:  nopw   0x0(%rax,%rax,1)
> 0x00000000004075e8 <talloc_chunk_from_ptr+40>:  callq  0x401b98 <abort@plt>

Both TALLOC_ABORT calls end up here if either of the checks above fails.
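
Put together, the two checks in the disassembly amount to something
like the following sketch. The magic value and the masks are the
constants visible in the disassembly; the enum and the function name
classify_flags are hypothetical, and the sketch returns a status where
the real code calls abort().

```c
#include <stdint.h>

#define MAGIC     0xe814ec70u  /* value compared against at +16 */
#define FLAG_MASK 0xFu         /* low bits masked off at +13 */
#define FLAG_FREE 0x1u         /* bit tested at +24 */

enum chunk_state { CHUNK_OK, CHUNK_BAD_MAGIC, CHUNK_FREED };

/* Mirrors the two checks: magic first, then the free flag. */
static enum chunk_state classify_flags(uint32_t flags)
{
    if ((flags & ~FLAG_MASK) != MAGIC)
        return CHUNK_BAD_MAGIC;  /* first jne to +34 */
    if (flags & FLAG_FREE)
        return CHUNK_FREED;      /* second jne to +34 */
    return CHUNK_OK;             /* falls through to retq */
}
```

Note that in this crash neither check is reached: the load of flags at
+4 faults before any comparison happens.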

> > Can you also "p $_siginfo._sifields._sigfault.si_addr" (in frame 0).
> > This ought to be the actual faulting address, which ought to give a hint
> > on how much we can trust the parameters in the stack trace.
> 
> Hmm, my gdb refused to access $_siginfo:
> (gdb) show convenience
> $_siginfo = Unable to read siginfo

That's ok, I think I've convinced myself above what the crash is.

Ian.


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

