Re: [Xen-devel] xenstored crashes with SIGSEGV
2014-12-16 12:23 GMT+00:00 Ian Campbell <Ian.Campbell@xxxxxxxxxx>:
> On Tue, 2014-12-16 at 11:30 +0000, Frediano Ziglio wrote:
>> 2014-12-16 11:06 GMT+00:00 Ian Campbell <Ian.Campbell@xxxxxxxxxx>:
>> > On Tue, 2014-12-16 at 10:45 +0000, Ian Campbell wrote:
>> >> On Mon, 2014-12-15 at 23:29 +0100, Philipp Hahn wrote:
>> >> > > I notice in your bugzilla (for a different occurrence, I think):
>> >> > >> [2090451.721705] univention-conf[2512]: segfault at ff00000000 ip
>> >> > >> 000000000045e238 sp 00007ffff68dfa30 error 6 in
>> >> > >> python2.6[400000+21e000]
>> >> > >
>> >> > > Which appears to have faulted accessing 0xff00000000 too. It looks like
>> >> > > this process is a python thing, it's nothing to do with xenstored I
>> >> > > assume?
>> >> >
>> >> > Yes, that's one univention-config, which is completely independent of
>> >> > xen(stored).
>> >> >
>> >> > > It seems rather coincidental that it should be accessing the
>> >> > > same sort of address and be faulting.
>> >> >
>> >> > Yes, good catch. I'll have another look at those core dumps.
>> >>
>> >> With this in mind, please can you confirm what model of machines you've
>> >> seen this on, and in particular whether they are all the same class of
>> >> machine or whether they are significantly different.
>> >>
>> >> The reason being that randomly placed 0xff values in a field of 0x00
>> >> could possibly indicate hardware (e.g. a GPU) DMAing over the wrong
>> >> memory pages.
>> >
>> > Thanks for giving me access to the core files. This is very suspicious:
>> >
>> > (gdb) frame 2
>> > #2  0x000000000040a348 in tdb_open_ex (name=0x1941fb0
>> >     "/var/lib/xenstored/tdb.0x1935bb0", hash_size=<value optimized out>,
>> >     tdb_flags=0, open_flags=<value optimized out>, mode=<value optimized out>,
>> >     log_fn=0x4093b0 <null_log_fn>, hash_fn=<value optimized out>) at
>> >     tdb.c:1958
>> > 1958            SAFE_FREE(tdb->locked);
>> >
>> > (gdb) x/96x tdb
>> > 0x1921270:      0x00000000      0x00000000      0x00000000      0x00000000
>> > 0x1921280:      0x0000001f      0x000000ff      0x0000ff00      0x000000ff
>> > 0x1921290:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
>> > 0x19212a0:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
>> > 0x19212b0:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
>> > 0x19212c0:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
>> > 0x19212d0:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
>> > 0x19212e0:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
>> > 0x19212f0:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
>> > 0x1921300:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
>> > 0x1921310:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
>> > 0x1921320:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
>> > 0x1921330:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
>> > 0x1921340:      0x00000000      0x00000000      0x0000ff00      0x000000ff
>> > 0x1921350:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
>> > 0x1921360:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
>> > 0x1921370:      0x004093b0      0x00000000      0x004092f0      0x00000000
>> > 0x1921380:      0x00000002      0x00000000      0x00000091      0x00000000
>> > 0x1921390:      0x0193de70      0x00000000      0x01963600      0x00000000
>> > 0x19213a0:      0x00000000      0x00000000      0x0193fbb0      0x00000000
>> > 0x19213b0:      0x00000000      0x00000000      0x00000000      0x00000000
>> > 0x19213c0:      0x00405870      0x00000000      0x0040e3e0      0x00000000
>> > 0x19213d0:      0x00000038      0x00000000      0xe814ec70      0x6f2f6567
>> > 0x19213e0:      0x01963650      0x00000000      0x0193dec0      0x00000000
>> >
>> > Something has clearly done a number on the RAM of this process.
>> > 0x1921270 through 0x192136f is 256 bytes...
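For reference, the crash site quoted above connects directly to the corruption: SAFE_FREE in the tdb code bundled with xenstored is, as far as I know, the usual Samba-style helper, so line 1958 is simply handing the stomped locked pointer (0xff00000000, as the later print *tdb output shows) to free(). A minimal sketch, assuming that definition:

#include <stdlib.h>

/* Sketch of the usual Samba-style SAFE_FREE helper; the copy in
 * xenstored's tdb.c is essentially this (not a verbatim quote). */
#define SAFE_FREE(p) do { if ((p) != NULL) { free(p); (p) = NULL; } } while (0)

/*
 * The failing statement in tdb_open_ex() is:
 *
 *     SAFE_FREE(tdb->locked);
 *
 * With tdb->locked overwritten to 0xff00000000, free() touches the malloc
 * chunk metadata next to that unmapped address and the process takes a
 * SIGSEGV there, the same sort of address the kernel reported for the
 * unrelated univention-conf segfault.
 */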
>> >
>> > Since it appears to be happening to other processes too I would hazard
>> > that this is not a xenstored issue.
>> >
>> > Ian.
>> >
>>
>> Good catch Ian!
>>
>> Strange corruption. Probably not related to xenstored as you
>> suggested. I would be curious to see what's before the tdb pointer and
>> where the corruption starts.
>
> (gdb) print tdb
> $2 = (TDB_CONTEXT *) 0x1921270
> (gdb) x/64x 0x1921200
> 0x1921200:      0x01921174      0x00000000      0x00000000      0x00000000
> 0x1921210:      0x01921174      0x00000000      0x00000171      0x00000000
> 0x1921220:      0x00000000      0x00000000      0x00000000      0x00000000

0x0 next (u64)
0x0 prev (u64)

> 0x1921230:      0x01941f60      0x00000000      0x00000000      0x00000000

0x01941f60 parent (u64), makes sense that it is not NULL
0x0 child (u64)

> 0x1921240:      0x00000000      0x00000000      0x00000000      0x6f630065

0x0 refs (u64)
0x0 null_refs (u32)
0x6f630065 pad, garbage (u32)

> 0x1921250:      0x00000000      0x00000000      0x0040e8a7      0x00000000

0x0 destructor (u64)
0x0040e8a7 name (u64)

> 0x1921260:      0x00000118      0x00000000      0xe814ec70      0x00000000

0x118 size (u64)
0xe814ec70 magic (u32)
0x0 pad (u32)

Well... the whole talloc header seems fine to me.

> 0x1921270:      0x00000000      0x00000000      0x00000000      0x00000000
> 0x1921280:      0x0000001f      0x000000ff      0x0000ff00      0x000000ff
> 0x1921290:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
> 0x19212a0:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
> 0x19212b0:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
> 0x19212c0:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
> 0x19212d0:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
> 0x19212e0:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
> 0x19212f0:      0x00000000      0x000000ff      0x0000ff00      0x000000ff
>
> So it appears to start at 0x1921270 or maybe ...6c.
>

It looks like there is a repeating pattern of 0x00000000 0x000000ff 0x0000ff00 0x000000ff;
the only exceptions are the fields that are set after the talloc_zero (fd, flags,
function pointers). It is as if the memset inside talloc_zero filled the memory with
this pattern instead of zeroes. Note that a 16-byte pattern matches the SSE register
size. Some bug in the save/restore of SSE registers? Some bug in SSE emulation? What
does the "info all-registers" gdb command say about the SSE registers? Do we have a
bug in Xen that affects SSE instructions (possibly already fixed after Philipp's
version)?
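To make the annotations above easier to follow, this is roughly the chunk header that the talloc bundled with xenstored puts in front of every allocation. It is a sketch reconstructed from the annotated dump, not a verbatim copy of talloc.c:

#include <stddef.h>

/* Approximate talloc chunk header, following the annotations above;
 * the authoritative definition lives in xenstored's talloc.c. */
struct talloc_chunk {
	struct talloc_chunk *next, *prev;    /* 0x1921220/0x1921228: both NULL here         */
	struct talloc_chunk *parent, *child; /* 0x1921230/0x1921238: parent set, child NULL */
	void *refs;                          /* 0x1921240: NULL                             */
	unsigned int null_refs;              /* 0x1921248: 0, followed by 4 bytes of pad    */
	int (*destructor)(void *);           /* 0x1921250: NULL                             */
	const char *name;                    /* 0x1921258: 0x0040e8a7                       */
	size_t size;                         /* 0x1921260: 0x118, size of the TDB_CONTEXT allocation */
	unsigned int flags;                  /* 0x1921268: 0xe814ec70, the talloc magic     */
};

/* The caller's data starts right after this 0x50-byte header, i.e. at
 * 0x1921270, exactly where the 0x00/0xff corruption begins. */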
>> I also don't understand where the "fd = 47" from a previous mail came
>> from. 0x1f is 31, not 47 (which is 0x2f).
>
> I must have been using a different coredump to the original report
> (there are several).
>
> In the one which corresponds to the above:
>
> (gdb) print *tdb
> $3 = {name = 0x0, map_ptr = 0x0, fd = 31, map_size = 255,
>   read_only = 65280, locked = 0xff00000000, ecode = 65280, header = {
>     magic_food = "\377\000\000\000\000\000\000\000\377\000\000\000\000\377\000\000\377\000\000\000\000\000\000\000\377\000\000\000\000\377\000",
>     version = 255, hash_size = 0, rwlocks = 255, reserved = {65280,
>       255, 0, 255, 65280, 255, 0, 255, 65280, 255, 0, 255, 65280,
>       255, 0, 255, 65280, 255, 0, 255, 65280, 255, 0, 255, 65280,
>       255, 0, 255, 65280, 255, 0}}, flags = 0, travlocks = {
>     next = 0xff0000ff00, off = 0, hash = 255}, next = 0xff0000ff00,
>   device = 1095216660480, inode = 1095216725760,
>   log_fn = 0x4093b0 <null_log_fn>,
>   hash_fn = 0x4092f0 <default_tdb_hash>, open_flags = 2}
> (gdb) print/x *tdb
> $4 = {name = 0x0, map_ptr = 0x0, fd = 0x1f, map_size = 0xff,
>   read_only = 0xff00, locked = 0xff00000000, ecode = 0xff00,
>   header = {magic_food = {0xff, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
>       0xff, 0x0, 0x0, 0x0, 0x0, 0xff, 0x0, 0x0, 0xff, 0x0, 0x0, 0x0,
>       0x0, 0x0, 0x0, 0x0, 0xff, 0x0, 0x0, 0x0, 0x0, 0xff, 0x0, 0x0},
>     version = 0xff, hash_size = 0x0, rwlocks = 0xff, reserved = {
>       0xff00, 0xff, 0x0, 0xff, 0xff00, 0xff, 0x0, 0xff, 0xff00,
>       0xff, 0x0, 0xff, 0xff00, 0xff, 0x0, 0xff, 0xff00, 0xff, 0x0,
>       0xff, 0xff00, 0xff, 0x0, 0xff, 0xff00, 0xff, 0x0, 0xff,
>       0xff00, 0xff, 0x0}}, flags = 0x0, travlocks = {
>     next = 0xff0000ff00, off = 0x0, hash = 0xff},
>   next = 0xff0000ff00, device = 0xff00000000, inode = 0xff0000ff00,
>   log_fn = 0x4093b0, hash_fn = 0x4092f0, open_flags = 0x2}
>
> which is consistent.
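Laid over the structure, those values make the earlier observation about talloc_zero concrete. Field names below follow the print output above; the types and layout are only guessed from the old tdb sources, so treat this as an illustration rather than the real definition:

#include <stdint.h>

/* Illustrative start of the tdb context: which fields carry the 0x00/0xff
 * fill and which hold values assigned by tdb_open_ex() after talloc_zero. */
struct tdb_context_sketch {
	char *name;            /* 0x0           - fill pattern                       */
	void *map_ptr;         /* 0x0                                                */
	int fd;                /* 0x1f (31)     - set from open(), after the zeroing */
	uint32_t map_size;     /* 0xff          - fill pattern                       */
	int read_only;         /* 0xff00        - fill pattern                       */
	void *locked;          /* 0xff00000000  - the pointer SAFE_FREE trips over   */
	uint32_t ecode;        /* 0xff00        - fill pattern                       */
	/* struct tdb_header header;  magic_food, version, rwlocks, reserved:
	 * more of the same pattern, as printed above */
	uint32_t flags;        /* 0x0           - assigned after the zeroing         */
	/* travlocks, next, device, inode: fill pattern again */
	void (*log_fn)(void);  /* null_log_fn      - assigned, intact                */
	void (*hash_fn)(void); /* default_tdb_hash - assigned, intact                */
	int open_flags;        /* 0x2 (O_RDWR)     - assigned, intact                */
};

Everything that tdb_open_ex() assigns explicitly after the talloc_zero() survived, while everything that should simply have been zero carries the pattern, which is what prompted the suspicion above that the zeroing itself wrote the pattern.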
>> I would not be surprised about a strange bug in libc or the kernel.
>
> Or even Xen itself, or the h/w.
>
> Ian,

Frediano

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel