Xen project Mailing List

Re: [Xen-devel] xenstored crashes with SIGSEGV

To: Ian Campbell <Ian.Campbell@xxxxxxxxxx>

From: Philipp Hahn <hahn@xxxxxxxxxxxxx>

Date: Fri, 12 Dec 2014 18:20:58 +0100

Delivery-date: Fri, 12 Dec 2014 17:21:27 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

Hello Ian, On 12.12.2014 17:56, Ian Campbell wrote: > On Fri, 2014-12-12 at 17:45 +0100, Philipp Hahn wrote: >> On 12.12.2014 17:32, Ian Campbell wrote: >>> On Fri, 2014-12-12 at 17:14 +0100, Philipp Hahn wrote: >>>> We did enable tracing and now have the xenstored-trace.log of one crash: >>>> It contains 1.6 billion lines and is 83 GiB. >>>> It just shows xenstored to crash on TRANSACTION_START. >>>> >>>> Is there some tool to feed that trace back into a newly launched xenstored? >>> >>> Not that I know of I'm afraid. >> >> Okay, then I have to continue with my own tool. > > If you do end up developing a tool to replay a xenstore trace then I > think that'd be something great to have in tree! I just need to figure out how to talk to xenstored on the wire: for some strange reason xenstored is closing the connection to the UNIX socket on the first write inside a transaction. Or switch to /usr/share/pyshared/xen/xend/xenstore/xstransact.py... >>> Do you get a core dump when this happens? You might need to fiddle with >>> ulimits (some distros disable by default). IIRC there is also some /proc >>> nob which controls where core dumps go on the filesystem. >> >> Not for that specific trace: We first enabled generating core files, but >> only then discovered that this is not enough. > > How wasn't it enough? You mean you couldn't use gdb to extract a > backtrace from the core file? Or was something else wrong? The 1st and 2nd trace look like this: ptr in frame #2 looks very bogus. (gdb) bt full #0 talloc_chunk_from_ptr (ptr=0xff00000000) at talloc.c:116 tc = <value optimized out> #1 0x0000000000407edf in talloc_free (ptr=0xff00000000) at talloc.c:551 tc = <value optimized out> #2 0x000000000040a348 in tdb_open_ex (name=0x1941fb0 "/var/lib/xenstored/tdb.0x1935bb0", hash_size=<value optimized out>, tdb_flags=0, open_flags=<value optimized out>, mode=<value optimized out>, log_fn=0x4093b0 <null_log_fn>, hash_fn=<value optimized out>) at tdb.c:1958 tdb = 0x1921270 st = {st_dev = 17, st_ino = 816913342, st_nlink = 1, st_mode = 33184, st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size = 303104, st_blksize = 4096, st_blocks = 592, st_atim = {tv_sec = 1415748063, tv_nsec = 87562634}, st_mtim = {tv_sec = 1415748063, tv_nsec = 87562634}, st_ctim = { tv_sec = 1415748063, tv_nsec = 87562634}, __unused = {0, 0, 0}} rev = <value optimized out> locked = 4232112 vp = <value optimized out> #3 0x000000000040a684 in tdb_open (name=0xff00000000 <Address 0xff00000000 out of bounds>, hash_size=0, tdb_flags=4254928, open_flags=-1, mode=3119127560) at tdb.c:1773 No locals. #4 0x000000000040a70b in tdb_copy (tdb=0x192e540, outfile=0x1941fb0 "/var/lib/xenstored/tdb.0x1935bb0") at tdb.c:2124 fd = <value optimized out> saved_errno = <value optimized out> copy = 0x0 #5 0x0000000000406c2d in do_transaction_start (conn=0x1939550, in=<value optimized out>) at xenstored_transaction.c:164 trans = 0x1935bb0 exists = <value optimized out> id_str = "\300L\222\001\000\000\000\000\330!@\000\000\000\000\000P\225\223\001" #6 0x00000000004045ca in process_message (conn=0x1939550) at xenstored_core.c:1214 trans = <value optimized out> #7 consider_message (conn=0x1939550) at xenstored_core.c:1261 No locals. #8 handle_input (conn=0x1939550) at xenstored_core.c:1308 bytes = <value optimized out> in = <value optimized out> #9 0x0000000000405170 in main (argc=<value optimized out>, argv=<value optimized out>) at xenstored_core.c:1964 A 3rd trace is somewhere completely different: (gdb) bt #0 0x00007fcbf066088d in _IO_vfprintf_internal (s=0x7fff46ac3010, format=<value optimized out>, ap=0x7fff46ac3170) at vfprintf.c:1617 #1 0x00007fcbf0682732 in _IO_vsnprintf (string=0x7fff46ac318f "", maxlen=<value optimized out>, format=0x40d4a4 "%.*s", args=0x7fff46ac3170) at vsnprintf.c:120 #2 0x000000000040855b in talloc_vasprintf (t=0x17aaf20, fmt=0x40d4a4 "%.*s", ap=0x7fff46ac31d0) at talloc.c:1104 #3 0x0000000000408666 in talloc_asprintf (t=0x1f, fmt=0xffffe938 <Address 0xffffe938 out of bounds>) at talloc.c:1129 #4 0x0000000000403a38 in ask_parents (conn=0x177a1f0, name=0x17aaf20 "/local/domain/0/backend/vif/1/0/accel", perm=XS_PERM_READ) at xenstored_core.c:492 #5 errno_from_parents (conn=0x177a1f0, name=0x17aaf20 "/local/domain/0/backend/vif/1/0/accel", perm=XS_PERM_READ) at xenstored_core.c:516 #6 get_node (conn=0x177a1f0, name=0x17aaf20 "/local/domain/0/backend/vif/1/0/accel", perm=XS_PERM_READ) at xenstored_core.c:543 #7 0x000000000040481d in do_read (conn=0x177a1f0) at xenstored_core.c:744 #8 process_message (conn=0x177a1f0) at xenstored_core.c:1178 #9 consider_message (conn=0x177a1f0) at xenstored_core.c:1261 #10 handle_input (conn=0x177a1f0) at xenstored_core.c:1308 #11 0x0000000000405170 in main (argc=<value optimized out>, argv=<value optimized out>) at xenstored_core.c:1964 >> It might be interesting to see what happens if you preserve the db and >> reboot arranging for the new xenstored to start with the old file. If >> the corruption is part of the file then maybe it can be induced to crash >> again more quickly. > > Thanks for the pointer, will try. Didn't crash immediately. Now running /usr/share/pyshared/xen/xend/xenstore/tests/stress_xs.py for the weekend. Thanks again. Philipp _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx http://lists.xen.org/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.