
Re: [Xen-devel] Xen crash on S3 resume on 4.13 and unstable if any CPU is re-offlined


  • To: Marek Marczykowski-Górecki <marmarek@xxxxxxxxxxxxxxxxxxxxxx>, xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
  • Date: Sun, 5 Jan 2020 00:42:30 +0000
  • Cc: Michał Kowalczyk <mkow@xxxxxxxxxxxxxxxxxxxxxx>
  • Delivery-date: Sun, 05 Jan 2020 00:43:02 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 04/01/2020 15:30, Marek Marczykowski-Górecki wrote:
> Hi,
>
> I have a reliable crash on resume from S3. I can reproduce it on both
> real hardware and nested within KVM, although call traces are different
> between those platforms. In any case, it happens only if some CPU is to
> be re-offlined after resume (smt=off and/or maxcpus=... options).
>
> I think the crash from the real hardware gives more clues, but the one
> from qemu may also be interesting, maybe it's even another bug?
>
> The crash message (full console log attached):
>
> (XEN) mce_intel.c:772: MCA Capability: firstbank 0, extended MCE MSR 0, BCAST, CMCI
> (XEN) CPU0 CMCI LVT vector (0xf2) already installed
> (XEN) Finishing wakeup from ACPI S3 state.
> (XEN) Enabling non-boot CPUs  ...
> (XEN) ----[ Xen-4.14-unstable  x86_64  debug=y   Not tainted ]----
> (XEN) CPU:    0
> (XEN) RIP:    e008:[<ffff82d08023beb7>] schedule.c#cpu_schedule_callback+0xea/0x1a1
> (XEN) RFLAGS: 0000000000010202   CONTEXT: hypervisor
> (XEN) rax: 0000000000000000   rbx: ffff82d080453348   rcx: ffff82d080584020
> (XEN) rdx: 000000339b66e000   rsi: 0000000000008005   rdi: ffff82d080453340
> (XEN) rbp: ffff8300ca45fd68   rsp: ffff8300ca45fd68   r8:  0000000000000004
> (XEN) r9:  0000000000000000   r10: 0000000000000000   r11: 8000000000000000
> (XEN) r12: ffff82d080453340   r13: ffff82d080453200   r14: 0000000000008005
> (XEN) r15: 0000000000008000   cr0: 000000008005003b   cr4: 00000000000426e0
> (XEN) cr3: 00000000ca44f000   cr2: 0000000000000008
> (XEN) fsb: 000079d5e4f9e740   gsb: ffff888135600000   gss: 0000000000000000
> (XEN) ds: 0018   es: 0010   fs: b800   gs: 0010   ss: 0000   cs: e008
> (XEN) Xen code around <ffff82d08023beb7> (schedule.c#cpu_schedule_callback+0xea/0x1a1):
> (XEN)  48 8b 14 d1 48 8b 04 02 <48> 8b 48 08 48 85 c9 74 64 48 8b 05 b9 c3 32 00
> (XEN) Xen stack trace from rsp=ffff8300ca45fd68:
> (XEN)    ffff8300ca45fdb0 ffff82d080221289 ffff8300ca45fdd8 0000000000000001
> (XEN)    0000000000000000 00000000ffffffef ffff8300ca45fe00 0000000000000001
> (XEN)    0000000000000200 ffff8300ca45fdc8 ffff82d080203476 0000000000000001
> (XEN)    ffff8300ca45fdf0 ffff82d080203550 0000000000000000 0000000000000001
> (XEN)    0000000000000000 ffff8300ca45fe20 ffff82d080203999 ffff8300ca45fef8
> (XEN)    0000000000000000 0000000000000003 00000000000426e0 ffff8300ca45fe58
> (XEN)    ffff82d0802e4240 ffff83042896c5f0 ffff83041bb4d000 0000000000000000
> (XEN)    0000000000000000 ffff83041bb73000 ffff8300ca45fe78 ffff82d08020828f
> (XEN)    ffff83041bb4d1b8 ffff82d080567210 ffff8300ca45fe90 ffff82d08023fd39
> (XEN)    ffff82d080567200 ffff8300ca45fec0 ffff82d08024001a 0000000000000000
> (XEN)    ffff82d080567210 ffff82d08056d980 ffff82d080584020 ffff8300ca45fef0
> (XEN)    ffff82d08027247a ffff83041bbb2000 ffff83041bb4d000 ffff83041bbb3000
> (XEN)    0000000000000000 ffff8300ca45fd98 0000000000000003 ffffffff820ae496
> (XEN)    0000000000000003 0000000000000000 0000000000002003 ffffffff822c6868
> (XEN)    0000000000000246 0000000000003403 00000000ffff0000 0000000000000000
> (XEN)    0000000000000000 ffffffff810010ea 0000000000002003 0000000000000010
> (XEN)    deadbeefdeadf00d 0000010000000000 ffffffff810010ea 000000000000e033
> (XEN)    0000000000000246 ffffc900011abbe8 000000000000e02b 003b4a890045ffe0
> (XEN)    003b4ddf00098fa8 003b4e0300000001 003b499d0045ffe0 0000e01000000000
> (XEN)    ffff83041bbb2000 0000000000000000 00000000000426e0 0000000000000000
> (XEN) Xen call trace:
> (XEN)    [<ffff82d08023beb7>] R schedule.c#cpu_schedule_callback+0xea/0x1a1
> (XEN)    [<ffff82d080221289>] F notifier_call_chain+0x6b/0x96
> (XEN)    [<ffff82d080203476>] F cpu.c#cpu_notifier_call_chain+0x1b/0x33
> (XEN)    [<ffff82d080203550>] F cpu_down+0x5e/0x15c
> (XEN)    [<ffff82d080203999>] F enable_nonboot_cpus+0x113/0x1fb
> (XEN)    [<ffff82d0802e4240>] F power.c#enter_state_helper+0x107/0x51b
> (XEN)    [<ffff82d08020828f>] F domain.c#continue_hypercall_tasklet_handler+0x8b/0xb7
> (XEN)    [<ffff82d08023fd39>] F tasklet.c#do_tasklet_work+0x76/0xa9
> (XEN)    [<ffff82d08024001a>] F do_tasklet+0x58/0x8a
> (XEN)    [<ffff82d08027247a>] F domain.c#idle_loop+0x40/0x96
> (XEN) 
> (XEN) Pagetable walk from 0000000000000008:
> (XEN)  L4[0x000] = 000000041bbff063 ffffffffffffffff
> (XEN)  L3[0x000] = 000000041bbfe063 ffffffffffffffff
> (XEN)  L2[0x000] = 000000041bbfd063 ffffffffffffffff
> (XEN)  L1[0x000] = 0000000000000000 ffffffffffffffff
> (XEN) 
> (XEN) ****************************************
> (XEN) Panic on CPU 0:
> (XEN) FATAL PAGE FAULT
> (XEN) [error_code=0000]
> (XEN) Faulting linear address: 0000000000000008
> (XEN) ****************************************
>
> And the one from qemu:
>
> (XEN) mce_intel.c:772: MCA Capability: firstbank 1, extended MCE MSR 0, SER
> (XEN) Finishing wakeup from ACPI S3 state.
> (XEN) Enabling non-boot CPUs  ...
> (XEN) Assertion 'c2rqd(ops, sched_unit_master(unit)) == svc->rqd' failed at sched_credit2.c:2137
> (XEN) ----[ Xen-4.14-unstable  x86_64  debug=y   Not tainted ]----
> (XEN) CPU:    1
> (XEN) RIP:    e008:[<ffff82d08022fe1a>] sched_credit2.c#csched2_unit_wake+0x174/0x176
> (XEN) RFLAGS: 0000000000010097   CONTEXT: hypervisor (d0v0)
> (XEN) rax: ffff83013a7313e8   rbx: ffff83013a6bdf40   rcx: 0000000000000051
> (XEN) rdx: ffff83013a731160   rsi: ffff83013a7310e0   rdi: 0000000000000003
> (XEN) rbp: ffff83013a6f7d98   rsp: ffff83013a6f7d78   r8:  deadbeefdeadf00d
> (XEN) r9:  deadbeefdeadf00d   r10: 0000000000000000   r11: 0000000000000000
> (XEN) r12: ffff83013a6bc7e0   r13: ffff82d08043e720   r14: 0000000000000003
> (XEN) r15: 00000003c5ffecac   cr0: 0000000080050033   cr4: 0000000000000660
> (XEN) cr3: 000000004b005000   cr2: 0000000000000000
> (XEN) fsb: 00007751649f4740   gsb: ffff888134a00000   gss: 0000000000000000
> (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
> (XEN) Xen code around <ffff82d08022fe1a> (sched_credit2.c#csched2_unit_wake+0x174/0x176):
> (XEN)  ef e8 1e c1 ff ff eb a7 <0f> 0b 55 48 89 e5 41 57 41 56 41 55 41 54 53 48
> (XEN) Xen stack trace from rsp=ffff83013a6f7d78:
> (XEN)    ffff83013a6a3000 ffff83013a6bdf40 ffff83013a6bdf40 ffff83013a7313e8
> (XEN)    ffff83013a6f7de8 ffff82d0802391f8 0000000000000202 ffff83013a7313e8
> (XEN)    ffff83013a6c1018 0000000000000001 0000000000000000 0000000000000000
> (XEN)    ffff83013a6c1018 ffff83013a6a3000 ffff83013a6f7e58 ffff82d08020906c
> (XEN)    ffff82d08035d3d4 ffff82d08035d3c8 ffff82d08035d3d4 ffff82d08035d3c8
> (XEN)    ffff82d08035d3d4 ffff82d08035d3c8 ffff82d08035d3d4 ffff83013a6f7ef8
> (XEN)    0000000000000180 ffff83013a6aa000 deadbeefdeadf00d 0000000000000003
> (XEN)    ffff83013a6f7ee8 ffff82d0803570c7 0000000000000001 0000000000000001
> (XEN)    0000000000000000 deadbeefdeadf00d deadbeefdeadf00d ffff82d08035d3c8
> (XEN)    ffff82d08035d3d4 ffff82d08035d3c8 ffff82d08035d3d4 ffff82d08035d3c8
> (XEN)    ffff82d08035d3d4 ffff83013a6aa000 0000000000000000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 00007cfec59080e7 ffff82d08035d432
> (XEN)    0000000000015120 0000000000000001 0000000000000000 ffff88813024a540
> (XEN)    0000000000000000 0000000000000001 0000000000000246 0000000000140000
> (XEN)    ffff8880bf7db000 ffffea0004be4508 0000000000000018 ffffffff8100130a
> (XEN)    0000000000000000 0000000000000001 0000000000000001 0000010000000000
> (XEN)    ffffffff8100130a 000000000000e033 0000000000000246 ffffc90000c97c98
> (XEN)    000000000000e02b 0000000000000000 0000000000000000 0000000000000000
> (XEN)    0000000000000000 0000e01000000001 ffff83013a6aa000 00000030ba196000
> (XEN)    0000000000000660 0000000000000000 000000013a6e2000 0000040000000000
> (XEN) Xen call trace:
> (XEN)    [<ffff82d08022fe1a>] R sched_credit2.c#csched2_unit_wake+0x174/0x176
> (XEN)    [<ffff82d0802391f8>] F vcpu_wake+0xea/0x4d8
> (XEN)    [<ffff82d08020906c>] F do_vcpu_op+0x36f/0x687
> (XEN)    [<ffff82d0803570c7>] F pv_hypercall+0x28f/0x57d
> (XEN)    [<ffff82d08035d432>] F lstar_enter+0x112/0x120
> (XEN) 
> (XEN) 
> (XEN) ****************************************
> (XEN) Panic on CPU 1:
> (XEN) Assertion 'c2rqd(ops, sched_unit_master(unit)) == svc->rqd' failed at sched_credit2.c:2137
> (XEN) ****************************************

This looks very much like the core scheduling crash found on specific
machines in S5.  From my analysis, it was a use-after-free on a
scheduling resource.
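
For what it's worth, the first trace is at least consistent with a bad
pointer being dereferenced.  A minimal sketch in Python, with the
instruction bytes at the <48> marker and the rax value copied by hand
from the dump above (the variable names are illustrative only), decodes
the faulting instruction by hand:

```python
# Decode the faulting instruction from the first trace by hand.
# Bytes starting at the <48> marker in the "Xen code around" line.
code = bytes.fromhex("488b4808")
rex, opcode, modrm, disp8 = code

# 64-bit register names indexed by their 3-bit encoding (no REX extension).
REGS = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]

assert rex == 0x48      # REX.W: 64-bit operand size, no register extension bits
assert opcode == 0x8B   # MOV r64, r/m64

mod = modrm >> 6        # 0b01: memory operand with an 8-bit displacement
reg = (modrm >> 3) & 7  # destination register
rm = modrm & 7          # base register (4 would mean a SIB byte; not the case here)
assert mod == 0b01 and rm != 4

dest, base = REGS[reg], REGS[rm]

# rax as shown in the register dump of the first trace.
registers = {"rax": 0x0}
fault_address = registers[base] + disp8

print(f"mov {disp8:#x}(%{base}), %{dest} -> faults at {fault_address:#018x}")
```

That is a load from offset 8 of a NULL rax, which matches the reported
faulting linear address (and cr2) of 0000000000000008.  It does not by
itself say why the pointer was NULL.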

Does switching back to thread mode (as opposed to core mode) make the
crash go away?

~Andrew

