
Re: [Xen-devel] [xen-unstable test] 145796: tolerable FAIL - PUSHED



(+ Meng)

Hi,

Sorry I forgot to cc the RTDS scheduler maintainer.

On 10/01/2020 18:24, Julien Grall wrote:
Hi all,

On 08/01/2020 23:14, Julien Grall wrote:
On Wed, 8 Jan 2020 at 21:40, osstest service owner
<osstest-admin@xxxxxxxxxxxxxx> wrote:

flight 145796 xen-unstable real [real]
http://logs.test-lab.xenproject.org/osstest/logs/145796/

Failures :-/ but no regressions.

Tests which are failing intermittently (not blocking):
  test-amd64-amd64-xl-rtds                          15 guest-saverestore            fail in 145773 pass in 145796
  test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 16 guest-start/debianhvm.repeat fail in 145773 pass in 145796
  test-armhf-armhf-xl-rtds                          12 guest-start                  fail in 145773 pass in 145796

It looks like this test has been failing for a while (although not reliably).
I looked at a few flights; the cause seems to be the same:

Jan  8 15:02:14.700784 (XEN) Assertion '!unit_on_replq(svc)' failed at sched_rt.c:586
Jan  8 15:02:26.715030 (XEN) ----[ Xen-4.14-unstable  arm32  debug=y  Not tainted ]----
Jan  8 15:02:26.720756 (XEN) CPU:    1
Jan  8 15:02:26.722158 (XEN) PC:     0023a750 common/sched_rt.c#replq_insert+0x7c/0xcc
Jan  8 15:02:26.727851 (XEN) CPSR:   200300da MODE:Hypervisor
Jan  8 15:02:26.731334 (XEN)      R0: 002a51a4 R1: 400614a0 R2: 3d64b900 R3: 40061338
Jan  8 15:02:26.736830 (XEN)      R4: 400614a0 R5: 002a51a4 R6: 3cf1cbf0 R7: 000001cb
Jan  8 15:02:26.742600 (XEN)      R8: 4003d1b0 R9: 400614a8 R10:4003d1b0 R11:400ffe54 R12:400ffde4
Jan  8 15:02:26.749119 (XEN) HYP: SP: 400ffe2c LR: 0023b6e8
Jan  8 15:02:26.752296 (XEN)
Jan  8 15:02:26.753036 (XEN)   VTCR_EL2: 80003558
Jan  8 15:02:26.755479 (XEN)  VTTBR_EL2: 00020000bbff4000
Jan  8 15:02:26.758757 (XEN)
Jan  8 15:02:26.759366 (XEN)  SCTLR_EL2: 30cd187f
Jan  8 15:02:26.761755 (XEN)    HCR_EL2: 0078663f
Jan  8 15:02:26.764250 (XEN)  TTBR0_EL2: 00000000bc029000
Jan  8 15:02:26.767364 (XEN)
Jan  8 15:02:26.767980 (XEN)    ESR_EL2: 00000000
Jan  8 15:02:26.770485 (XEN)  HPFAR_EL2: 00030010
Jan  8 15:02:26.772795 (XEN)      HDFAR: e0800f00
Jan  8 15:02:26.775272 (XEN)      HIFAR: c0605744
Jan  8 15:02:26.777748 (XEN)
Jan  8 15:02:26.778505 (XEN) Xen stack trace from sp=400ffe2c:
Jan  8 15:02:26.781910 (XEN)    00000000 3cf1cbf0 400614a0 002a51a4 3cf1cbf0 000001cb 4003d1b0 6003005a
Jan  8 15:02:26.788991 (XEN)    400613f8 400ffe7c 0023b6e8 002f9300 4004c000 400613f8 3cf1cbf0 000001cb
Jan  8 15:02:26.796093 (XEN)    4003d1b0 6003005a 400613f8 400ffeac 00242988 4004c000 002425ac 40058000
Jan  8 15:02:26.803237 (XEN)    4004c000 4004f000 10f45000 10f45008 4004b080 40058000 60030013 400ffebc
Jan  8 15:02:26.810360 (XEN)    00209984 00000002 4004f000 400ffedc 0020eddc 0020caf8 db097cd4 00000020
Jan  8 15:02:26.817504 (XEN)    c13afbec 00000000 db15fd68 400ffee4 0020c9dc 400fff34 0020d5e8 4004e000
Jan  8 15:02:26.824615 (XEN)    00000000 400fff44 400fff44 00000002 00000000 4004e8fa 4004e8f4 400fff1c
Jan  8 15:02:26.831737 (XEN)    400fff1c 6003005a 0020caf8 400fff58 00000020 c13afbec 00000000 db15fd68
Jan  8 15:02:26.838798 (XEN)    60030013 400fff54 0026c150 c1204d08 c13afbec 00000000 00000000 00000000
Jan  8 15:02:26.845877 (XEN)    00000002 400fff58 002753b0 00000009 db097cd4 db173008 00000002 c1204d08
Jan  8 15:02:26.852986 (XEN)    00000000 00000002 c13afbec 00000000 db15fd68 60030013 db15fd3c 00000020
Jan  8 15:02:26.860044 (XEN)    ffffffff b6cdccb3 c0107ed0 a0030093 4a000ea1 be951568 c136edc0 c010d3a0
Jan  8 15:02:26.867171 (XEN)    db097cd0 c056c7f8 c136edcc c010d720 c136edd8 c010d7e0 00000000 00000000
Jan  8 15:02:26.874526 (XEN)    00000000 00000000 00000000 c136ede4 c136ede4 00030030 60070193 80030093
Jan  8 15:02:26.881450 (XEN)    60030193 00000000 00000000 00000000 00000001
Jan  8 15:02:26.886519 (XEN) Xen call trace:
Jan  8 15:02:26.888168 (XEN)    [<0023a750>] common/sched_rt.c#replq_insert+0x7c/0xcc (PC)
Jan  8 15:02:26.894240 (XEN)    [<0023b6e8>] common/sched_rt.c#rt_unit_wake+0xf4/0x274 (LR)
Jan  8 15:02:26.900246 (XEN)    [<0023b6e8>] common/sched_rt.c#rt_unit_wake+0xf4/0x274
Jan  8 15:02:26.905775 (XEN)    [<00242988>] vcpu_wake+0x1e4/0x688
Jan  8 15:02:26.909743 (XEN)    [<00209984>] domain_unpause+0x64/0x84
Jan  8 15:02:26.913956 (XEN)    [<0020eddc>] common/event_fifo.c#evtchn_fifo_unmask+0xd8/0xf0
Jan  8 15:02:26.920167 (XEN)    [<0020c9dc>] evtchn_unmask+0x7c/0xc0
Jan  8 15:02:26.924173 (XEN)    [<0020d5e8>] do_event_channel_op+0xaf0/0xdac
Jan  8 15:02:26.928922 (XEN)    [<0026c150>] do_trap_guest_sync+0x350/0x4d0
Jan  8 15:02:26.933647 (XEN)    [<002753b0>] entry.o#return_from_trap+0/0x4
Jan  8 15:02:26.938299 (XEN)
Jan  8 15:02:26.939039 (XEN)
Jan  8 15:02:26.939668 (XEN) ****************************************
Jan  8 15:02:26.943794 (XEN) Panic on CPU 1:
Jan  8 15:02:26.945872 (XEN) Assertion '!unit_on_replq(svc)' failed at sched_rt.c:586
Jan  8 15:02:26.951492 (XEN) ****************************************

I believe the domain_unpause() is coming from guest_clear_bit(). This
would mean the guest atomic operation did not succeed, so the domain had
to be paused to complete it. This makes sense as, per the log:

  CPU1: Guest atomics will try 1 times before pausing the domain
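
For context, the Arm guest-atomics helpers follow a try-then-pause pattern,
roughly like the sketch below (paraphrased from memory, not the exact code in
xen/arch/arm/guest_atomics.c; clear_bit_bounded() is a stand-in name for the
real bounded helper):

/*
 * Rough sketch of the guest_clear_bit() fallback path on Arm
 * (paraphrased; clear_bit_bounded() is a hypothetical stand-in):
 * retry the exclusive-based atomic a bounded number of times and,
 * if the guest keeps interfering, pause the domain and do the
 * update non-atomically.
 */
static bool guest_clear_bit_sketch(struct domain *d, int nr, volatile void *p)
{
    unsigned int max_try = 1;   /* "will try 1 times" per the log above */

    /* Bounded attempt using exclusives (stand-in helper name). */
    if ( clear_bit_bounded(nr, p, max_try) )
        return true;

    /* Fallback: pause the whole domain so a plain write cannot race. */
    domain_pause_nosync(d);
    clear_bit(nr, p);
    domain_unpause(d);          /* the domain_unpause() in the trace above */

    return true;
}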

I am under the impression that the crash could be reproduced with just:

domain_pause_nosync(current->domain);
domain_unpause(current->domain);

Any insight into what's wrong? I am happy to try to reproduce it tomorrow morning.

So I managed to reproduce it on Arm by hacking the hypercall path to call:

domain_pause_nosync(current->domain);
domain_unpause(current->domain);

With a debug build and a 2-vCPU dom0, the crash happens within a few seconds. When the unit is not currently scheduled, rt_unit_wake() expects the unit not to be on any of the queues.
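
For reference, the assertion that fires sits at the top of replq_insert()
(sched_rt.c:586 in this build); simplified sketch, not the verbatim code:

static void
replq_insert(const struct scheduler *ops, struct rt_unit *svc)
{
    /*
     * The caller must not pass a unit that is already queued for
     * replenishment; this is the check reported in the trace above.
     */
    ASSERT( !unit_on_replq(svc) );

    /*
     * ... insert svc into the replenishment queue, ordered by its
     * next replenishment time ...
     */
}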

The interaction is as follows:

CPU0                                   | CPU1
                                       |
do_domain_pause()                      |
  -> atomic_inc(&d->pause_count)       |
  -> vcpu_sleep_nosync(vCPU A)         | schedule()
                                       |   -> Lock
                                       |   -> rt_schedule()
                                       |      -> snext = runq_pick(...)
                                       |         /* returns unit A (aka vCPU A) */
                                       |         /* Unit is not runnable */
                                       |      -> Remove from the runq
                                       |   [....]
                                       |   -> Unlock
     -> Lock                           |
     -> rt_unit_sleep()                |
        /* Unit not scheduled */       |
        /* Nothing to do */            |

Note that on Arm, each vCPU has its own scheduling unit.

When schedule() grabs the lock first (as shown above), the unit is only removed from the runq. However, when vcpu_sleep_nosync() grabs the lock first and the unit is not scheduled, rt_unit_sleep() removes the unit from both queues (runq/depletedq and replenishment queue); see the sketch below.
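
A trimmed sketch of that sleep path, from memory (not the verbatim code;
the RTDS_delayed_runq_add handling is elided):

static void
rt_unit_sleep(const struct scheduler *ops, struct sched_unit *unit)
{
    struct rt_unit * const svc = rt_unit(unit);

    if ( curr_on_cpu(sched_unit_master(unit)) == unit )
        /* Currently scheduled: just ask the pCPU to reschedule. */
        cpu_raise_softirq(sched_unit_master(unit), SCHEDULE_SOFTIRQ);
    else if ( unit_on_q(svc) )
    {
        /* Not scheduled: taken off *both* queues. */
        q_remove(svc);          /* runq/depletedq */
        replq_remove(ops, svc); /* replenishment queue */
    }
    /* ... RTDS_delayed_runq_add handling elided ... */
}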

So I think we want schedule() to remove the unit from both queues when it is not runnable (see the sketch below). Any opinions?
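
To make that concrete, the change could look something like this inside
rt_schedule() (illustrative only, not a tested patch; the exact placement
would need checking):

    /*
     * When the picked unit turns out not to be runnable, drop it from
     * the replenishment queue as well, so that a later rt_unit_wake()
     * -> replq_insert() does not trip ASSERT(!unit_on_replq(svc)).
     */
    if ( !unit_runnable(snext->unit) )
    {
        q_remove(snext);           /* what happens today */
        replq_remove(ops, snext);  /* proposed addition */
    }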

Cheers,


--
Julien Grall
