
Re: [Xen-devel] Race condition with scheduler runqueues


  • To: Jan Beulich <JBeulich@xxxxxxxx>
  • From: Juergen Gross <juergen.gross@xxxxxxxxxxxxxx>
  • Date: Wed, 27 Feb 2013 07:28:11 +0100
  • Cc: George Dunlap <George.Dunlap@xxxxxxxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Keir Fraser <keir@xxxxxxx>, Xen-devel List <xen-devel@xxxxxxxxxxxxx>
  • Delivery-date: Wed, 27 Feb 2013 06:29:28 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xen.org>

On 27.02.2013 07:19, Juergen Gross wrote:
On 19.02.2013 10:28, Jan Beulich wrote:
On 18.02.13 at 19:11, Andrew Cooper <andrew.cooper3@xxxxxxxxxx> wrote:
Hello,

Our testing has discovered a crash (pagefault at 0x0000000000000008)
which I have tracked down to a bad __runq_remove() in csched_vcpu_sleep()
in sched_credit.c (a static function of the same name also exists in
sched_credit2.c, which confused matters to start with).

The test case was a loop of localhost migrations of a 1-vcpu HVM Win8
domain. The test case itself has passed many times in the past on the
same Xen codebase (Xen 4.1.3), indicating that the crash is very rare.
There do not appear to be any relevant changes between the version of
Xen in the test and xen-4.1-testing.

The failure itself is caused by a XEN_DOMCTL_scheduler_op (trace below)
from dom0, targeting the VCPU of the migrating domain.

(XEN) Xen call trace:
(XEN)    [<ffff82c480116a14>] csched_vcpu_sleep+0x44/0x70
(XEN)    [<ffff82c480120117>] vcpu_sleep_nosync+0xe7/0x3b0
(XEN)    [<ffff82c4801203e9>] vcpu_sleep_sync+0x9/0x50
(XEN)    [<ffff82c48011fd4c>] sched_adjust+0xac/0x230
(XEN)    [<ffff82c480102bc1>] do_domctl+0x731/0x1130
(XEN)    [<ffff82c4802013c4>] compat_hypercall+0x74/0x80

The relevant part of csched_vcpu_sleep() is

    else if ( __vcpu_on_runq(svc) )
        __runq_remove(svc);

which disassembles to

    ffff82c480116a01:  49 8b 10        mov    (%r8),%rdx
    ffff82c480116a04:  4c 39 c2        cmp    %r8,%rdx
    ffff82c480116a07:  75 07           jne    ffff82c480116a10 <csched_vcpu_sleep+0x40>
    ffff82c480116a09:  f3 c3           repz retq
    ffff82c480116a0b:  0f 1f 44 00 00  nopl   0x0(%rax,%rax,1)
    ffff82c480116a10:  49 8b 40 08     mov    0x8(%r8),%rax
    ffff82c480116a14:  48 89 42 08     mov    %rax,0x8(%rdx)   # <- Pagefault here
    ffff82c480116a18:  48 89 10        mov    %rdx,(%rax)
    ffff82c480116a1b:  4d 89 40 08     mov    %r8,0x8(%r8)
    ffff82c480116a1f:  4d 89 00        mov    %r8,(%r8)

The relevant crash registers from the pagefault are:

    rax: 0000000000000000
    rdx: 0000000000000000
    r8:  ffff83080c89ed90

If I am reading the code correctly, this means that runq_elem->next is
NULL, so list_empty() reports the element as non-empty and
__vcpu_on_runq() erroneously passes. We then take the fault when the
list removal writes next->prev through that NULL pointer (offset 8,
matching the faulting address); prev is NULL as well.
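For reference, these helpers reduce to the Linux-style list primitives
Xen uses. The following is a minimal self-contained sketch (not the
verbatim Xen source) of why a NULL next pointer passes the on-runq
check and then faults at offset 8:

    struct list_head {
        struct list_head *next, *prev;
    };

    /* An initialised, empty element points at itself. */
    static inline int list_empty(const struct list_head *head)
    {
        return head->next == head;
    }

    /* Unlink and re-initialise -- matching the disassembly above. */
    static inline void list_del_init(struct list_head *entry)
    {
        entry->next->prev = entry->prev; /* next == NULL => write to 0x8 */
        entry->prev->next = entry->next;
        entry->next = entry;
        entry->prev = entry;
    }

    /* sched_credit.c wraps these (sketch, not the verbatim wrappers):
     *   __vcpu_on_runq(svc)  ~  !list_empty(&svc->runq_elem)
     *   __runq_remove(svc)   ~  list_del_init(&svc->runq_elem)
     */

With runq_elem zeroed, next != head, so the element looks queued; the
first store in list_del_init() then dereferences the NULL next pointer
at offset 8, which is exactly the faulting address and the faulting
instruction (mov %rax,0x8(%rdx)) above.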

The only place I can spot in the code where runq_elem->{next,prev}
could conceivably be NULL is in csched_alloc_vdata(), between the
memset() and the INIT_LIST_HEAD(). This is logically consistent with
the localhost migrate loop, and I can't immediately see anything to
prevent this race from happening.
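Schematically, the suspected window looks like this (a simplified
sketch of the shape of csched_alloc_vdata(), not quoted verbatim from
the source):

    static void *
    csched_alloc_vdata(const struct scheduler *ops, struct vcpu *vc, void *dd)
    {
        struct csched_vcpu *svc;

        svc = xmalloc(struct csched_vcpu);
        if ( svc == NULL )
            return NULL;

        memset(svc, 0, sizeof(*svc));
        /* <-- window: runq_elem.next/prev are both NULL here; anyone
         *     already holding a pointer to svc would pass
         *     __vcpu_on_runq() and fault in __runq_remove() */
        INIT_LIST_HEAD(&svc->runq_elem);
        /* ... remaining initialisation ... */
        return svc;
    }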

But that doesn't make sense: csched_alloc_vdata() doesn't store
svc into vc->sched_priv; that is done by the generic scheduler
code once the hook returns.

So I'd rather suspect a stale pointer being used, which is easily
possible when racing with sched_move_domain() (as opposed to
schedule_cpu_switch(), where the new pointer gets stored
_before_ de-allocating the old one).
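Schematically (names assumed for illustration, not quoted from the
source), the difference in ordering is:

    /* schedule_cpu_switch() style: publish the new pointer first, so
     * a racing reader of v->sched_priv never sees freed memory. */
    vpriv_old = v->sched_priv;
    v->sched_priv = vpriv_new;
    SCHED_OP(old_ops, free_vdata, vpriv_old);

    /* A move that frees first leaves a window in which v->sched_priv
     * is stale -- the suspected source of the bad svc pointer. */
    SCHED_OP(old_ops, free_vdata, v->sched_priv);
    v->sched_priv = vpriv_new;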

However, sched_move_domain() (as well as schedule_cpu_switch())
gets called only from the CPU pools code; I would guess CPU pools
aren't involved here, and you aren't soft offlining/onlining pCPUs
in parallel (as I'm sure you would otherwise have mentioned).

But wait - libxl__domain_make() _unconditionally_ calls
xc_cpupool_movedomain(), as does XendDomainInfo's
_constructDomain(). The reason for this escapes me - Jürgen? Still,
I'd expect the pool-ID matching check to short-circuit the resulting
sysctl, i.e. sched_move_domain() ought not to be reached anyway
(worth verifying, of course).
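The expected short-circuit would have roughly this shape (hypothetical
names and placement, for illustration only):

    /* In the cpupool move path: nothing to do if the domain is
     * already in the target pool, so sched_move_domain() is never
     * reached for the common libxl/xend case. */
    if ( d->cpupool->cpupool_id == op->cpupool_id )
        return 0;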

The race there nevertheless ought to be fixed.

Something like the attached patch?

Not tested thoroughly yet.

Argh. Sent an old version, sorry. This one should be better.

Juergen

--
Juergen Gross                 Principal Developer Operating Systems
PBG PDG ES&S SWE OS6                   Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@xxxxxxxxxxxxxx
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html

Attachment: movedom.patch
Description: Text Data

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 

