[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH] xen/sched: fix sched_move_domain()
- To: Henry Wang <Henry.Wang@xxxxxxx>
- From: Juergen Gross <jgross@xxxxxxxx>
- Date: Thu, 19 Oct 2023 13:38:41 +0200
- Authentication-results: smtp-out2.suse.de; none
- Autocrypt: addr=jgross@xxxxxxxx; keydata= xsBNBFOMcBYBCACgGjqjoGvbEouQZw/ToiBg9W98AlM2QHV+iNHsEs7kxWhKMjrioyspZKOB ycWxw3ie3j9uvg9EOB3aN4xiTv4qbnGiTr3oJhkB1gsb6ToJQZ8uxGq2kaV2KL9650I1SJve dYm8Of8Zd621lSmoKOwlNClALZNew72NjJLEzTalU1OdT7/i1TXkH09XSSI8mEQ/ouNcMvIJ NwQpd369y9bfIhWUiVXEK7MlRgUG6MvIj6Y3Am/BBLUVbDa4+gmzDC9ezlZkTZG2t14zWPvx XP3FAp2pkW0xqG7/377qptDmrk42GlSKN4z76ELnLxussxc7I2hx18NUcbP8+uty4bMxABEB AAHNH0p1ZXJnZW4gR3Jvc3MgPGpncm9zc0BzdXNlLmNvbT7CwHkEEwECACMFAlOMcK8CGwMH CwkIBwMCAQYVCAIJCgsEFgIDAQIeAQIXgAAKCRCw3p3WKL8TL8eZB/9G0juS/kDY9LhEXseh mE9U+iA1VsLhgDqVbsOtZ/S14LRFHczNd/Lqkn7souCSoyWsBs3/wO+OjPvxf7m+Ef+sMtr0 G5lCWEWa9wa0IXx5HRPW/ScL+e4AVUbL7rurYMfwCzco+7TfjhMEOkC+va5gzi1KrErgNRHH kg3PhlnRY0Udyqx++UYkAsN4TQuEhNN32MvN0Np3WlBJOgKcuXpIElmMM5f1BBzJSKBkW0Jc Wy3h2Wy912vHKpPV/Xv7ZwVJ27v7KcuZcErtptDevAljxJtE7aJG6WiBzm+v9EswyWxwMCIO RoVBYuiocc51872tRGywc03xaQydB+9R7BHPzsBNBFOMcBYBCADLMfoA44MwGOB9YT1V4KCy vAfd7E0BTfaAurbG+Olacciz3yd09QOmejFZC6AnoykydyvTFLAWYcSCdISMr88COmmCbJzn sHAogjexXiif6ANUUlHpjxlHCCcELmZUzomNDnEOTxZFeWMTFF9Rf2k2F0Tl4E5kmsNGgtSa aMO0rNZoOEiD/7UfPP3dfh8JCQ1VtUUsQtT1sxos8Eb/HmriJhnaTZ7Hp3jtgTVkV0ybpgFg w6WMaRkrBh17mV0z2ajjmabB7SJxcouSkR0hcpNl4oM74d2/VqoW4BxxxOD1FcNCObCELfIS auZx+XT6s+CE7Qi/c44ibBMR7hyjdzWbABEBAAHCwF8EGAECAAkFAlOMcBYCGwwACgkQsN6d 1ii/Ey9D+Af/WFr3q+bg/8v5tCknCtn92d5lyYTBNt7xgWzDZX8G6/pngzKyWfedArllp0Pn fgIXtMNV+3t8Li1Tg843EXkP7+2+CQ98MB8XvvPLYAfW8nNDV85TyVgWlldNcgdv7nn1Sq8g HwB2BHdIAkYce3hEoDQXt/mKlgEGsLpzJcnLKimtPXQQy9TxUaLBe9PInPd+Ohix0XOlY+Uk QFEx50Ki3rSDl2Zt2tnkNYKUCvTJq7jvOlaPd6d/W0tZqpyy7KVay+K4aMobDsodB3dvEAs6 ScCnh03dDAFgIq5nsB11j3KPKdVoPlfucX2c7kGNH+LUMbzqV6beIENfNexkOfxHfw==
- Cc: "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxx>, Dario Faggioli <dfaggioli@xxxxxxxx>
- Delivery-date: Thu, 19 Oct 2023 11:38:47 +0000
- List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
On 19.10.23 13:31, Henry Wang wrote:
Hi Juergen,
On Oct 19, 2023, at 19:23, Juergen Gross <jgross@xxxxxxxx> wrote:
When moving a domain out of a cpupool running with the credit2
scheduler and having multiple run-queues, the following ASSERT() can
be observed:
(XEN) Xen call trace:
(XEN) [<ffff82d04023a700>] R credit2.c#csched2_unit_remove+0xe3/0xe7
(XEN) [<ffff82d040246adb>] S sched_move_domain+0x2f3/0x5b1
(XEN) [<ffff82d040234cf7>] S cpupool.c#cpupool_move_domain_locked+0x1d/0x3b
(XEN) [<ffff82d040236025>] S cpupool_move_domain+0x24/0x35
(XEN) [<ffff82d040206513>] S domain_kill+0xa5/0x116
(XEN) [<ffff82d040232b12>] S do_domctl+0xe5f/0x1951
(XEN) [<ffff82d0402276ba>] S timer.c#timer_lock+0x69/0x143
(XEN) [<ffff82d0402dc71b>] S pv_hypercall+0x44e/0x4a9
(XEN) [<ffff82d0402012b7>] S lstar_enter+0x137/0x140
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 1:
(XEN) Assertion 'svc->rqd == c2rqd(sched_unit_master(unit))' failed at
common/sched/credit2.c:1159
(XEN) ****************************************
This is happening as sched_move_domain() is setting a different cpu
for a scheduling unit without telling the scheduler. When this unit is
removed from the scheduler, the ASSERT() will trigger.
In non-debug builds the result is usually a clobbered pointer, leading
to another crash a short time later.
Fix that by swapping the two involved actions (setting another cpu and
removing the unit from the scheduler).
Cc: Henry Wang <Henry.Wang@xxxxxxx>
Emmm, I think ^ this CC is better to me moved to the scissors line, otherwise
if this patch is committed, this line will be shown in the commit message...
Fixes: 70fadc41635b ("xen/cpupool: support moving domain between cpupools with
different granularity")
Signed-off-by: Juergen Gross <jgross@xxxxxxxx>
---
This fixes a regression introduced in Xen 4.15. The fix is very simple
and it will affect only configurations with multiple cpupools. I think
whether to include it in 4.18 should be decided by the release manager
based on the current state of the release (I think I wouldn't have
added it that late in the release while being the release manager).
Thanks for the reminder :)
Please correct me if I am wrong, if this is fixing the regression introduced in
4.15, shouldn’t this patch being backported to 4.15, 4.16, 4.17 and soon
4.18? So honestly I think at least for 4.18 either add this patch now or
later won’t make much difference…I am ok either way I guess.
You are right, the patch needs to be backported.
OTOH nobody noticed the regression until a few days ago. So delaying the fix
for a few weeks would probably not hurt too many people. Each patch changing
code is a risk to introduce another regression, so its your decision whether
you want to take this risk. Especially changes like this one touching the
core scheduling code always have a latent risk to open up a small race window
(this could be said for the original patch I'm fixing here, too :-) ).
Juergen
Attachment:
OpenPGP_0xB0DE9DD628BF132F.asc
Description: OpenPGP public key
Attachment:
OpenPGP_signature.asc
Description: OpenPGP digital signature
|