
Re: Xen panic when shutting down ARINC653 cpupool


  • To: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, "Choi, Anderson" <Anderson.Choi@xxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • From: Jürgen Groß <jgross@xxxxxxxx>
  • Date: Mon, 17 Mar 2025 15:41:59 +0100
  • Cc: "nathan.studer@xxxxxxxxxxxxxxx" <nathan.studer@xxxxxxxxxxxxxxx>, "stewart@xxxxxxx" <stewart@xxxxxxx>, "Weber (US), Matthew L" <matthew.l.weber3@xxxxxxxxxx>, "Whitehead (US), Joshua C" <joshua.c.whitehead@xxxxxxxxxx>
  • Delivery-date: Mon, 17 Mar 2025 14:42:08 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 17.03.25 14:29, Andrew Cooper wrote:
On 17/03/2025 1:21 pm, Choi, Anderson wrote:
Jürgen,

On 17.03.25 06:07, Choi, Anderson wrote:
I'd like to report a Xen panic when shutting down an ARINC653 domain
with the following setup. Note that this is only observed when
CONFIG_DEBUG is enabled.

[Test environment]
Yocto release : 5.05
Xen release : 4.19 (hash = 026c9fa29716b0ff0f8b7c687908e71ba29cf239)
Target machine : QEMU ARM64
Number of physical CPUs : 4

[Xen config]
CONFIG_DEBUG = y

[CPU pool configuration files]
cpupool_arinc0.cfg
- name= "Pool-arinc0"
- sched="arinc653"
- cpus=["2"]

[Domain configuration file]
dom1.cfg
- vcpus = 1
- pool = "Pool-arinc0"

$ xl cpupool-cpu-remove Pool-0 2
$ xl cpupool-create -f cpupool_arinc0.cfg
$ xl create dom1.cfg
$ a653_sched -P Pool-arinc0 dom1:100

** Wait for DOM1 to complete boot.**

$ xl shutdown dom1

[xen log]
root@boeing-linux-ref:~# xl shutdown dom1
Shutting down domain 1
root@boeing-linux-ref:~# (XEN) Assertion '!in_irq() && (local_irq_is_enabled() || num_online_cpus() <= 1)' failed at common/xmalloc_tlsf.c:714
(XEN) ----[ Xen-4.19.1-pre  arm64  debug=y  Tainted: I      ]----
(XEN) CPU:    2
(XEN) PC:     00000a000022d2b0 xfree+0x130/0x1a4
(XEN) LR:     00000a000022d2a4
(XEN) SP:     00008000fff77b50
(XEN) CPSR:   00000000200002c9 MODE:64-bit EL2h (Hypervisor, handler)
...
(XEN) Xen call trace:
(XEN)    [<00000a000022d2b0>] xfree+0x130/0x1a4 (PC)
(XEN)    [<00000a000022d2a4>] xfree+0x124/0x1a4 (LR)
(XEN)    [<00000a00002321f0>] arinc653.c#a653sched_free_udata+0x50/0xc4
(XEN)    [<00000a0000241bc0>] core.c#sched_move_domain_cleanup+0x5c/0x80
(XEN)    [<00000a0000245328>] sched_move_domain+0x69c/0x70c
(XEN)    [<00000a000022f840>] cpupool.c#cpupool_move_domain_locked+0x38/0x70
(XEN)    [<00000a0000230f20>] cpupool_move_domain+0x34/0x54
(XEN)    [<00000a0000206c40>] domain_kill+0xc0/0x15c
(XEN)    [<00000a000022e0d4>] do_domctl+0x904/0x12ec
(XEN)    [<00000a0000277a1c>] traps.c#do_trap_hypercall+0x1f4/0x288
(XEN)    [<00000a0000279018>] do_trap_guest_sync+0x448/0x63c
(XEN)    [<00000a0000262c80>] entry.o#guest_sync_slowpath+0xa8/0xd8
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 2:
(XEN) Assertion '!in_irq() && (local_irq_is_enabled() || num_online_cpus() <= 1)' failed at common/xmalloc_tlsf.c:714
(XEN) ****************************************
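
To make the failing condition easier to read, the predicate quoted in the panic message can be modelled as a small C function. This is only an illustrative sketch: `struct cpu_state`, `xfree_allowed()` and `model_xfree()` are hypothetical names invented here, not Xen code, and the three fields merely stand in for what `in_irq()`, `local_irq_is_enabled()` and `num_online_cpus()` would report.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the CPU state the assertion inspects. */
struct cpu_state {
    bool in_irq;          /* currently inside an interrupt handler */
    bool irqs_enabled;    /* local interrupts are unmasked */
    unsigned int online;  /* number of online CPUs */
};

/* Mirrors the condition printed in the panic message. */
static bool xfree_allowed(const struct cpu_state *s)
{
    return !s->in_irq && (s->irqs_enabled || s->online <= 1);
}

/* Model of a debug build: abort if the context is not allowed. */
static void model_xfree(const struct cpu_state *s)
{
    assert(xfree_allowed(s));
}
```

With the reported setup (4 CPUs, not in an interrupt handler), the predicate is false while interrupts are disabled and true once they are enabled again, which would be consistent with the panic appearing only when the free happens inside an IRQ-disabling critical section.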

In commit 19049f8d (sched: fix locking in a653sched_free_vdata()),
locking was introduced to prevent a race against the list manipulation,
but it leads to an assertion failure when the ARINC 653 domain is shut
down, because xfree() is now called with interrupts disabled.
I think this can be fixed by calling xfree() after
spin_unlock_irqrestore(), as shown below.

 xen/common/sched/arinc653.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/xen/common/sched/arinc653.c b/xen/common/sched/arinc653.c
index 7bf288264c..1615f1bc46 100644
--- a/xen/common/sched/arinc653.c
+++ b/xen/common/sched/arinc653.c
@@ -463,10 +463,11 @@ a653sched_free_udata(const struct scheduler *ops, void *priv)
     if ( !is_idle_unit(av->unit) )
         list_del(&av->list);
-    xfree(av);
     update_schedule_units(ops);
     spin_unlock_irqrestore(&sched_priv->lock, flags);
+
+    xfree(av);
 }

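
The general pattern behind the patch (unlink the object while holding the lock, release the lock, then free the object) can be sketched in user space. The example below is an analogy under stated assumptions, not the scheduler code: it uses a POSIX mutex instead of an IRQ-disabling spinlock, and `struct unit`, `struct sched_state`, `unit_add()` and `unit_remove()` are hypothetical names made up for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical unit record kept on a singly linked list. */
struct unit {
    int id;
    struct unit *next;
};

/* Hypothetical scheduler state: the lock guards head and count. */
struct sched_state {
    pthread_mutex_t lock;
    struct unit *head;
    unsigned int count;
};

/* Allocate a unit and push it onto the list; NULL on allocation failure. */
static struct unit *unit_add(struct sched_state *s, int id)
{
    struct unit *u = malloc(sizeof(*u));

    if (u == NULL)
        return NULL;
    u->id = id;

    pthread_mutex_lock(&s->lock);
    u->next = s->head;
    s->head = u;
    s->count++;
    pthread_mutex_unlock(&s->lock);

    return u;
}

/* Unlink under the lock, free only after the lock is dropped.
 * Assumes u is currently on the list. */
static void unit_remove(struct sched_state *s, struct unit *u)
{
    assert(u != NULL);

    pthread_mutex_lock(&s->lock);
    for (struct unit **pp = &s->head; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == u) {
            *pp = u->next;   /* list manipulation stays protected */
            s->count--;
            break;
        }
    }
    pthread_mutex_unlock(&s->lock);

    /* u is no longer reachable through the list, so freeing it
     * does not need the lock. */
    free(u);
}
```

Once the object has been unlinked, no other list walker can reach it, so the free itself needs no protection; keeping it outside the critical section also shortens the time the lock is held.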
Can I hear your opinion on this?
Yes, this seems the right way to fix the issue.

Could you please send a proper patch (please have a look at [1] in
case you are unsure what a proper patch should look like)?

Juergen

[1] http://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/process/sending-patches.pandoc

Thanks for your opinion. Let me read through the link and submit the patch.

Other good references are:

https://lore.kernel.org/xen-devel/20250313093157.30450-1-jgross@xxxxxxxx/
https://lore.kernel.org/xen-devel/d8c08c22-ee70-4c06-8fcd-ad44fc0dc58f@xxxxxxxx/

One you hopefully recognise, and the other is another bugfix to ARINC
noticed by the Coverity run over the weekend.

Please note that the Coverity report is not about a real bug, but just
a latent one. As long as the arinc653 scheduler supports only a single
physical CPU, there is no real need for the lock when accessing
sched_priv->next_switch_time (the lock is meant to protect the list
of units/vcpus, not all the other fields of sched_priv).


Juergen



 

