Re: [Xen-devel] [BUG] xen-mceinj tool testing cause dom0 crash
> -----Original Message-----
> From: Zhang, Haozhong
> Sent: Thursday, November 9, 2017 9:45 AM
> To: Jan Beulich <JBeulich@xxxxxxxx>; Hao, Xudong <xudong.hao@xxxxxxxxx>
> Cc: Julien Grall <julien.grall@xxxxxxx>; George Dunlap
> <George.Dunlap@xxxxxxxxxx>; Lars Kurth <lars.kurth@xxxxxxxxxx>;
> xen-devel@xxxxxxxxxxxxx
> Subject: Re: [Xen-devel] [BUG] xen-mceinj tool testing cause dom0 crash
>
> On 11/07/17 01:37 -0700, Jan Beulich wrote:
> > >>> On 07.11.17 at 09:23, <xudong.hao@xxxxxxxxx> wrote:
> > >> From: Jan Beulich [mailto:JBeulich@xxxxxxxx]
> > >> Sent: Tuesday, November 7, 2017 4:09 PM
> > >> >>> On 07.11.17 at 02:37, <xudong.hao@xxxxxxxxx> wrote:
> > >> >> From: Jan Beulich [mailto:JBeulich@xxxxxxxx]
> > >> >> Sent: Monday, November 6, 2017 5:17 PM
> > >> >> >>> On 03.11.17 at 09:29, <xudong.hao@xxxxxxxxx> wrote:
> > >> >> > We figured out the problem, some corner scripts triggered the
> > >> >> > error injection at the same page (pfn 0x180020) twice, i.e.
> > >> >> > "./xen-mceinj -t 0" run over one time, which resulted in Dom0 crash.
> > >> >>
> > >> >> But isn't this a valid scenario, which shouldn't result in a kernel
> > >> >> crash? What if two successive #MCs occurred for the same page?
> > >> >> I.e. ...
> > >> >>
> > >> >
> > >> > Yes, it's another valid scenario, the expect result is kernel crash.
> > >>
> > >> Kernel _crash_ or rather kernel _panic_? Of course without any
> > >> kernel messages we can't tell one from the other, but to me this
> > >> makes a difference nevertheless.
> > >>
> > > Exactly, Dom0 crash.
> >
> > I don't believe a crash is the expected outcome here.
> >
>
> This test case injects two errors to the same dom0 page. During the first
> injection, offline_page() is called to set PGC_broken flag of that page.
> During the second injection, offline_page() detects the same broken page
> is touched again, and then tries to shutdown the page owner, i.e. dom0 in
> this case:
>
>     /*
>      * NB. When broken page belong to guest, usually hypervisor will
>      * notify the guest to handle the broken page. However, hypervisor
>      * need to prevent malicious guest access the broken page again.
>      * Under such case, hypervisor shutdown guest, preventing recursive mce.
>      */
>     if ( (pg->count_info & PGC_broken) && (owner = page_get_owner(pg)) )
>     {
>         *status = PG_OFFLINE_AGAIN;
>         domain_shutdown(owner, SHUTDOWN_crash);
>         return 0;
>     }
>
> So I think Dom0 crash and the following machine reboot are the expected
> behaviors here.
>
> But, it looks a (unexpected) page fault happens during the reboot.
> Xudong, can you check whether a normal reboot on that machine triggers a
> page fault?
>

Yes, a normal rebooting of Dom0 triggered a Xen page fault on Intel Skylake
4 sockets platform, but no page fault on Skylake 2 sockets system and
Broadwell platforms.

Haozhong, will you fix this page fault issue?

Thanks,
-Xudong

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel
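
The double-injection behaviour described in the quoted reply can be modelled
outside the hypervisor. The following is only a minimal standalone sketch, not
Xen code: the struct layouts, the MODEL_PGC_BROKEN bit value, and the names
model_offline_page() and crashes_after_injections() are hypothetical stand-ins
chosen for illustration. It assumes nothing beyond what the thread states: the
first offline request marks the page broken, and a later request on the same
owned page crashes the owner.

```c
#include <stddef.h>

/* Hypothetical stand-ins for illustration; NOT Xen's real definitions. */
#define MODEL_PGC_BROKEN 0x1u

enum model_status { MODEL_OFFLINE_OK, MODEL_OFFLINE_AGAIN };

struct model_domain {
    int crashed;                 /* set when the owner is shut down */
};

struct model_page {
    unsigned int count_info;     /* holds the "broken" flag */
    struct model_domain *owner;  /* NULL if the page has no owner */
};

/*
 * Simplified model of the check quoted above: a page that is already
 * marked broken and still has an owner causes that owner to be crashed.
 * Otherwise the page is marked broken (first injection).
 */
static enum model_status model_offline_page(struct model_page *pg)
{
    if ((pg->count_info & MODEL_PGC_BROKEN) && pg->owner != NULL) {
        /* Models domain_shutdown(owner, SHUTDOWN_crash). */
        pg->owner->crashed = 1;
        return MODEL_OFFLINE_AGAIN;
    }

    pg->count_info |= MODEL_PGC_BROKEN;
    return MODEL_OFFLINE_OK;
}

/*
 * Inject n errors into one dom0-owned page and report whether the
 * owner ended up crashed (1) or not (0).
 */
static int crashes_after_injections(int n)
{
    struct model_domain dom0 = { 0 };
    struct model_page pg = { 0u, &dom0 };

    for (int i = 0; i < n; i++)
        (void)model_offline_page(&pg);

    return dom0.crashed;
}
```

In this model, crashes_after_injections(1) leaves the owner running, while
crashes_after_injections(2) reports a crash, matching the explanation that
running "./xen-mceinj -t 0" twice against the same page takes dom0 down. The
page fault seen during the subsequent reboot is a separate issue and is not
modelled here.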