
Re: [Xen-devel] Guest start issue on ARM (maybe related to Credit2) [Was: Re: [xen-unstable test] 113807: regressions - FAIL]

On 09/25/2017 03:07 PM, Dario Faggioli wrote:

Hi Dario,

On Mon, 2017-09-25 at 09:46 +0000, osstest service owner wrote:
flight 113807 xen-unstable real [real]

So, triggered by this:

Tests which are failing intermittently (not blocking):
  test-armhf-armhf-xl-credit2 16 guest-start/debian.repeat fail in
113791 pass in 113807

I went and had a look, and discovered that, from time to time, we do
indeed fail to create a guest on ARM with Credit2.

Looking here:

It seems to be happening only on the cubietracks, but in a non-linear
and non-deterministic fashion. E.g., 113791 failed on metzinger, which
is fine on 113800; 113611 and 113618 failed on baroque, which is fine
on 113638.

I don't see much in the logs, TBH, but both `xl vcpu-list' and the 'r'
debug key seem to suggest that vCPU 0 is running, while the other vCPUs
have never run... like it was an issue with secondary (v)CPU bringup.

It indeed shows up with Credit2, as if it were _specific_ to it, but I'm
not 100% sure. In fact, it seems to never show up here:

but it looks like it may have shown up in 112460 (but we don't have the
logs any longer):

So... ARM people? Does this ring any bell? Is this something known, or
easy to explain? What can I do to help?

It definitely rings a bell. I saw a similar trace in July and have been working on a potential fix since then.

Most of the time when guest-start/debian.repeat fails, vCPU 0 is in a data/prefetch abort state. My guess is that this is a latent cache bug that Credit2 happens to expose.

Indeed, the arm32 kernel uses set/way cache flush instructions at boot time. They are used to clean each level of cache, one by one, on each CPU.
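
To make the set/way mechanism concrete, here is a minimal sketch in C of how the register operand for a set/way operation (for example DCCISW) is composed from a cache level, a set index and a way index. The function names `ceil_log2` and `setway_operand` are hypothetical and are not taken from Linux or Xen; the bit layout follows the ARMv7-A description of set/way operands (way in the top bits, set above the line-offset bits, level in bits [3:1]). It is only an illustration of the encoding, not code from either project.

```c
#include <stdint.h>

/* Smallest number of bits b such that (1 << b) >= n, for n >= 1. */
static unsigned int ceil_log2(uint32_t n)
{
    unsigned int bits = 0;

    while (bits < 32 && ((uint32_t)1 << bits) < n)
        bits++;
    return bits;
}

/*
 * Build the operand for a data cache set/way operation.
 *
 * level      - zero-based cache level (0 means L1, 1 means L2, ...)
 * set, way   - indices within that cache level
 * line_bytes - cache line length in bytes
 * num_ways   - associativity of that cache level
 *
 * Hypothetical helper for illustration only.
 */
uint32_t setway_operand(uint32_t level, uint32_t set, uint32_t way,
                        uint32_t line_bytes, uint32_t num_ways)
{
    unsigned int l = ceil_log2(line_bytes); /* log2 of the line length */
    unsigned int a = ceil_log2(num_ways);   /* log2 of the associativity */
    uint32_t val = (level << 1) | (set << l);

    /* A direct-mapped cache has no way field; avoid shifting by 32. */
    if (a != 0)
        val |= way << (32 - a);
    return val;
}
```

A guest kernel that flushes by set/way loops over every level, set and way and issues one such operation per combination. Each operation acts on the caches of the CPU executing it, which is why the physical CPU the loop runs on matters.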

At the moment, Xen does not trap those instructions. As you know, a cache may not be private to a given physical processor. So if the vCPU happens to be migrated to another physical CPU, it may hit stale data.

This means we have to trap and emulate set/way instructions. Per the ARM ARM, and also from experience, emulating them is non-trivial.

Thankfully, people are trying to get rid of those instructions. For instance, arm64 Linux does not use them anymore. Sadly, the arm32 Linux maintainer does not want to remove them... They are also used by EDK2 at the moment.

The solution is to go through the P2M and clean & invalidate every page one by one. This process is really, really slow given that Xen on Arm always populates the P2M at guest creation.
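
The P2M walk described above can be sketched roughly as follows. This is a toy model under stated assumptions, not Xen code: the P2M is flattened into an array, `struct p2m_entry`, `p2m_clean_invalidate_all` and `count_line` are hypothetical names, the page and cache line sizes are assumed to be 4096 and 64 bytes, and the per-line cache maintenance is replaced by a callback so the example can run anywhere.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE  4096u /* assumed page size */
#define CACHE_LINE 64u   /* assumed cache line length */

/* Toy P2M entry: maps a guest frame to a machine frame when valid. */
struct p2m_entry {
    bool valid;
    uint64_t mfn;
};

/* Stand-in for a clean & invalidate of one cache line by address. */
typedef void (*line_op_t)(uint64_t maddr, void *ctx);

/*
 * Walk every guest frame and clean & invalidate each mapped page,
 * one cache line at a time. Returns the number of pages processed.
 */
size_t p2m_clean_invalidate_all(const struct p2m_entry *p2m, size_t nr_gfns,
                                line_op_t op, void *ctx)
{
    size_t pages = 0;

    for (size_t gfn = 0; gfn < nr_gfns; gfn++) {
        if (!p2m[gfn].valid)
            continue; /* nothing mapped here, nothing to flush */

        uint64_t base = p2m[gfn].mfn * PAGE_SIZE;

        for (uint64_t off = 0; off < PAGE_SIZE; off += CACHE_LINE)
            op(base + off, ctx);
        pages++;
    }
    return pages;
}

/* Example callback: just counts how many line operations were issued. */
void count_line(uint64_t maddr, void *ctx)
{
    (void)maddr;
    (*(size_t *)ctx)++;
}
```

In this model the cost is proportional to the number of valid entries. When the whole P2M is populated at guest creation, every entry is valid, so the walk scales with the full guest memory size.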

So I have been working for the past 2 months on adding PoD (populate-on-demand) support on Arm. I have a proof of concept that boots a guest and properly handles set/way cache instructions.

I am still cleaning up my work and can hopefully post a couple of series soon. This is not targeting Xen 4.10, and I am not even sure it would fix the problem here. But that's my best guess.


Julien Grall
