
Re: [Xen-devel] pvops: Does PVOPS guest os support online "suspend/resume"



Hi,

I rechecked the different kernels today and found that I had made a mistake earlier. Sorry for misleading you all :)

 

All in all, the problems can be summarized in the 2 items below:

1 The 2.6.32 PVOPS guest kernel (I tested RHEL6.1 and RHEL6.3) does have bugs in ONLINE suspend/resume (checkpoint), which were, as Shriram mentioned, fixed in:

http://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/drivers/xen/manage.c?id=b3e96c0c756211e805c6941d4a6e5f6e1995cb6b

2 Kernels above 3.0 (I tested Ubuntu12.10 with kernel 3.5 and Ubuntu13.04 with kernel 3.8) seem to have another "bug":

  1) If we configure MULTIPLE VCPUS for the guest os, it has problems resuming (to be precise, thawing).

     In detail:

         <1>set the guest os with 4 vcpus

             in dom1.cfg: vcpus=4

         <2>xl create dom1.cfg

             execute the command "top -d 1" in guest dom1's vnc window

         <3>xl save -c dom1 /opt/dom1.save

         <4>after step <3>, we checked guest dom1's vnc window and found that:

             the kernel threads migration/1, migration/2 and migration/3 had their cpu usage go up to 100%

             the guest os couldn't respond to any request such as mouse movement or keyboard input.

             no "thaw" messages were printed in dom1's serial output.

 

  2) If we set only 1 vcpu for the guest os, it thaws and works fine.

  3) Another odd thing: if we use the save file generated in 2-1) to restore the guest, and then do an online suspend/resume (xl save -c, checkpoint), it works fine; no problems occur.

 

This problem occurs on guest os with kernel 3.5/3.8 (maybe other kernels as well; not tested). I hope the steps I followed were correct.

Have you ever encountered such a "suspend/resume checkpoint on multi-vcpu guest os" problem?

 

-------

PS: BTW, I'm wondering why using freeze/thaw instead of suspend/resume solves the problem with kernels below 3.0.

 It seems that blkfront_resume is still called if we use the thaw method here, because blkfront has no pm_op available.

 

    static int device_resume(struct device *dev, pm_message_t state, bool async)
    {
            ......

            if (dev->bus) {
                    if (dev->bus->pm) {
                            info = "bus ";
                            callback = pm_op(dev->bus->pm, state);
                    } else if (dev->bus->resume) {
                            info = "legacy bus ";
                            callback = dev->bus->resume;  /* is blkfront_resume called here? */
                            goto End;
                    }
            }

            ......
    }

 

Best Regards!

 

-Gonglei

 

From: Shriram Rajagopalan [mailto:rshriram@xxxxxxxxx]
Sent: Tuesday, August 13, 2013 2:05 AM
To: Gonglei (Arei)
Cc: Konrad Rzeszutek Wilk; xen-devel@xxxxxxxxxxxxx; Zhangbo (Oscar); Luonengjun; ian.campbell@xxxxxxxxxx; stefano.stabellini@xxxxxxxxxxxxx; rjw@xxxxxxx; Yanqiangjun; Jinjian (Ken)
Subject: Re: [Xen-devel] pvops: Does PVOPS guest os support online "suspend/resume"

 

On Mon, Aug 12, 2013 at 10:19 AM, Gonglei (Arei) <arei.gonglei@xxxxxxxxxx> wrote:

> > Thanks for responding. We've tried kernel "3.5.0-17 generic" (ubuntu 12.10),
> the problem still exists.
>
> So you have not tried v3.10. v3.5 is ancient from the upstream perspective.
>

Thank you, I didn't notice that. I will try 3.10 later.

 

 

3.5 may be ancient compared to 3.10 but from the suspend/resume support perspective,

I think things were fixed way back in 3.0 series.

 

 

Yes, the purpose of using "-c" here is to do an "ONLINE" suspend/resume. The problem occurs only with ONLINE suspend/resume,
not with OFFLINE suspend/resume. To be precise, 2 examples are listed below:
  <1>
  1) xl create dom1.cfg
  2) xl save -c dom1 /opt/dom1.save
     after this, the dom1 guest os has its io stuck, which means something is wrong with ONLINE suspend/resume.
  3) xl destroy dom1
  4) xl restore /opt/dom1.save
     the restored dom1 works fine, which means OFFLINE suspend/resume is OK.

 

I am a bit lost here. Didn't we fix suspend/resume issues in the 3.0 release window?

I tested it with both xm and xl save (with/without -c option). That was also when I fixed

some bugs in "xl save -c" code and introduced a minimal xl remus implementation (which is a continuous xl save -c).

And we had blkfront et al. at that time too.

 

Did the distros miss some kernel config (iirc it was HIBERNATE_CALLBACKS)?

 

So, did something fundamental change between 3.0 and 3.5, causing the "regression" that Gonglei is seeing?

 

 

  <2>
  1) xl create dom1.cfg
  2) xl save dom1 /opt/dom1.save
     no "-c" here, so it destroys the guest dom1 automatically.
  3) xl restore /opt/dom1.save
     the restored dom1 works fine, which means OFFLINE suspend/resume is OK.

 

This one always worked, even with stock 2.6 kernels.

 

shriram 

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 

