
RE: [Xen-devel] Scheduling anomaly with 4.0.0 (rc6)



One more follow-up on this:  It appears that I AM seeing
the "irreproducibility" problem as well, even with vcpus=1
for all guests.  The range is a bit smaller, more like
5%, though.

Sum of cpusec across all domains plus dom0 seems to be
very reproducible (within < 0.5%).  Total elapsed time
for the whole workload is what is widely varying.
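
For reference, the per-domain CPU numbers here are the Time(s)
column reported by "xm list".  A quick way to total them across
dom0 and all the guests, assuming Time(s) is still the sixth
column of the default output, is something like:

  xm list | awk 'NR>1 { sum += $6 } END { print sum }'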

So, either I/O is taking much longer in some cases
(despite an identical workload on identical hardware);
or the scheduler is selecting the idle domain much
more frequently; or... something else?

> -----Original Message-----
> From: Dan Magenheimer
> Sent: Tuesday, April 06, 2010 10:52 AM
> To: George Dunlap
> Cc: Xen-Devel (xen-devel@xxxxxxxxxxxxxxxxxxx)
> Subject: RE: [Xen-devel] Scheduling anomaly with 4.0.0 (rc6)
> 
> Hi George --
> 
> Thanks again for the reply.  Hope it's OK if I go back
> on-list...  I'm hoping others may be able to reproduce this,
> as my ability to experiment is limited now (see below).
> 
> > From: George Dunlap [mailto:George.Dunlap@xxxxxxxxxxxxx]
> >
> > 1) Make a large ramdisk in each VM, big enough for the whole kernel
> > tree and binaries.  Do the build there, and see if you have the same
> > discrepancy.
> 
> My test domains have 384MB each.  Dom0 has 256MB.
> (Total physical RAM is only 2GB.)  So this isn't
> really an option.
> 
> > 2) Play with the dom0 io scheduler and see if it has an effect.  If
> > your current one is "noop", that's suspicious; see if "cfq" works
> > better.
> 
> On dom0, /sys/block/sda/queue/scheduler shows [cfq].
> Don't know if this matters but /sys/block/tapdev*/queue/scheduler
> show [noop].
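
For anyone following along: the io scheduler can be switched at
runtime through the same sysfs file, and the active one is the
entry shown in brackets.  For example, as root:

  cat /sys/block/sda/queue/scheduler
  echo cfq > /sys/block/sda/queue/scheduler

I don't know offhand whether the tapdev queues accept anything
other than noop.
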
> 
> > 3) Take a trace of just the scheduling events, using xentrace...
> 
> I lost about a week of test runs that I'm working on for
> Xen Summit and have to re-do those before I do much
> experimenting, but will try out some of your ideas when
> my (week of redo) test runs are done.  In the meantime, I'm
> still monitoring the test runs that I am running now.
> (I need a reliable set of non-tmem runs as a base
> to compare various tmem runs against.)
> 
> I reported two problems that we can call:
> 1) "racing ahead", where one of a pair of identical domains
>    seems to get a lot more cycles than the other
> 2) "irreproducibility", where two seemingly identical
>    and heavily overcommitted test runs have timing results
>    that differ by an unreasonable amount (6-7%)
> 
> After reducing my test domains to a single vcpu, the
> "irreproducibility" problem seems to be greatly reduced.
> I made three runs and they differ by <0.3%.  So as
> best I can tell, this problem requires multi-vcpu domains.
> (Actually, I changed from "file" to "tap:aio" also so
> it could be that too.)
> 
> However, with:
> 
> a) vcpus=1 for the test domains (see previous post) and
> b) vcpus=1 for test domains and dom0_max_vcpus=1
> 
> I am still seeing the "racing ahead" problem.  On
> a current run of (b):
> 
> 142s dom0
> 479s 64-bit #1
> 454s 64-bit #2 <-- 6% less
> 536s 32-bit #1
> 447s 32-bit #2 <-- 16% less!
> 
> Again, this is a transitory oddity that may shed some
> light... after completion of the workload, the runtimes
> are very similar THOUGH #2 seems to always be the
> slower of the two by a small amount (<0.5%).
> 
> Thanks,
> Dan
> 
> > -----Original Message-----
> > From: George Dunlap [mailto:George.Dunlap@xxxxxxxxxxxxx]
> > Sent: Tuesday, April 06, 2010 5:24 AM
> > To: Dan Magenheimer
> > Subject: Re: [Xen-devel] Scheduling anomaly with 4.0.0 (rc6)
> >
> > How much memory does each VM have?  Another possibility is that this
> > has to do with unfairness in the block driver servicing requests.
> > Three ways you could test this hypothesis.
> >
> > 1) Make a large ramdisk in each VM, big enough for the whole kernel
> > tree and binaries.  Do the build there, and see if you have the same
> > discrepancy.
> >
> > 2) Play with the dom0 io scheduler and see if it has an effect.  If
> > your current one is "noop", that's suspicious; see if "cfq" works
> > better.
> >
> > 3) Take a trace of just the scheduling events, using xentrace, and use
> > xenalyze to see how much time each vcpu is spending running, runnable,
> > and blocked (waiting for the cpu).  If the scheduler is being unfair,
> > then some vcpus will spend more time "runnable" than others.  If it's
> > something else (the dom0 disk scheduler being unfair, or the vm just
> > using different amounts of memory) then "runnable" will not be
> > considerably higher.
> >
> > To do #3:
> >
> > # xentrace -D -e 0x28000 -S 32 /tmp/filename.trace
> >
> > Then download:
> > http://xenbits.xensource.com/ext/xenalyze.hg
> >
> > Make it, and run the following command:
> >
> > $ xenalyze -s --cpu-hz [speed-in-gigahertz]G filename.trace > filename.summary
> >
> > The summary file breaks information down by domain, then vcpu; look at
> > the "runstates" for each vcpu (running, runnable, blocked) and compare
> > them.
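
For anyone else who wants to try this, getting xenalyze amounts to
roughly the following (assuming mercurial and a normal build
environment are installed):

  hg clone http://xenbits.xensource.com/ext/xenalyze.hg xenalyze
  cd xenalyze
  make

Then run it against the trace as George describes above.
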
> >
> >  -George
> >
> > On Tue, Apr 6, 2010 at 12:17 AM, Dan Magenheimer
> > <dan.magenheimer@xxxxxxxxxx> wrote:
> > > For the record, I am seeing the same problem (first one,
> > > haven't yet got multiple runs) with vcpus=1 for all domains.
> > > Only on 32-bit this time and only 20%, but those may
> > > be random scheduling factors.  This is also with
> > > tap:aio instead of file so as to eliminate dom0 page
> > > caching effects.
> > >
> > >  394s dom0
> > > 2265s 64-bit #1
> > > 2275s 64-bit #2
> > > 2912s 32-bit #1
> > > 2247s 32-bit #2 <-- 20% less!
> > >
> > > I'm going to try a dom0_max_vcpus=1 run next.
> > >
> > >> -----Original Message-----
> > >> From: Dan Magenheimer
> > >> Sent: Monday, April 05, 2010 2:18 PM
> > >> To: George Dunlap
> > >> Cc: xen-devel@xxxxxxxxxxxxxxxxxxx
> > >> Subject: RE: [Xen-devel] Scheduling anomaly with 4.0.0 (rc6)
> > >>
> > >> Thanks for the reply!
> > >>
> > >> Well I'm now seeing something a little more alarming:  Running
> > >> an identical but CPU-overcommitted workload (just normal PV
> > >> domains, no tmem or ballooning or anything), what would you
> > >> expect the variance to be between successive identical measured
> > >> runs on identical hardware?
> > >>
> > >> I am seeing total runtimes, both measured by elapsed time and by
> > >> sum-of-CPUsec across all domains (incl dom0), vary by 6-7% or more.
> > >> This seems a bit unusual/excessive to me and makes it very hard
> > >> to measure improvements (e.g. by tmem, for an upcoming Xen summit
> > >> presentation) or benchmark anything complex.
> > >>
> > >> > Is it possible that Linux is just favoring one vcpu over the
> > >> > other for some reason?  Did you try running the same test but
> > >> > with only one VM?
> > >>
> > >> Well "make -j8" will likely be single-threaded part of the time,
> > >> but I wouldn't expect that to make that big a difference between
> > >> two identical workloads.
> > >>
> > >> I'm not sure I understand how I would run the same test with
> > >> only one VM when the observation of the strangeness requires
> > >> two VMs (and even then must be observed at random points during
> > >> execution).
> > >>
> > >> > Another theory would be that most interrupts are delivered to
> > >> > vcpu 0, so it may end up in "boost" priority more often.
> > >>
> > >> Hmmm... I'm not sure I get that, but what about _physical_ cpu 0
> > >> for Xen?  If all physical cpus are not the same and one VM
> > >> has an affinity for vcpu0-on-pcpu0 and the other has an affinity
> > >> for vcpu1-on-pcpu0, would that make a difference?
> > >>
> > >> But still, 40% seems very large and almost certainly a bug,
> > >> especially given the new observations above.
> > >>
> > >> > -----Original Message-----
> > >> > From: George Dunlap [mailto:George.Dunlap@xxxxxxxxxxxxx]
> > >> > Sent: Monday, April 05, 2010 8:44 AM
> > >> > To: Dan Magenheimer
> > >> > Cc: xen-devel@xxxxxxxxxxxxxxxxxxx
> > >> > Subject: Re: [Xen-devel] Scheduling anomaly with 4.0.0 (rc6)
> > >> >
> > >> > Is it possible that Linux is just favoring one vcpu over the
> > >> > other for some reason?  Did you try running the same test but
> > >> > with only one VM?
> > >> >
> > >> > Another theory would be that most interrupts are delivered to
> > >> > vcpu 0, so it may end up in "boost" priority more often.
> > >> >
> > >> > I'll re-post the credit2 series shortly; Keir said he'd accept
> > >> > it post-4.0.  You could try it with that and see what the
> > >> > performance is like.
> > >> >
> > >> >  -George
> > >> >
> > >> > On Fri, Apr 2, 2010 at 5:48 PM, Dan Magenheimer
> > >> > <dan.magenheimer@xxxxxxxxxx> wrote:
> > >> > > I've been running some heavy testing on a recent Xen 4.0
> > >> > > snapshot and seeing a strange scheduling anomaly that
> > >> > > I thought I should report.  I don't know if this is
> > >> > > a regression... I suspect not.
> > >> > >
> > >> > > System is a Core 2 Duo (Conroe).  Load is four 2-VCPU
> > >> > > EL5u4 guests, two of which are 64-bit and two of which
> > >> > > are 32-bit.  Otherwise they are identical.  All four
> > >> > > are running a sequence of three Linux compiles with
> > >> > > (make -j8 clean; make -j8).  All are started approximately
> > >> > > concurrently: I synchronize the start of the test after
> > >> > > all domains are launched with an external NFS semaphore
> > >> > > file that is checked every 30 seconds.
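
(The start gate inside each guest is nothing fancy; it amounts to
roughly this, where /nfs/go stands in for the actual semaphore
file:

  while [ ! -e /nfs/go ]; do sleep 30; done
  for i in 1 2 3; do make -j8 clean; make -j8; done

so all four domains begin within about 30 seconds of each other.)
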
> > >> > >
> > >> > > What I am seeing is a rather large discrepancy in the
> > >> > > amount of time consumed "underway" by the four domains
> > >> > > as reported by xentop and xm list.  I have seen this
> > >> > > repeatedly, but the numbers in front of me right now are:
> > >> > >
> > >> > > 1191s dom0
> > >> > > 3182s 64-bit #1
> > >> > > 2577s 64-bit #2 <-- 20% less!
> > >> > > 4316s 32-bit #1
> > >> > > 2667s 32-bit #2 <-- 40% less!
> > >> > >
> > >> > > Again these are identical workloads and the pairs
> > >> > > are identical released kernels running from identical
> > >> > > "file"-based virtual block devices containing released
> > >> > > distros.  Much of my testing had been with tmem and
> > >> > > self-ballooning so I had blamed them for a while,
> > >> > > but I have reproduced it multiple times with both
> > >> > > of those turned off.
> > >> > >
> > >> > > At start and after each kernel compile, I record
> > >> > > a timestamp, so I know the same work is being done.
> > >> > > Eventually the workload finishes on each domain and
> > >> > > intentionally crashes the kernel so measurement is
> > >> > > stopped.  At the conclusion, the 64-bit pair have
> > >> > > very similar total CPU sec and the 32-bit pair have
> > >> > > very similar total CPU sec so eventually (presumably
> > >> > > when the #1's are done hogging CPU), the "slower"
> > >> > > domains do finish the same amount of work.  As a
> > >> > > result, it is hard to tell from just the final
> > >> > > results that the four domains are getting scheduled
> > >> > > at very different rates.
> > >> > >
> > >> > > Does this seem like a scheduler problem, or are there
> > >> > > other explanations? Anybody care to try to reproduce it?
> > >> > > Unfortunately, I have to use the machine now for other
> > >> > > work.
> > >> > >
> > >> > > P.S. According to xentop, there is almost no network
> > >> > > activity, so it is all CPU and VBD.  And VBD activity
> > >> > > across the domains looks to be in approximately the same
> > >> > > ratio as CPU(sec).
> > >> > >
> > >> > > _______________________________________________
> > >> > > Xen-devel mailing list
> > >> > > Xen-devel@xxxxxxxxxxxxxxxxxxx
> > >> > > http://lists.xensource.com/xen-devel
> > >> > >
> > >
> > > _______________________________________________
> > > Xen-devel mailing list
> > > Xen-devel@xxxxxxxxxxxxxxxxxxx
> > > http://lists.xensource.com/xen-devel
> > >

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 

