
Re: CPU oversubscription =?=> spontaneous reboots



On Wed, Dec 04, 2024 at 10:21:38PM +0000, Mike wrote:
>
> The second domU is the sole worker node for the cluster.  The command that I
> ran in it that triggered the reboot was `kubectl delete -f` of a Deployment
> that was already running from an `apply`.

Okay, do you have a full list of what this command does?

Might it cause a crucial Xen domain (domain 0) to panic, and this in turn
cause Xen to panic?


On Thu, Dec 05, 2024 at 03:15:10PM +0000, Mike wrote:
> Joost Roeleveld wrote:
> > How do you overcommit memory?
>
> I don't.  I took the memory from the other domU.

How much free memory does Xen have?  Try running `xl info`, what does
the "free_memory" line say?

It might be 0 if Xen is ballooning memory from domain 0 to handle
allocations.  If ballooning memory from domain 0 has been disabled, this
should stay above 50 so Xen can allocate memory to handle activity.
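
If you want to watch that value over time rather than eyeball it, here is
a rough sketch in Python.  It assumes `xl info` prints `key : value` lines
and that "free_memory" is a plain integer (MiB, as far as I know).  The
function names and the default threshold of 50 are only illustrative, they
are not part of any Xen tool.

```python
import subprocess


def parse_free_memory(xl_info_text):
    """Return the free_memory value from `xl info` output, or None if absent."""
    for line in xl_info_text.splitlines():
        # Assumed line shape: "free_memory            : 128"
        key, sep, value = line.partition(":")
        if sep and key.strip() == "free_memory":
            return int(value.strip())
    return None


def looks_low(free_memory, threshold=50):
    """True if the value is at or below the threshold (50 is the rule of thumb above)."""
    return free_memory is not None and free_memory <= threshold


def read_free_memory():
    """Run `xl info` (needs root in domain 0) and return the parsed free_memory."""
    result = subprocess.run(
        ["xl", "info"], capture_output=True, text=True, check=True
    )
    return parse_free_memory(result.stdout)
```

Calling `looks_low(read_free_memory())` from cron or a loop in domain 0
while you repeat the `kubectl delete` would show whether Xen's free pool
is being exhausted just before the reboot.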


On Thu, Dec 05, 2024 at 03:23:22PM +0000, Mike wrote:
> Paul Leiber wrote:
> > Could it be possible that it's not the activity on the DomU that is
> > triggering the reboot, but rather network activity between two DomUs?
> 
> Sure, that's possible.  The domUs are a k8s control and worker node,
> respectively, so they need to communicate with each other when I issue the
> `kubectl delete` that trigger it.
> 
> But I resolved the issue (for now) by increasing the control node's
> admittedly tight memory.  So that doesn't point to a network issue in my
> mind.

Is either of these also domain 0?  Domain 0 exhausting its free memory
and panicking might cause the issue you're describing.

> > What CPU architecture is your system based on?
> 
> amd64
> Intel Core i9-14900T

Apparently there is a major issue with 14900K processors.  I've been
reading mentions of other Intel 13xxx and 14xxx chips reputedly having
failures at lower rates.

Right now there could still be configuration issues, but I would keep an
eye out for hardware failure.


-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg@xxxxxxx  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445





 

