
Re: [Xen-API] [XCP-1.1] High OVS cpu load and unresponsive host network while VMPR archive phase is running

We use antispoofing protection (I published the patches to xen-api@ a few months ago) based on rules applied from /etc/xensource/scripts/vif. The rules look like this:

$ofctl add-flow $p_bridge "in_port=$port priority=39000 dl_type=0x0800 nw_src=$IP dl_src=$mac idle_timeout=0 action=normal"
$ofctl add-flow $p_bridge "in_port=$port priority=38500 dl_type=0x0806 dl_src=$mac nw_src=$IP idle_timeout=0 action=normal"
$ofctl add-flow $p_bridge "in_port=$port priority=38000 idle_timeout=0 action=drop"

During abnormal activity the packet counter of the last rule (the drop rule) grows quickly, and migrating the VM to another host causes the same symptoms on the new host.
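A minimal sketch of how that drop counter could be read per port. It assumes the flow dump has the usual "n_packets=..., priority=38000,in_port=N actions=drop" layout; the function name drop_counters is made up for this example and is not part of the vif script.

```shell
# Print "<in_port> <n_packets>" for every priority-38000 drop flow found
# in `ovs-ofctl dump-flows` output read from stdin.
# Assumption: each flow is printed on one line with comma/space separated
# key=value fields, as in " ... n_packets=12, ..., priority=38000,in_port=5 actions=drop".
drop_counters() {
    awk '/priority=38000/ && /actions=drop/ {
        port = ""; pkts = ""
        # Split the line on runs of spaces and commas to get key=value fields.
        n = split($0, fields, "[ ,]+")
        for (i = 1; i <= n; i++) {
            if (fields[i] ~ /^in_port=/)   port = substr(fields[i], 9)
            if (fields[i] ~ /^n_packets=/) pkts = substr(fields[i], 11)
        }
        print port, pkts
    }'
}

# Possible usage on a host (not executed here):
#   ovs-ofctl dump-flows "$p_bridge" | drop_counters
```

Running it twice a few seconds apart and comparing the numbers should show which port's drop counter is climbing.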

We simply shut down such VMs (attempting to use an unassigned IP address violates our terms of service) and message the owners, asking them to fix the problem.

It happens rarely (less than once a month), but it does happen.

03.08.2012 12:01, Christian Fischer wrote:
On Thursday 02 August 2012 23:46:18 George Shuklin wrote:
In a production environment I have seen that behavior a few times. The ovs-*
processes start to consume a lot of CPU (over 100%) and start to drop packets.

That usually happens with 'hacked' customer VMs (a sudden spike of
outgoing traffic and CPU and, in the few cases where we assisted in the
investigation, actual trojans running on the server because of some stupid
PHP misconfiguration in yet another phpBB/CMS/Drupal/etc.).
We have no customer VMs there, and we watch the vm traffic. Nothing unusual. The
archive phase is running. It's 100% reproducible.

I suppose that, in my case, it has something to do with the OpenFlow
controller (Citrix DVS Controller) we tried to evaluate. We are currently
running tests in a test environment to work out the problem.

By the way, what do you do to protect your production environment against
crashes caused by network flooding? IIRC Jesse Gross mentioned, maybe a year
ago, some work on patches to prevent a single VM from rendering the
network unresponsive. What's the state of that?

I'm not sure what exactly happens, but my hypothesis is that it is related
to the number of flows. When a trojan starts to flood traffic out to different
servers (SMTP/WWW spam/etc.), it causes lots of new connections.

On 01.08.2012 04:08, Ben Pfaff wrote:
Christian Fischer writes:
On Tuesday 31 July 2012 18:08:18 Ben Pfaff wrote:
Christian Fischer

We have no tagged VLANs here; all physical switch ports are running in access
mode. I wouldn't say that network load is increased when this happens,
15 kpps. Network performance could be poor due to either a vswitch issue
(it runs at 180% CPU load if the vswitch log doesn't lie) or high load
on, or cheap hardware in, the customer's shared backup storage. I've never
seen this stuff.
180% CPU load is impossible for OVS 1.0.1, which has only a
single process with a single thread.
Yes, that's right, but we run OVS 1.4.2

XCP build: 1.1.0-50674c
OVS build: 1.4.2
NICs: BCM5709 Gigabit TOE iSCSI Offload
OVS NIC bonding: active/active
Only the as-yet-unreleased post-1.8.0 Open vSwitch has more than
one process, and it still doesn't have multiple threads.

I suppose ovsdb-server and ovs-vswitchd could both go crazy at
the same time, but I haven't had any reports of that.

What process(es) add up to 180%?
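One way to answer that question, as a rough sketch: list per-process CPU for everything whose command name starts with "ovs" and add it up. The function name ovs_cpu_total is made up for this example. Note that the pcpu value from ps is CPU time divided by process lifetime, not an instantaneous reading, so top in batch mode may be a better source during a spike.

```shell
# Print "<comm> <pcpu>" for each process whose command name starts with
# "ovs" (e.g. ovs-vswitchd, ovsdb-server), then a "total" line.
# Input: lines of "<pcpu> <comm>" read from stdin.
ovs_cpu_total() {
    awk '$2 ~ /^ovs/ { total += $1; print $2, $1 }
         END { printf "total %.1f\n", total }'
}

# Possible usage on a host (not executed here):
#   ps -eo pcpu=,comm= | ovs_cpu_total
```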

Xen-api mailing list