
[RFC] DVFS and Thermal management subsystem proposal


  • To: "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • From: Oleksii Moisieiev <Oleksii_Moisieiev@xxxxxxxx>
  • Date: Thu, 7 Jul 2022 10:35:00 +0000
  • Accept-language: en-US
  • Cc: Jan Beulich <jbeulich@xxxxxxxx>, Juergen Gross <jgross@xxxxxxxx>, Wei Liu <wl@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Julien Grall <julien@xxxxxxx>, Volodymyr Babchuk <Volodymyr_Babchuk@xxxxxxxx>, Bertrand Marquis <bertrand.marquis@xxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxx>, Oleksii Moisieiev <Oleksii_Moisieiev@xxxxxxxx>, Nick Rosbrook <rosbrookn@xxxxxxxxxxxx>, Anthony PERARD <anthony.perard@xxxxxxxxxx>, Paul Durrant <paul@xxxxxxx>
  • Delivery-date: Thu, 07 Jul 2022 10:38:01 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

# Synopsis
This document describes the design of thermal-based CPU throttling in
virtualized environments. The goal is to provide a generic thermal management
subsystem, which should work with the existing cpufreq subsystem in XEN and
could be used on various architectures and hardware.

# Cpufreq subsystem in XEN

## Brief overview

   Governors
+--------------------+
| +----------------+ |  struct cpufreq_governor {
| |  ondemand      | |      .name
| +----------------+ |      .governor
| +----------------+ |      .handle_option
| |  powersave     | |  }
| +----------------+ |
| +----------------+ |                              +----------------------+
| |  performance   | |->cpufreq_register_governor() | +-------------------+|
| +----------------+ |                              | |  cpufreq_dev_drv  ||
| +----------------+ |   cpufreq_register_driver()->| +-------------------+|
| |  userspace     | |                              | +-------------------+|
| +----------------+ |                              | |     ...           ||
| +----------------+ |                              | +-------------------+|
| |  ...           | |    struct cpufreq_driver {   +----------------------+
| +----------------+ |       .init                  +----------------------+
+--------------------+       .verify                |    Hardware          |
                             .setpolicy             +----------------------+
                             .update
                             .target
                             .get
                             .getavg
                             .exit
                          }

The cpufreq subsystem consists of 2 parts:
1) The cpufreq governor, which should be registered using the
cpufreq_register_governor call;
2) The cpufreq driver, which provides access to the hardware and should be
registered using the cpufreq_register_driver call.

## Hardware drivers

We have implemented two cpufreq hardware drivers (see Appendix 1 and
Appendix 2) to support Rcar-3 and i.MX8 boards. These drivers are designed to
support the thermal throttling subsystem. They are going to be part of the
contribution package.

## Configuration options

The cpufreq subsystem is enabled with the following config parameter:
+-----------------------------------------------------------------------------+
CONFIG_HAS_CPUFREQ=y
+-----------------------------------------------------------------------------+

The cpufreq device driver is platform-specific and can be selected at compile
time by setting the config parameter:
+-----------------------------------------------------------------------------+
CONFIG_CPUFREQ_XXX
+-----------------------------------------------------------------------------+
Where XXX is the platform name.
Additional configuration is also possible, either through device-tree nodes or
through ACPI configuration. The current implementation supports only
device-tree configuration.
The device-tree configuration is defined by the cpufreq driver implementation
and mostly uses device-tree bindings from the Linux kernel. The Linux kernel
defines common and platform-specific cpufreq bindings.
See [0] /Documentation/devicetree/bindings/cpufreq and
[0] /Documentation/devicetree/bindings/opp for details.
Some examples can be found in Appendix 1 and Appendix 2.

The cpufreq driver is initialized at Xen start based on the configuration
parameters. Only one cpufreq device driver can be enabled on a system.
Switching to the diff
Cpufreq hardware driver should be probed based on device-tree nodes or ACPI
configuration.

The default governor can be set from the xen-bootargs and has the following
format:
+-----------------------------------------------------------------------------+
cpufreq=xen:ondemand
+-----------------------------------------------------------------------------+

xl.cfg (the guest configuration file) supports the following configuration
option: guestpm. It defines the PM policy for the given guest. For example:
+-----------------------------------------------------------------------------+
guestpm = "0-7"
+-----------------------------------------------------------------------------+
The line guestpm = "0-7" allows the guest to choose OPP levels from 0 to 7 out
of 15. Higher OPP levels will be ignored by the hypervisor.
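
As a minimal sketch of how such a range could be applied, the following C
fragment parses a "lo-hi" string and checks whether a requested OPP level falls
inside it. The names guestpm_range, guestpm_parse and guestpm_allows are
hypothetical and are not part of Xen or xl; the proposal only states that
levels above the range are ignored.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical container for a parsed guestpm = "lo-hi" value. */
struct guestpm_range {
    unsigned int lo;
    unsigned int hi;
};

/* Parse "lo-hi" (for example "0-7"). Returns true on success. */
static bool guestpm_parse(const char *s, struct guestpm_range *r)
{
    char *end;
    unsigned long lo, hi;

    if (s == NULL || *s < '0' || *s > '9')
        return false;
    lo = strtoul(s, &end, 10);
    if (*end != '-' || end[1] < '0' || end[1] > '9')
        return false;
    hi = strtoul(end + 1, &end, 10);
    if (*end != '\0' || lo > hi)
        return false;

    r->lo = (unsigned int)lo;
    r->hi = (unsigned int)hi;
    return true;
}

/* A request outside the configured range would be ignored by the caller. */
static bool guestpm_allows(const struct guestpm_range *r, unsigned int opp)
{
    return opp >= r->lo && opp <= r->hi;
}
```

With the range "0-7", guestpm_allows() accepts level 7 and rejects level 8.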


# XEN Dynamic Thermal management design

## Synopsis

This section introduces the design of Dynamic Thermal Management for the Xen
hypervisor. This feature is an enhancement of the Xen DVFS feature and will
allow the system admin to configure different thermal governors, which will
perform CPU throttling based on the CPU core temperatures and the thermal
configuration.

## Top level design

+-----------------------------------------------+
|    XEN                                        |
|              +-------------------+            |
|              |      Thermal      |            |
|       +----->|     Governor      |            |
|       |      +---------|---------+            |
|       |                |                      |
|       |                +-------+              |
|       |                        |              |
|  +------------------+  +------------------+   |
|  |   Thermal        |  |    Cpufreq       |   |
|  |   Driver         |  |                  |   |
|  +------------------+  +------------------+   |
|                                               |
+-----------------------------------------------+
                    ^
                    |
                    |
           +--------v--------+
           |                 |
           |    Hardware     |
           |                 |
           +-----------------+


## Thermal management subsystem design in XEN

 +------------------+
 | +--------------+ |
 | |  powersave   | |               struct thermal_governor {
 | +--------------+ |                   .name
 | +--------------+ |                   .governor
 | |   stepwise   | |<------------+     .handle_option
 | +--------------+ |             | }
 | +--------------+ |             |
 | |     ...      | |             |
 | +--------------+ |             |
 +------------------+             v
          +----------------->register_thermal_governor()
          |
+---------v--------+                         Polling temperature
|   dyn_thermal    |<--------+             +--------------------+
+------------------+         +------------>|  polling_handler() |
                                           +--------------------+
                                          +-------------------------------+
 register_thermal_driver()                | __cpufreq_driver_target(HIGH) |
         +                                +-------------------------------+
                       struct thermal_driver {   Set HIGH priority to the
 +------------------+     .name                  target policy. So this
 |  thermal_driver  |     .get_trips             configuration will override
 +------------------+     .get_temp              cpufreq governor
                          .set_alarm_temp
 +------------------+  }
 | thermal_sensors  |
 +------------------+

The dynamic thermal feature consists of 2 entities: the thermal governor and
the thermal driver.

The thermal governor should be registered using register_thermal_governor and
will provide the following interface:

+-----------------------------------------------------------------------------+
struct thermal_governor {
    .name = "name",
    .governor = gov_dbs,
    .handle_option = handle_opt,
    .temp_handler = t_handler,
}
+-----------------------------------------------------------------------------+

Here .governor should process commands (start/stop/event). The event command
is needed if the hw driver supports setting a temperature alarm
(set_alarm_temp). The governor is also responsible for polling the temperature
and for throttling by setting the cpufreq policy. The cpufreq policy will be
set with HIGH priority to override commands from the cpufreq governor.
Commands from the cpufreq governor should be ignored while throttling is in
progress.
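
The override rule can be sketched as a small arbiter in C. This is an
illustration under assumptions: the proposal mentions a HIGH priority passed to
__cpufreq_driver_target, but the types and names below (req_prio, freq_arbiter,
arbiter_request, arbiter_release) are hypothetical and do not reflect the
existing Xen signature.

```c
#include <stdbool.h>

/* Hypothetical request priorities: NORMAL for the cpufreq governor,
 * HIGH for the thermal governor. */
enum req_prio { PRIO_NORMAL, PRIO_HIGH };

/* Hypothetical per-policy state tracking who owns the target frequency. */
struct freq_arbiter {
    unsigned int cur_khz;   /* frequency currently programmed */
    bool throttling;        /* true while a HIGH request is active */
};

/* Returns true if the request was applied, false if it was ignored. */
static bool arbiter_request(struct freq_arbiter *a, enum req_prio prio,
                            unsigned int khz)
{
    if (prio == PRIO_NORMAL && a->throttling)
        return false;       /* cpufreq governor is ignored while throttling */

    if (prio == PRIO_HIGH)
        a->throttling = true;

    a->cur_khz = khz;
    return true;
}

/* Called by the thermal governor once the temperature has dropped. */
static void arbiter_release(struct freq_arbiter *a)
{
    a->throttling = false;
}
```

Once the thermal governor has issued a PRIO_HIGH request, PRIO_NORMAL requests
are rejected until arbiter_release() is called.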

The thermal driver should provide access to the hardware and expose an
interface to the sensor information. The thermal driver is responsible for the
configuration and should provide this configuration to the governor. We are
planning to provide support for the Rcar-3 and i.MX8 boards (see Appendix 3
and Appendix 4).

+-----------------------------------------------------------------------------+
struct thermal_driver {
     .name
     .get_trips
     .get_temp
     .set_alarm_temp
}
+-----------------------------------------------------------------------------+
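
To make the driver/governor split concrete, here is a hedged C sketch of the
interface with function-pointer types filled in and one polling step built on
top of it. The callback signatures, the thermal_trips layout, the fake driver,
and the names thermal_poll_once and thermal_action are assumptions made for
illustration; the proposal lists only the member names. The trip values 107
and 117 are taken from the xenhypfs example in this document.

```c
#include <stddef.h>

/* Hypothetical trip configuration, in degrees Celsius. */
struct thermal_trips {
    int passive;    /* start throttling at or above this temperature */
    int critical;   /* emergency action at or above this temperature */
};

/* Assumed callback signatures; the proposal names the members only. */
struct thermal_driver {
    const char *name;
    int (*get_trips)(struct thermal_trips *trips);
    int (*get_temp)(unsigned int sensor, int *temp);
    int (*set_alarm_temp)(unsigned int sensor, int temp);  /* optional */
};

enum thermal_action { THERMAL_OK, THERMAL_THROTTLE, THERMAL_CRITICAL,
                      THERMAL_ERROR };

/* One polling step: read a sensor and classify it against the trips. */
static enum thermal_action thermal_poll_once(const struct thermal_driver *drv,
                                             unsigned int sensor)
{
    struct thermal_trips trips;
    int temp;

    if (drv->get_trips(&trips) != 0 || drv->get_temp(sensor, &temp) != 0)
        return THERMAL_ERROR;

    if (temp >= trips.critical)
        return THERMAL_CRITICAL;
    if (temp >= trips.passive)
        return THERMAL_THROTTLE;
    return THERMAL_OK;
}

/* Fake driver for demonstration; a real one would read the hardware. */
static int fake_temp = 85;

static int fake_get_trips(struct thermal_trips *t)
{
    t->passive = 107;
    t->critical = 117;
    return 0;
}

static int fake_get_temp(unsigned int sensor, int *temp)
{
    (void)sensor;           /* the fake driver has a single sensor */
    *temp = fake_temp;
    return 0;
}

static const struct thermal_driver fake_driver = {
    .name = "fake",
    .get_trips = fake_get_trips,
    .get_temp = fake_get_temp,
    .set_alarm_temp = NULL, /* no hardware alarm: governor must poll */
};
```

A governor's polling handler could call thermal_poll_once() periodically and
throttle on THERMAL_THROTTLE; how THERMAL_CRITICAL is handled depends on the
governor.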

## Governors

In the Linux kernel there is an entity called a thermal governor, which is
responsible for the system behaviour when critical temperatures are reached.
The following governors are going to be implemented in Xen:

### Powersave governor

Sets the minimal CPU frequency when the passive trip temperature is reached.
Reboots the board when the critical temperature is reached.

### Fair-share governor

Uses 3 parameters to calculate the throttle state:
• P1: max throttle state;
• P2: percentage[I]/100, which shows how effective the device is;
• P3: cur_trip_level/max_no_of_trips.
New throttle state of the CPU = P3 * P2 * P1
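
The product above can be written with integer arithmetic as in the following
sketch. The function name fair_share_state and its parameter names are
hypothetical; only the formula P3 * P2 * P1 comes from the text.

```c
/* New state = P1 * P2 * P3, computed with integer arithmetic:
 *   P1 = max_state
 *   P2 = percentage / 100
 *   P3 = cur_trip_level / max_trips
 * The division is done last so that the fractions are not truncated early. */
static unsigned int fair_share_state(unsigned int max_state,
                                     unsigned int percentage,
                                     unsigned int cur_trip_level,
                                     unsigned int max_trips)
{
    if (max_trips == 0)
        return 0;           /* no trips configured: never throttle */

    return (max_state * percentage * cur_trip_level) / (100u * max_trips);
}
```

For example, with 15 throttle states, percentage = 100 and the first of two
trips crossed, the result is 15 * 1.0 * 0.5 = 7.5, truncated to state 7.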

### Step-wise governor

Moves the throttle state one step up if the temperature is rising and one step
down otherwise.
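
A minimal sketch of this stepping rule in C, assuming throttle states are
numbered from 0 (no throttling) to max_state. The function name step_wise_next
is hypothetical.

```c
#include <stdbool.h>

/* Move one throttle state up when the temperature is rising and one state
 * down otherwise, staying within [0, max_state]. */
static unsigned int step_wise_next(unsigned int cur_state,
                                   unsigned int max_state,
                                   bool temp_rising)
{
    if (temp_rising)
        return cur_state < max_state ? cur_state + 1 : max_state;

    return cur_state > 0 ? cur_state - 1 : 0;
}
```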

### User-space governor

Notifies guests when the trip temperature is reached by setting a flag in
xenhypfs.

### Thermal governor configuration

Thermal governors should be enabled in the Xen config parameters:
+-----------------------------------------------------------------------------+
CONFIG_HAS_THERMAL=y
CONFIG_GOV_THERMAL_FAIR_SHARE=y
CONFIG_GOV_THERMAL_STEP_WISE=y
CONFIG_GOV_THERMAL_POWERSAVE=y
CONFIG_GOV_THERMAL_USERSPACE=y
+-----------------------------------------------------------------------------+
Here CONFIG_HAS_THERMAL enables Dynamic Thermal Management. The other
parameters enable different thermal governors in the system. If no governor is
set in the cmdline, the default governor is STEP_WISE, or the first one in the
list if STEP_WISE was not enabled.
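
The selection order can be sketched in C as follows. The function name
thermal_pick_default is hypothetical, and falling back to the default when the
cmdline names a governor that is not enabled is an assumption; the proposal
does not specify that case.

```c
#include <stddef.h>
#include <string.h>

/* Pick a governor name from the list of enabled governors.
 * Order: command-line choice, then "stepwise", then the first entry.
 * Returns NULL if no governor is enabled. */
static const char *thermal_pick_default(const char *const *enabled, size_t n,
                                        const char *cmdline_choice)
{
    size_t i;

    if (n == 0)
        return NULL;

    if (cmdline_choice != NULL)
        for (i = 0; i < n; i++)
            if (strcmp(enabled[i], cmdline_choice) == 0)
                return enabled[i];

    /* Assumption: an unknown cmdline choice falls through to the default. */
    for (i = 0; i < n; i++)
        if (strcmp(enabled[i], "stepwise") == 0)
            return enabled[i];

    return enabled[0];
}
```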

In the current implementation, the thermal driver uses device-tree nodes to
probe the device driver. ACPI configuration is not part of the current
implementation. The thermal device driver defines the device-tree
configuration format based on the thermal device-tree bindings from the Linux
kernel source code.
See [0] /Documentation/devicetree/bindings/thermal for details.

Thermal governor can be configured in xen-bootargs command line by adding the
following parameter:

+-----------------------------------------------------------------------------+
thermal=xen:stepwise
+-----------------------------------------------------------------------------+

The xenhypfs utility can be used to read the current thermal state:
+-----------------------------------------------------------------------------+
>xenhypfs ls /thermal/
thermal_governor
avail_governors
trips
throttle
current_temp
>xenhypfs cat /thermal/thermal_governor
stepwise
>xenhypfs cat /thermal/avail_governors
stepwise powersave userspace
>xenhypfs cat /thermal/trips/
107(passive) 117(critical)
>xenhypfs cat /thermal/throttle
0
>xenhypfs cat /thermal/current_temp/0
85(cluster 0)
>xenhypfs cat /thermal/current_temp/1
87(cluster 1)
+-----------------------------------------------------------------------------+

Thermal governor can be changed by the following command:
+-----------------------------------------------------------------------------+
>xenhypfs write /thermal/thermal_governor powersave
+-----------------------------------------------------------------------------+

## Summary

The proposed feature will provide a smarter way to do throttling in case of a
thermal alarm in XEN.

# Appendix 1. Rcar-3 cpufreq driver
The solution for the Rcar Gen3 platform consists of the following software
components:
• ARM Trusted Firmware, which acts as the SCP module.
• XEN Hypervisor, which hosts the set of cpufreq governors.
ARM Trusted Firmware implements the SCMI protocol with SMCs as the mailbox
interface. ARM TF is capable of controlling the performance state of both the
Cortex A57 cluster and the Cortex A53 cluster.

The active governor decides which cluster should be altered and configures
performance by setting the OPP level in HW.
The HW driver accesses ATF via the SCMI protocol and sets the requested
performance level.

# Appendix 2. i.MX8 cpufreq driver
The solution for i.MX8 is similar to the one for Rcar-3 and involves the
following components:
* ARM Trusted Firmware, which provides the SCP protocol.
* XEN Hypervisor.
* SC Firmware, which alters the performance level.

The i.MX8 cpufreq driver uses the SCFW interface to access the CPU clusters:
A53 and A72. ARM TF is used to control the CPU frequency of both clusters using
SMC messages. Because of board implementation limitations, the SCFW interface
can't be used to control CPU performance, only to read the existing
performance state.

The cpufreq device driver uses device-tree bindings to receive the opp-tables
configuration. See [0] /Documentation/devicetree/bindings/opp for details.

# Appendix 3. Rcar-3 thermal driver
The solution for the Rcar Gen3 platform allows the thermal subsystem to access
the hardware and read sensor values. The driver is configured from the device
tree. The hardware is able to generate an IRQ when the critical temperature is
reached. The thermal driver handles this IRQ and sends an event to the thermal
governor.

The thermal device driver uses the rcar-gen3-thermal bindings for the
configuration.
See [0] /Documentation/devicetree/bindings/thermal/rcar-gen3-thermal.yaml for
details. The owner of the processed nodes is set to DOMID_XEN, so those nodes
shall not be passed to the guests.

# Appendix 4. i.MX8 thermal driver
The solution for the i.MX8 board allows the thermal subsystem to read thermal
sensors using the SCFW interface to access the hardware. The current
implementation follows the implementation of imx_sc_thermal in the Linux
kernel. Alarm events are sent based on polling in a separate thread (using the
timer mechanism). Polling timeouts are set in the device-tree node.

The i.MX8 thermal device driver uses the imx-thermal device-tree bindings for
the configuration.
See [0] /Documentation/devicetree/bindings/thermal/imx-thermal.yaml for
details. The owner of the processed node is set to DOMID_XEN, so it won't be
passed to the guest.

# Links
[0] https://elixir.bootlin.com/linux/latest/source

 

