Auto Scaling OpenStack Instances Using CPU Metrics with Aodh, Heat, Ceilometer, and Gnocchi

18 Nov 2024


1. Introduction

In the OpenStack Rocky release, the cpu_util meter was deprecated in Ceilometer and subsequently removed in the Stein release. As a result, we now need to use the cpu metric in Heat templates for auto scaling. The cpu metric is cumulative, increases over time, and is measured in nanoseconds (ns).

To better understand this metric, let’s start by reviewing some basic OpenStack metric commands.

2. Instance Metrics

Checking Available Metrics for an Instance

To list the metrics available for a specific instance, use the command “openstack metric resource show <instance_uuid>”. Adding -t instance selects the instance resource type, and -c metrics limits the output to the metrics column. For example, to display the metrics for the instance with UUID 2c6797eb-d9ac-4ece-8b62-2fd93713544a:

$ openstack metric resource show -c metrics -t instance 2c6797eb-d9ac-4ece-8b62-2fd93713544a
+---------+---------------------------------------------------------------------+
| Field   | Value                                                               |
+---------+---------------------------------------------------------------------+
| metrics | compute.instance.booting.time: 41c2ad16-68f0-47e6-af3e-8fc5b285b743 |
|         | cpu: 1b7755e0-5042-42ca-a04f-742a701dc30e                           |
|         | disk.ephemeral.size: 2c1341e6-e28f-47cc-af7a-17e10fadb522           |
|         | disk.root.size: c84252bb-ca9b-47fa-a46f-6d985704c2b9                |
|         | memory.resident: 2ded7cf8-cb27-481f-bcd8-2a55f7984193               |
|         | memory.swap.in: 5cd5be1f-9982-432b-8500-a1515bd7f071                |
|         | memory.swap.out: 27e9244d-c9ad-4d0d-8572-75b2f8b34520               |
|         | memory.usage: e6ea4bfb-2ebe-48a1-8fce-0114f3a7096c                  |
|         | memory: 006c8422-b5f2-4f3e-a3d7-00f7a258373e                        |
|         | vcpus: 8736272c-283d-4508-bb21-b0979418be14                         |
+---------+---------------------------------------------------------------------+

Viewing Metric Details

To view the details of the cpu metric for the instance, use: “openstack metric show --resource-id <instance_uuid> cpu”:

$ openstack metric show --resource-id 2c6797eb-d9ac-4ece-8b62-2fd93713544a cpu
+--------------------------------+-------------------------------------------------------------------+
| Field                          | Value                                                             |
+--------------------------------+-------------------------------------------------------------------+
| archive_policy/name            | ceilometer-low-rate                                               |
| creator                        | e1684780e5e543658ba24bd10bf30e23:e1ea3ce6647f4988ae264a6e81084c7a |
| id                             | 1b7755e0-5042-42ca-a04f-742a701dc30e                              |
| name                           | cpu                                                               |
| resource/created_by_project_id | e1ea3ce6647f4988ae264a6e81084c7a                                  |
| resource/created_by_user_id    | e1684780e5e543658ba24bd10bf30e23                                  |
| resource/creator               | e1684780e5e543658ba24bd10bf30e23:e1ea3ce6647f4988ae264a6e81084c7a |
| resource/ended_at              | None                                                              |
| resource/id                    | 2c6797eb-d9ac-4ece-8b62-2fd93713544a                              |
| resource/original_resource_id  | 2c6797eb-d9ac-4ece-8b62-2fd93713544a                              |
| resource/project_id            | e51c9ddf3b154e7da95ed0a8b4927601                                  |
| resource/revision_end          | None                                                              |
| resource/revision_start        | 2024-11-18T14:04:43.428610+00:00                                  |
| resource/started_at            | 2024-11-18T13:30:19.286339+00:00                                  |
| resource/type                  | instance                                                          |
| resource/user_id               | 008138dasda933e19f24c0e6aa19634c11366aa8b8as9adv830ad0ad9a2f961a  |
| unit                           | ns                                                                |
+--------------------------------+-------------------------------------------------------------------+

This output shows that the metric uses the unit “ns” (nanoseconds) and is associated with the archive policy ceilometer-low-rate.

To inspect that archive policy, we can run the command: “openstack metric archive-policy show <archive_policy_name>”:

$ openstack metric archive-policy show ceilometer-low-rate
+---------------------+------------------------------------------------------------------+
| Field               | Value                                                            |
+---------------------+------------------------------------------------------------------+
| aggregation_methods | mean, rate:mean                                                  |
| back_window         | 0                                                                |
| definition          | - timespan: 30 days, 0:00:00, granularity: 0:05:00, points: 8640 |
| name                | ceilometer-low-rate                                              |
+---------------------+------------------------------------------------------------------+

The archive policy ceilometer-low-rate offers the aggregation methods mean and rate:mean, with a granularity of 5 minutes (300 seconds). Granularity defines the time interval at which the data points are stored.
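
The points value in the definition follows directly from the timespan and the granularity. A minimal Python sketch of that arithmetic, using the values from the output above (the variable names are illustrative and not part of Gnocchi):

```python
from datetime import timedelta

# Values copied from the ceilometer-low-rate definition shown above.
timespan = timedelta(days=30)
granularity = timedelta(minutes=5)

# Number of aggregated points retained for each metric.
points = timespan // granularity
print(points)  # 8640
```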

In aggregation methods, mean calculates the average value, while rate:mean computes the average rate of change over time; that is, it reflects the differences between consecutive data points.

Querying CPU Metric Statistics

To retrieve CPU statistics for an instance, you can use either of the following commands:

$ openstack metric measures show --utc --aggregation mean -r 2c6797eb-d9ac-4ece-8b62-2fd93713544a cpu
+---------------------------+-------------+------------------+
| timestamp                 | granularity |            value |
+---------------------------+-------------+------------------+
| 2024-11-18T13:30:00+00:00 |       300.0 |    97840000000.0 |
| 2024-11-18T13:35:00+00:00 |       300.0 |   309460000000.0 |
| 2024-11-18T13:40:00+00:00 |       300.0 |   908550000000.0 |
| 2024-11-18T13:45:00+00:00 |       300.0 |  1507590000000.0 |
| 2024-11-18T13:50:00+00:00 |       300.0 |  2108360000000.0 |


$ openstack metric aggregates --resource-type=instance '(metric cpu mean)' id=2c6797eb-d9ac-4ece-8b62-2fd93713544a
+-----------------------------------------------+---------------------------+-------------+------------------+
| name                                          | timestamp                 | granularity |            value |
+-----------------------------------------------+---------------------------+-------------+------------------+
| 2c6797eb-d9ac-4ece-8b62-2fd93713544a/cpu/mean | 2024-11-18T13:30:00+00:00 |       300.0 |    97840000000.0 |
| 2c6797eb-d9ac-4ece-8b62-2fd93713544a/cpu/mean | 2024-11-18T13:35:00+00:00 |       300.0 |   309460000000.0 |
| 2c6797eb-d9ac-4ece-8b62-2fd93713544a/cpu/mean | 2024-11-18T13:40:00+00:00 |       300.0 |   908550000000.0 |
| 2c6797eb-d9ac-4ece-8b62-2fd93713544a/cpu/mean | 2024-11-18T13:45:00+00:00 |       300.0 |  1507590000000.0 |
| 2c6797eb-d9ac-4ece-8b62-2fd93713544a/cpu/mean | 2024-11-18T13:50:00+00:00 |       300.0 |  2108360000000.0 |

Both commands will produce the same results. The values represent cumulative CPU usage in nanoseconds.

You can specify the granularity using the --granularity option. Note that if you specify a granularity that is not defined in the archive policy, you’ll receive an error:

$ openstack metric aggregates --granularity 120 --resource-type=instance '(metric cpu mean)' id=2c6797eb-d9ac-4ece-8b62-2fd93713544a
{'cause': "Metrics can't being aggregated", 'reason': 'Granularities are missing', 'detail': [['cpu', 'mean', 120.0]]} (HTTP 400)

Understanding Aggregation Methods

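Aggregation methods in Gnocchi determine how raw metric data is processed and summarized over specified time intervals. For the cpu metric, the ceilometer-low-rate archive policy offers two aggregation methods:

mean: the average of the measures received in each granularity period. Because cpu is cumulative, these values keep growing.
rate:mean: the difference between consecutive measures, averaged within each granularity period.

The following commands use the rate:mean aggregation method for the instance with UUID 2c6797eb-d9ac-4ece-8b62-2fd93713544a: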

$ openstack metric measures show --utc --aggregation rate:mean -r 2c6797eb-d9ac-4ece-8b62-2fd93713544a cpu
+---------------------------+-------------+----------------+
| timestamp                 | granularity |          value |
+---------------------------+-------------+----------------+
| 2024-11-18T13:35:00+00:00 |       300.0 | 211620000000.0 |
| 2024-11-18T13:40:00+00:00 |       300.0 | 599090000000.0 |
| 2024-11-18T13:45:00+00:00 |       300.0 | 599040000000.0 |
| 2024-11-18T13:50:00+00:00 |       300.0 | 600770000000.0 |


$ openstack metric aggregates --resource-type=instance '(metric cpu rate:mean)' id=2c6797eb-d9ac-4ece-8b62-2fd93713544a
+----------------------------------------------------+---------------------------+-------------+----------------+
| name                                               | timestamp                 | granularity |          value |
+----------------------------------------------------+---------------------------+-------------+----------------+
| 2c6797eb-d9ac-4ece-8b62-2fd93713544a/cpu/rate:mean | 2024-11-18T13:35:00+00:00 |       300.0 | 211620000000.0 |
| 2c6797eb-d9ac-4ece-8b62-2fd93713544a/cpu/rate:mean | 2024-11-18T13:40:00+00:00 |       300.0 | 599090000000.0 |
| 2c6797eb-d9ac-4ece-8b62-2fd93713544a/cpu/rate:mean | 2024-11-18T13:45:00+00:00 |       300.0 | 599040000000.0 |
| 2c6797eb-d9ac-4ece-8b62-2fd93713544a/cpu/rate:mean | 2024-11-18T13:50:00+00:00 |       300.0 | 600770000000.0 |

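With rate:mean, each value is the CPU time, in nanoseconds, consumed since the previous data point. For example, the 13:35 value of 211620000000.0 is the difference between the cumulative values at 13:35 (309460000000.0) and 13:30 (97840000000.0). The first timestamp (13:30) is omitted because there’s no previous data point for comparison.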
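
To check this relationship, the short Python sketch below recomputes the differences from the cumulative mean values shown earlier. It assumes one measure per granularity period, as in this example, and illustrates the arithmetic only; it is not Gnocchi's implementation.

```python
# Cumulative cpu values (ns) from the "mean" output above, in timestamp order.
cumulative_ns = [
    97_840_000_000,
    309_460_000_000,
    908_550_000_000,
    1_507_590_000_000,
    2_108_360_000_000,
]

# Difference between each point and the previous one; the first point has
# no predecessor, so the result is one element shorter than the input.
deltas_ns = [cur - prev for prev, cur in zip(cumulative_ns, cumulative_ns[1:])]
print(deltas_ns)
# [211620000000, 599090000000, 599040000000, 600770000000]
```

The printed list matches the rate:mean values returned by Gnocchi for 13:35 through 13:50.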

3. Calculating CPU Usage Percentage

When an instance uses one vCPU at 100% utilization for the entire granularity period, it accumulates the maximum possible time of one CPU for that period. In other words, a vCPU running at 100% is busy for every nanosecond of the granularity period.

Max CPU Time per vCPU (nanoseconds) = Granularity (in seconds) * Nanoseconds per Second

An instance can consume that much CPU time on each of its vCPUs, so the rate:mean value at 100% utilization is given by the formula below. As an example, consider an instance with 1 vCPU and a granularity of 300 seconds (5 minutes).

rate:mean Value = Number of vCPUs * Granularity (seconds) * Nanoseconds per Second

Applying the values to the formula:

rate:mean Value = 1 * 300 * 1000000000 = 300,000,000,000 ns

The rate:mean value is 300,000,000,000 ns when the vCPU is fully utilized during the entire 300-second interval.

Therefore, to determine the CPU usage percentage of an instance, we can use the following formula:

CPU Usage (%) = ((value / 1,000,000,000 / granularity) * 100) / number_of_vCPUs

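Notes:

value is the rate:mean value in nanoseconds, granularity is the aggregation period in seconds, and number_of_vCPUs is the number of vCPUs of the instance.

Breaking it down:

Dividing value by 1,000,000,000 converts nanoseconds to seconds of CPU time. Dividing by granularity gives the average number of vCPUs kept busy during the period. Multiplying by 100 turns that into a percentage, and dividing by number_of_vCPUs normalizes it to the total capacity of the instance.

Alternative Formula: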

CPU Usage (%) = (value / (number_of_vCPUs * granularity * 1,000,000,000)) * 100

Suppose we have a rate:mean value of 600,000,000,000.0 ns for an instance with 2 vCPUs and a granularity of 300 seconds:

CPU Usage (%) = ((600,000,000,000.0 / 1,000,000,000 / 300) * 100) / 2
              = ((600 / 300) * 100) / 2
              = (2 * 100) / 2
              = 200 / 2
              = 100%

// Alternative Formula:
CPU Usage (%) = (600,000,000,000 / (2 * 300 * 1,000,000,000)) * 100
              = (600,000,000,000 / 600,000,000,000) * 100
              = 1 * 100
              = 100%

The instance used 100% of each vCPU over the granularity period.
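
The same calculation can be wrapped in a small helper. This is a minimal Python sketch of the alternative formula; the function name cpu_usage_percent and its parameter names are illustrative, not part of any OpenStack client.

```python
NS_PER_SECOND = 1_000_000_000


def cpu_usage_percent(value_ns: float, granularity_s: int, vcpus: int) -> float:
    """Convert a rate:mean value (ns of CPU time per period) to a percentage."""
    if granularity_s <= 0 or vcpus <= 0:
        raise ValueError("granularity_s and vcpus must be positive")
    # Maximum CPU time the instance could have consumed in one period.
    max_ns = vcpus * granularity_s * NS_PER_SECOND
    return value_ns / max_ns * 100


print(cpu_usage_percent(600_000_000_000, 300, 2))            # 100.0
print(round(cpu_usage_percent(211_620_000_000, 300, 2), 2))  # 35.27
```

The second call uses the 13:35 rate:mean value from the earlier output, again assuming 2 vCPUs.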

You can perform this calculation directly within the openstack metric aggregates command by using mathematical expressions supported by Gnocchi.

$ openstack metric aggregates --resource-type instance --sort-column timestamp \
'(/ (* (/ (/ (metric cpu rate:mean) 1000000000) 300) 100) 2)' id=2c6797eb-d9ac-4ece-8b62-2fd93713544a
+----------------------------------------------------+---------------------------+-------------+--------------------+
| name                                               | timestamp                 | granularity |              value |
+----------------------------------------------------+---------------------------+-------------+--------------------+
| 2c6797eb-d9ac-4ece-8b62-2fd93713544a/cpu/rate:mean | 2024-11-18T13:35:00+00:00 |       300.0 |              35.27 |
| 2c6797eb-d9ac-4ece-8b62-2fd93713544a/cpu/rate:mean | 2024-11-18T13:40:00+00:00 |       300.0 |  99.84833333333334 |
| 2c6797eb-d9ac-4ece-8b62-2fd93713544a/cpu/rate:mean | 2024-11-18T13:45:00+00:00 |       300.0 |  99.83999999999999 |
| 2c6797eb-d9ac-4ece-8b62-2fd93713544a/cpu/rate:mean | 2024-11-18T13:50:00+00:00 |       300.0 | 100.12833333333333 |
| 2c6797eb-d9ac-4ece-8b62-2fd93713544a/cpu/rate:mean | 2024-11-18T13:55:00+00:00 |       300.0 | 100.07333333333335 |

$ # Alternative Formula
$ openstack metric aggregates --resource-type instance --sort-column timestamp \
'(* (/ (metric cpu rate:mean) (* 2 (* 300 1000000000))) 100)' id=2c6797eb-d9ac-4ece-8b62-2fd93713544a
+----------------------------------------------------+---------------------------+-------------+--------------------+
| name                                               | timestamp                 | granularity |              value |
+----------------------------------------------------+---------------------------+-------------+--------------------+
| 2c6797eb-d9ac-4ece-8b62-2fd93713544a/cpu/rate:mean | 2024-11-18T13:35:00+00:00 |       300.0 |              35.27 |
| 2c6797eb-d9ac-4ece-8b62-2fd93713544a/cpu/rate:mean | 2024-11-18T13:40:00+00:00 |       300.0 |  99.84833333333334 |
| 2c6797eb-d9ac-4ece-8b62-2fd93713544a/cpu/rate:mean | 2024-11-18T13:45:00+00:00 |       300.0 |  99.83999999999999 |
| 2c6797eb-d9ac-4ece-8b62-2fd93713544a/cpu/rate:mean | 2024-11-18T13:50:00+00:00 |       300.0 | 100.12833333333333 |
| 2c6797eb-d9ac-4ece-8b62-2fd93713544a/cpu/rate:mean | 2024-11-18T13:55:00+00:00 |       300.0 | 100.07333333333335 |
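
The expressions above hard-code the granularity (300) and the vCPU count (2). If you query instances with different flavors, a small helper can build the expression string for you. The Python sketch below only formats the string passed to openstack metric aggregates; the function name is hypothetical.

```python
def cpu_percent_expression(granularity_s: int, vcpus: int) -> str:
    """Build the aggregates expression for a given granularity and vCPU count."""
    return (
        "(/ (* (/ (/ (metric cpu rate:mean) 1000000000) "
        f"{granularity_s}) 100) {vcpus})"
    )


print(cpu_percent_expression(300, 2))
# (/ (* (/ (/ (metric cpu rate:mean) 1000000000) 300) 100) 2)
```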

4. CPU Threshold in Heat Auto Scaling Group

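In Heat, you can configure an Aodh alarm using the resource type OS::Aodh::GnocchiAggregationByResourcesAlarm. Key configuration parameters:

metric: the metric to evaluate (cpu).
aggregation_method: the aggregation method to use (rate:mean).
granularity: the aggregation period in seconds; it must match a granularity defined in the archive policy (300).
threshold: the value the aggregated measure is compared against.
comparison_operator: the operator used for the comparison (for example, gt).
evaluation_periods: the number of periods to evaluate.
resource_type and query: select the resources whose measures are aggregated.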

To calculate the CPU threshold for a given utilization percentage, use the following formula:

Threshold = Percentage_in_Decimal × Number_of_vCPUs × 1,000,000,000 × Granularity

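Notes:

Percentage_in_Decimal is the target utilization expressed as a fraction (for example, 0.8 for 80%). Granularity is in seconds, and the resulting threshold is in nanoseconds.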

Example with 1 vCPU and a granularity of 300 seconds:

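+-----------------+---------------+----------------------------+----------------+
| Utilization (%) | Decimal value | Calculation                | Threshold (ns) |
+-----------------+---------------+----------------------------+----------------+
| 100%            | 1             | 1 * 1 * 1000000000 * 300   | 300000000000.0 |
| 90%             | 0.9           | 0.9 * 1 * 1000000000 * 300 | 270000000000.0 |
| 80%             | 0.8           | 0.8 * 1 * 1000000000 * 300 | 240000000000.0 |
| 70%             | 0.7           | 0.7 * 1 * 1000000000 * 300 | 210000000000.0 |
| 60%             | 0.6           | 0.6 * 1 * 1000000000 * 300 | 180000000000.0 |
| 50%             | 0.5           | 0.5 * 1 * 1000000000 * 300 | 150000000000.0 |
| 40%             | 0.4           | 0.4 * 1 * 1000000000 * 300 | 120000000000.0 |
| 30%             | 0.3           | 0.3 * 1 * 1000000000 * 300 |  90000000000.0 |
| 20%             | 0.2           | 0.2 * 1 * 1000000000 * 300 |  60000000000.0 |
| 10%             | 0.1           | 0.1 * 1 * 1000000000 * 300 |  30000000000.0 |
+-----------------+---------------+----------------------------+----------------+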

Example with 2 vCPUs and a granularity of 300 seconds:

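+-----------------+---------------+----------------------------+----------------+
| Utilization (%) | Decimal value | Calculation                | Threshold (ns) |
+-----------------+---------------+----------------------------+----------------+
| 100%            | 1             | 1 * 2 * 1000000000 * 300   | 600000000000.0 |
| 90%             | 0.9           | 0.9 * 2 * 1000000000 * 300 | 540000000000.0 |
| 80%             | 0.8           | 0.8 * 2 * 1000000000 * 300 | 480000000000.0 |
| 70%             | 0.7           | 0.7 * 2 * 1000000000 * 300 | 420000000000.0 |
| 60%             | 0.6           | 0.6 * 2 * 1000000000 * 300 | 360000000000.0 |
| 50%             | 0.5           | 0.5 * 2 * 1000000000 * 300 | 300000000000.0 |
| 40%             | 0.4           | 0.4 * 2 * 1000000000 * 300 | 240000000000.0 |
| 30%             | 0.3           | 0.3 * 2 * 1000000000 * 300 | 180000000000.0 |
| 20%             | 0.2           | 0.2 * 2 * 1000000000 * 300 | 120000000000.0 |
| 10%             | 0.1           | 0.1 * 2 * 1000000000 * 300 |  60000000000.0 |
+-----------------+---------------+----------------------------+----------------+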

Example with 4 vCPUs and a granularity of 300 seconds:

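+-----------------+---------------+----------------------------+-----------------+
| Utilization (%) | Decimal value | Calculation                |  Threshold (ns) |
+-----------------+---------------+----------------------------+-----------------+
| 100%            | 1             | 1 * 4 * 1000000000 * 300   | 1200000000000.0 |
| 90%             | 0.9           | 0.9 * 4 * 1000000000 * 300 | 1080000000000.0 |
| 80%             | 0.8           | 0.8 * 4 * 1000000000 * 300 |  960000000000.0 |
| 70%             | 0.7           | 0.7 * 4 * 1000000000 * 300 |  840000000000.0 |
| 60%             | 0.6           | 0.6 * 4 * 1000000000 * 300 |  720000000000.0 |
| 50%             | 0.5           | 0.5 * 4 * 1000000000 * 300 |  600000000000.0 |
| 40%             | 0.4           | 0.4 * 4 * 1000000000 * 300 |  480000000000.0 |
| 30%             | 0.3           | 0.3 * 4 * 1000000000 * 300 |  360000000000.0 |
| 20%             | 0.2           | 0.2 * 4 * 1000000000 * 300 |  240000000000.0 |
| 10%             | 0.1           | 0.1 * 4 * 1000000000 * 300 |  120000000000.0 |
+-----------------+---------------+----------------------------+-----------------+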

Note: Adjust the calculations according to the number of vCPUs and granularity specific to your environment.
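
Rather than reading values off a table, you can compute the threshold for any combination. The following is a minimal Python sketch of the threshold formula; the function name cpu_threshold_ns is illustrative, and it takes the utilization as a whole-number percentage so that the arithmetic stays exact.

```python
NS_PER_SECOND = 1_000_000_000


def cpu_threshold_ns(percent: int, vcpus: int, granularity_s: int) -> float:
    """Return the rate:mean threshold (ns) for a target utilization percentage."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range 1 to 100")
    if vcpus <= 0 or granularity_s <= 0:
        raise ValueError("vcpus and granularity_s must be positive")
    # Multiply the integers first so the single division stays exact.
    return percent * vcpus * granularity_s * NS_PER_SECOND / 100


for vcpus in (1, 2, 4):
    print(vcpus, cpu_threshold_ns(80, vcpus, 300))
# 1 240000000000.0
# 2 480000000000.0
# 4 960000000000.0
```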

Sample Heat Template Snippet

resources:
  asg:
    type: OS::Heat::AutoScalingGroup
    properties:
      desired_capacity: 1
      max_size: 3
      min_size: 1
      resource:
        type: template_server.yaml
        properties:
          flavor: {get_param: flavor}
          image: {get_param: image}
          key_name: {get_param: key_name}
          subnet: {get_param: subnet}
          metadata: {"metering.server_group": {get_param: "OS::stack_id"},
                     "stack_id": {get_param: "OS::stack_id"},
                     "stack_name": {get_param: "OS::stack_name"}}

  cpu_utilization_alarm_high:
    type: OS::Aodh::GnocchiAggregationByResourcesAlarm
    properties:
      description: Scale up if CPU > 80%
      metric: cpu
      aggregation_method: rate:mean
      threshold: 480000000000.0  # Calculated threshold for 80% on 2 vCPUs
      granularity: 300
      evaluation_periods: 2
      resource_type: instance
      comparison_operator: gt
      alarm_actions:
        - str_replace:
            template: trust+url
            params:
              url: {get_attr: [scale_up_policy, signal_url]}
      query:
        str_replace:
          template: '{"=": {"server_group": "stack_id"}}'
          params:
            stack_id: {get_param: "OS::stack_id"}
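
To build intuition for how threshold, comparison_operator: gt, and evaluation_periods: 2 interact, here is a simplified Python model. It assumes the alarm fires only when the values of the last evaluation_periods periods are all above the threshold. It is a sketch of that rule, not Aodh's evaluator, which also handles cases such as insufficient data; the function name is hypothetical.

```python
def breaches_threshold(values_ns, threshold_ns, evaluation_periods):
    """Return True if the last `evaluation_periods` values all exceed the threshold."""
    if len(values_ns) < evaluation_periods:
        return False  # not enough data points to decide yet
    recent = values_ns[-evaluation_periods:]
    return all(value > threshold_ns for value in recent)


# rate:mean values (ns) from the earlier example output for a 2 vCPU instance.
samples = [211_620_000_000, 599_090_000_000, 599_040_000_000]
threshold = 480_000_000_000.0  # 80% on 2 vCPUs with a 300 s granularity

print(breaches_threshold(samples[:2], threshold, 2))  # False
print(breaches_threshold(samples, threshold, 2))      # True
```

With only the first two samples, the 13:35 value is below the threshold, so the condition is not met. Once the 13:45 value arrives, the last two periods are both above 80%.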

The full Heat template is available at https://github.com/thobiast/openstack-heat-autoscale-samples.

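The Heat template sets the “metering.server_group” metadata on each instance to the stack ID. You can use the resulting server_group attribute to query only the instances that belong to the stack: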

$ # Raw rate:mean values for each instance in the group
$ openstack metric aggregates --resource-type instance --sort-column timestamp \
'(metric cpu rate:mean)' server_group=<stack_id>
$ # CPU usage percentage for each instance in the group
$ openstack metric aggregates --resource-type instance --sort-column timestamp \
'(/ (* (/ (/ (metric cpu rate:mean) 1000000000) <granularity>) 100) <Number_of_vCPUs>)' \
server_group=<stack_id>
$ # Mean CPU usage percentage across all instances in the group
$ openstack metric aggregates --resource-type instance --sort-column timestamp \
'(aggregate mean (/ (* (/ (/ (metric cpu rate:mean) 1000000000) <granularity>) 100) <Number_of_vCPUs>))' \
server_group=<stack_id>

Note: The most recent entry may not be accurate if metrics for all instances are not yet available in Gnocchi.
