In the world of GPU-accelerated workloads, efficient resource management and scaling are crucial for both performance and cost efficiency. To address this need, the integration of NVIDIA DCGM Exporter with KEDA (Kubernetes Event-Driven Autoscaling) has emerged as a powerful solution. In this blog post, we will explore what NVIDIA DCGM Exporter and KEDA are, and delve into the benefits of integrating them.
ℹ️ Prerequisite Knowledge
This blog post assumes that readers have a foundational understanding of Kubernetes, Helm, Prometheus, and Grafana. Familiarity with these technologies is essential for following the integration process and implementing the steps provided.
NVIDIA DCGM Exporter
NVIDIA Data Center GPU Manager (DCGM) Exporter is a component developed by NVIDIA that enables the monitoring and export of metrics related to GPU utilization and performance. It provides valuable insights into GPU metrics such as memory utilization, temperature, power usage, and more. DCGM Exporter collects these metrics from NVIDIA GPUs and exposes them in a format compatible with monitoring systems like Prometheus.
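To get a sense of the exposed format, you can query a running exporter's metrics endpoint directly once it is deployed. The port (9400) is the exporter's documented default; the pod name and namespace below are placeholders for your own deployment:

```shell
# Port-forward a dcgm-exporter pod (pod name and namespace are placeholders).
kubectl -n monitoring port-forward pod/dcgm-exporter-abc123 9400:9400 &

# Fetch the raw Prometheus-format metrics and filter for GPU utilization.
curl -s http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
```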
KEDA
KEDA is an open-source project, a CNCF project, that aims to simplify the autoscaling of Kubernetes workloads based on various external event sources. It enables developers to scale their applications dynamically based on metrics provided by event sources, such as message queues, HTTP requests, or custom metrics. KEDA acts as a bridge between Kubernetes and external event sources, allowing automatic scaling of resources in response to changes in workload demands.
Integration Benefits
The integration of NVIDIA DCGM Exporter with KEDA brings several advantages for GPU-accelerated workloads. KEDA can consume the exported GPU metrics from DCGM Exporter, through Prometheus, and trigger scaling events accordingly. This enables efficient resource allocation and ensures that GPU-accelerated applications can dynamically adapt to changing workload demands while maintaining optimized costs.
Setup
Now that we understand the objectives and benefits of integrating NVIDIA DCGM Exporter with KEDA, let’s proceed with the setup process:
Setup KEDA
Begin by setting up KEDA in your Kubernetes cluster. Rather than repeating the steps here, refer to the official KEDA documentation for detailed installation and configuration instructions. The Helm chart makes this straightforward.
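For reference, a typical Helm installation looks like this (chart repository URL and namespace follow KEDA's documented defaults; adjust to your environment):

```shell
# Add the official KEDA chart repository and refresh the index.
helm repo add kedacore https://kedacore.github.io/charts
helm repo update

# Install KEDA into its own namespace.
helm install keda kedacore/keda --namespace keda --create-namespace
```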
Setup NVIDIA DCGM Exporter
Next, install and configure NVIDIA DCGM Exporter on your Kubernetes cluster. As with KEDA, refer to the official documentation for detailed instructions.
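A Helm-based installation typically looks like the following; the chart repository URL is taken from the dcgm-exporter project's README, and the namespace is an assumption, so verify both against the current docs:

```shell
# Add the NVIDIA dcgm-exporter chart repository and refresh the index.
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update

# Deploys dcgm-exporter (runs as a DaemonSet on GPU nodes).
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter --namespace monitoring --create-namespace
```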
Viewing Metrics on Grafana with Prometheus
To visualize the exported metrics, integrate Prometheus with DCGM Exporter. Configure Prometheus to scrape metrics from DCGM Exporter, and then set up Grafana as a visualization tool to create dashboards and charts based on the collected metrics. The DCGM Exporter maintainers have published a useful Grafana dashboard you can use.
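As an illustration, a static scrape job for the exporter could look like the sketch below. The service name, namespace, and port are assumptions based on the install above; if you run the Prometheus Operator, you would use a ServiceMonitor instead:

```yaml
# Example Prometheus scrape job (sketch; match names to your install).
scrape_configs:
  - job_name: dcgm-exporter
    scrape_interval: 15s
    static_configs:
      - targets: ['dcgm-exporter.monitoring.svc.cluster.local:9400']
```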
Creating Autoscale Metric and KEDA ScaledObject
Once the metrics are visible in Grafana, you can define autoscaling rules based on the desired metric, such as GPU utilization. Create a KEDA ScaledObject, specifying the scaling rules and the metric source to be used (e.g., Prometheus). KEDA will continuously monitor the specified metric and trigger scaling events based on the defined rules.
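Under the hood, KEDA feeds the metric to a Horizontal Pod Autoscaler, which (roughly speaking) divides the observed metric value by the trigger threshold, rounds up, and clamps the result between the replica bounds. A minimal sketch of that arithmetic, with illustrative numbers rather than real cluster data:

```python
import math

def desired_replicas(metric_value: float, threshold: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Approximate the HPA scaling rule that KEDA drives:
    ceil(metric / threshold), clamped to [min, max]."""
    target = math.ceil(metric_value / threshold)
    return max(min_replicas, min(max_replicas, target))

# With a threshold of 60: an aggregate utilization of 250 asks for
# ceil(250 / 60) = 5 replicas, within the configured bounds.
print(desired_replicas(250, 60, 1, 20))   # -> 5
print(desired_replicas(10, 60, 1, 20))    # -> 1 (min floor)
print(desired_replicas(5000, 60, 1, 20))  # -> 20 (max ceiling)
```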
In the following example I created a `ScaledObject.yaml` for a deployment named `my-app`.
Let’s review the scaling specs first:
- `minReplicaCount`: The minimum number of replicas, to ensure availability.
- `maxReplicaCount`: The maximum number of replicas, to restrict excessive scaling.
- `pollingInterval`: The interval in seconds at which Prometheus queries are made to collect metrics.
- `cooldownPeriod`: The time in seconds to wait before scaling down when the threshold is not met.
As for the query, I used the `DCGM_FI_DEV_GPU_UTIL` metric exported by DCGM Exporter to scale based on a threshold of 60% GPU utilization. To calculate the GPU utilization, I utilized the `sum()` and `rate()` functions, which aggregate the utilization data over a 2-minute window across all containers named `my-app`.
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-app-scaledobject
  namespace: my-app
spec:
  minReplicaCount: 1
  maxReplicaCount: 20
  pollingInterval: 60
  cooldownPeriod: 300
  scaleTargetRef:
    name: my-app
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
        query: sum(rate(DCGM_FI_DEV_GPU_UTIL{exported_container=~"my-app"}[2m]))*100
        threshold: "60"
```
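Once the manifest is saved, apply it and confirm that KEDA created the backing autoscaler (resource names here follow the example above):

```shell
kubectl apply -f ScaledObject.yaml
kubectl get scaledobject -n my-app

# KEDA materializes the ScaledObject as a Horizontal Pod Autoscaler:
kubectl get hpa -n my-app
```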
Conclusion
The integration of NVIDIA DCGM Exporter with KEDA offers a powerful solution for autoscaling GPU-accelerated workloads based on GPU metrics such as utilization. By connecting these components, you can achieve dynamic resource allocation and ensure optimal performance and cost for your GPU-accelerated applications. Follow the steps outlined in this blog post to set up KEDA and DCGM Exporter, and leverage the capabilities of autoscaling based on GPU metrics.