Observability: DevOps meets green IT
At the Atos Tech for Climate summit during the 2022 global climate congress in Sharm el Sheikh, an emerging IT concept was presented and discussed: observability. Although originally designed to facilitate continuous improvement in DevOps, observability holds great promise for reducing IT energy consumption and thereby the emission of greenhouse gases in data centers and at the edge.
What is observability?
In control theory, the term “observability” describes the ability to draw conclusions about the internal state of a system from the relationship between its inputs and outputs.
The underlying assumption is that a system’s behavior is repeatable – when it is subjected to the same input stimuli again, it will assume the same or a similar internal state again and produce similar outputs.
A digital twin can therefore be used to emulate a system’s state and find or test ways to optimize its operation through experiments.
In IT, “system” refers to a complete solution stack designed for a given purpose, comprising application software, an operating platform and the hardware underneath. Each of these components can be optimized for performance, reliability or – increasingly – energy consumption, which makes observability more significant than monitoring alone.
Monitoring is the passive gathering of information, writes Stephen J. Bigelow in “What is observability? A beginner’s guide”. Monitoring produces some of the outputs used by observability tools to determine the state of a system and suggest optimization measures. Another difference is that monitoring focuses on the hardware and operating platform layers of the solution stack – even in disciplines like application performance management (APM), which seek to transcend the infrastructure-application boundary.
Observability and continuous software improvement
In DevOps, observability is key to debugging and improving software applications while they are in production. Even when developed the traditional way, with a lengthy testing phase before productive use, applications are rarely error-free. Therefore, in-production debugging is essential.
The observability of modern, containerized cloud-native applications is based on three pillars:
- Aggregatable metrics: These are obtained through traditional system monitoring. APM can strongly contribute to this process. Infrastructure performance and other hardware-level metrics also play a key role when observability is utilized for energy savings.
- Logging: This is the accumulation of evidence and details related to discrete events. Evaluation of IT logs – where events should not be limited to errors and warnings – is the twin sister of monitoring; it often serves the double purpose of optimization and operational security and may employ AI/ML (artificial intelligence / machine learning) to do so. Atos AIsaac is an example of an AI/ML-based operational security system which, among many other things, ingests logs.
- Tracing: This is a discipline specific to service or microservice architectures, such as cloud-native software. Tracing elucidates how a request to the application travels through and is handled by the cloud-native mesh of microservices. These meshes may use load balancing and dynamically load and scale, so request traces have limited repeatability.
Figure 1 illustrates these three pillars graphically.
Figure 1: The three pillars of observability
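To make the three pillars more concrete, the following minimal Python sketch records all three signals for a single request. It uses only the standard library and does not represent any particular observability product; the service names ("cart", "billing"), the metric name and the helper functions are hypothetical and chosen for illustration only.

```python
import json
import time
import uuid
from collections import Counter

metrics = Counter()   # pillar 1: aggregatable counters
log_lines = []        # pillar 2: discrete, structured events
spans = []            # pillar 3: timed steps that share one trace ID


def handle_step(trace_id, service, work):
    """Run one service step and record a metric, a log event and a span."""
    start = time.perf_counter()
    result = work()
    duration_ms = (time.perf_counter() - start) * 1000.0
    metrics[f"requests_total{{service={service}}}"] += 1
    log_lines.append(json.dumps(
        {"trace_id": trace_id, "service": service, "event": "handled"}))
    spans.append(
        {"trace_id": trace_id, "service": service, "duration_ms": duration_ms})
    return result


def handle_request(items):
    """Pass one request through two hypothetical services under one trace ID."""
    trace_id = uuid.uuid4().hex
    total = handle_step(trace_id, "cart", lambda: sum(items))
    handle_step(trace_id, "billing", lambda: round(total * 1.2, 2))
    return trace_id
```

The metric can be summed across many requests, each log line describes one event, and the shared trace ID is what allows the two spans to be stitched back into the path of a single request.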
A cloud-native Kubernetes platform like Red Hat OpenShift – the foundation of Atos AMOS (Atos Managed OpenShift) – typically utilizes three PaaS (platform as a service) components to achieve observability:
- Istio: A service that controls load balancing within a Kubernetes cluster and steers the flow of requests to and through the application’s distributed microservice mesh. While it does not provide observability by itself, it integrates with Kiali and Jaeger to make service mesh operations observable. Its functionality has been further enhanced in Red Hat OpenShift Service Mesh, giving OpenShift an advantage over other Kubernetes PaaS distributions.
- Kiali: A console that visualizes the mesh topology and the flow of requests through it – essential for understanding how microservices interact.
- Jaeger: A distributed tracing service that records request traces and makes the recordings available to Istio and Kiali for subsequent optimization.
Observability and IT sustainability
At the Atos Tech for Climate summit 2022 in Sharm el Sheikh, Vincent Caldeira – Field CTO APAC, Red Hat – delivered a presentation on how to enhance the sustainability of cloud-native application design through observability.
Observability has the potential to help reduce the energy consumption of IT systems and thereby contribute to scope 2 of green IT, which covers indirect emissions from purchased energy. (Scope 1 is the reduction of direct greenhouse gas emissions – which IT systems rarely have – and scope 3 aims at reducing other indirect emissions, such as those in the supply chain.)
In IT sustainability, observability serves several purposes:
1. It ensures proper utilization of hardware resources.
Cloud-native software, in particular, can be designed to scale almost without limits by varying the number of parallel instances of a particular service in the service mesh. In Kubernetes clusters that are underutilized, service instances can be concentrated on a subset of cluster nodes and the remaining nodes may be hibernated or shut down.
A community project named KEDA (Kubernetes Event-Driven Autoscaling) has been launched by founding partners Microsoft, Red Hat and SCRM to ease application autoscaling.
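The consolidation idea described above can be illustrated with a small bin-packing sketch. It is a simplified model under stated assumptions, not how the Kubernetes scheduler or KEDA work: all nodes are assumed to be the same size, only CPU requests (in millicores) are considered, and the pod names and figures are invented for the example. The heuristic used is first-fit decreasing.

```python
def consolidate(pod_cpu, node_capacity):
    """Pack pod CPU requests onto few equal-sized nodes (first-fit decreasing).

    pod_cpu maps a pod name to its CPU request in millicores; node_capacity is
    the allocatable CPU per node. Returns one list of pod names per node used.
    """
    free = []        # remaining capacity of each node in use
    placement = []   # pod names placed on each node in use
    ordered = sorted(pod_cpu.items(), key=lambda kv: kv[1], reverse=True)
    for pod, cpu in ordered:
        if cpu > node_capacity:
            raise ValueError(f"{pod} does not fit on any node")
        for i, remaining in enumerate(free):
            if cpu <= remaining:          # first node with enough room
                free[i] -= cpu
                placement[i].append(pod)
                break
        else:                             # no node had room: open a new one
            free.append(node_capacity - cpu)
            placement.append([pod])
    return placement


# Invented example values: five pods on a four-node cluster.
pods = {"api-1": 500, "api-2": 500, "worker-1": 1500,
        "cache-1": 250, "cache-2": 250}
nodes_in_cluster = 4
used = consolidate(pods, node_capacity=2000)
idle = nodes_in_cluster - len(used)
print(f"{len(used)} of {nodes_in_cluster} nodes needed; {idle} could be drained")
```

In a real cluster, memory, affinity rules and disruption budgets would constrain the packing as well, so the result should be read as an upper bound on what could be hibernated.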
2. It ensures the right mapping between workload type and hardware resources.
Not all hardware of a given type is created equal. During the Tech for Climate summit, several hardware vendors presented energy-saving server hardware designed for diverse purposes and highlighted differences in energy consumption between different types of flash storage.
Applications can also be designed such that computationally intensive special tasks can be offloaded to GPUs (graphics processing units) through Kubernetes GPU operators.
Modern CPUs can adapt to workload requirements through DVFS (dynamic voltage and frequency scaling) and P-States/C-States – state-of-the-art for today’s portable devices. Starting with its Sandy Bridge line of processors, Intel introduced RAPL (running average power limit) to report accumulated energy consumption metrics across various power domains.
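As an illustration of how accumulated RAPL counters turn into a power figure, the sketch below reads the energy counter twice through the Linux powercap interface and divides the difference by the elapsed time. The sysfs path for the first CPU package is an assumption about the host (it requires Linux on a RAPL-capable CPU and usually elevated privileges); the function names are hypothetical.

```python
import time
from pathlib import Path

# Assumed location of the package-0 RAPL domain under Linux powercap.
RAPL_DOMAIN = Path("/sys/class/powercap/intel-rapl:0")


def energy_delta_uj(before_uj, after_uj, max_range_uj):
    """Microjoules used between two readings, allowing for one counter wrap."""
    if after_uj >= before_uj:
        return after_uj - before_uj
    # The counter wrapped around after reaching its maximum range.
    return max_range_uj - before_uj + after_uj


def average_power_watts(before_uj, after_uj, max_range_uj, seconds):
    """Average power over the interval: joules divided by seconds."""
    joules = energy_delta_uj(before_uj, after_uj, max_range_uj) / 1_000_000
    return joules / seconds


def sample_package_power(interval_s=1.0, domain=RAPL_DOMAIN):
    """Read the counter twice; return average watts, or None if unreadable."""
    try:
        max_range = int((domain / "max_energy_range_uj").read_text())
        before = int((domain / "energy_uj").read_text())
        time.sleep(interval_s)
        after = int((domain / "energy_uj").read_text())
    except (OSError, ValueError):
        return None  # no RAPL support, or no permission to read the counter
    return average_power_watts(before, after, max_range, interval_s)
```

Because the counter is cumulative, the sampling interval determines the resolution: short intervals show load spikes, long intervals yield the averages that traditional monitoring reports.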
3. It measures progress.
No matter what optimization is applied, be it dynamic load rebalancing or improvements to software design, evidence of the results must be provided, both as control-loop feedback to the optimization process and as reporting to the organization’s sustainability program. Business processes may be subject to DLAs (decarbonization level agreements) over time, and progress reporting is essential to provide evidence of the program’s success.
One challenge with gathering hardware power consumption metrics is that traditionally, only average figures can be obtained, and correlating the hardware energy metrics with cloud-native application performance is difficult. In her article “How to approach sustainability in IT operations”, TechTarget author Emily Foster describes the work of Red Hat’s Huamin Chen and IBM’s Chen Wang supporting these efforts with KEPLER (Kubernetes-based efficient power level exporter).
KEPLER is a tool for capturing CPU, GPU and RAM metrics that can be processed with Prometheus and other compatible tooling. The strength of KEPLER is its holistic nature. Although, in essence, it is a hardware monitoring tool, KEPLER collects metrics for the CPU (through RAPL), GPU (through NVML, the NVIDIA Management Library) and other resources, and produces aggregate metrics which can be collated with Kiali/Jaeger traces to optimize cloud-native applications for energy efficiency.
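The following sketch shows one way such energy metrics could be related to application activity. It parses a few lines in the Prometheus text exposition format, sums energy per namespace and divides by a request count. The metric name kepler_container_joules_total and its labels are assumptions about what a KEPLER deployment exports, the sample values are invented, and the request counts stand in for figures that would come from service mesh telemetry.

```python
import re
from collections import defaultdict

# Invented sample in Prometheus text format; names and values are assumptions.
SAMPLE = """
# TYPE kepler_container_joules_total counter
kepler_container_joules_total{container_namespace="shop",pod_name="cart-1"} 1200
kepler_container_joules_total{container_namespace="shop",pod_name="billing-1"} 800
kepler_container_joules_total{container_namespace="batch",pod_name="report-1"} 500
"""

LINE = re.compile(r'^(\w+)\{([^}]*)\}\s+(\S+)$')   # name{labels} value
LABEL = re.compile(r'(\w+)="([^"]*)"')             # key="value" pairs


def joules_by_label(text, metric, label):
    """Sum the samples of one metric, grouped by the value of one label."""
    totals = defaultdict(float)
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match and match.group(1) == metric:
            labels = dict(LABEL.findall(match.group(2)))
            totals[labels.get(label, "")] += float(match.group(3))
    return dict(totals)


def joules_per_request(joules, requests):
    """Energy per request for every group that served at least one request."""
    return {k: joules[k] / requests[k] for k in joules if requests.get(k)}


energy = joules_by_label(SAMPLE, "kepler_container_joules_total",
                         "container_namespace")
# Hypothetical request counts for the same period, e.g. from mesh telemetry.
efficiency = joules_per_request(energy, {"shop": 400, "batch": 0})
```

A figure such as joules per request gives the optimization loop a single number to track across releases, which is easier to report against a sustainability target than raw power readings.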
We can expect that future versions of Red Hat OpenShift, Atos AMOS and other leading Kubernetes PaaS/CaaS will incorporate powerful tooling to make cloud-native application development and operation increasingly sustainable, to support the IT industry’s accelerating journey towards net-zero.