My dear reader, how are you? السلام عليكم (peace be upon you)

The only true wisdom is in knowing you know nothing – Socrates 

This post presents a solution concept for an Intelligent Resource Management Engine (iRME). iRME aims to provide dynamic resource allocation in modern cloud-based data centers.


My reader, I suggest you first read an earlier post on resource management in cloud data centers: DirectMe.

Terminology

Let us first list and explain the terminology we will be using for the proposed solution. 

General terminologies:

  • MS: Microservices – MS is a software development technique (popular in cloud-based applications) that arranges a (cloud) application as a collection of loosely coupled services. In a microservices architecture, services are fine-grained and the protocols are lightweight.
  • SG: Server Group – We define an SG as a collection of microservices representing particular features of a cloud application.
  • WL: Workload – A WL is defined as an SG or a collection of SGs executing on one virtual machine (VM).
  • Host: Physical Machine – We call a physical machine (typically a blade server in a cloud data center) a host.
  • Node: Virtual Machine – We call a virtual machine a node. A host may run a single node or a collection of nodes.

iRME specific terminologies:

  • CR: Current Resource – The host and node resource allocation for the WL executing on a node at a particular instance.
  • SR: Spare Resource – The host and node resources that are not in use at an instance.
  • CI: Compute Intensive – The host or node executing a WL that is utilizing more CPU resources is termed as CI.
  • MI: Memory Intensive – The host or node executing a WL that is utilizing more memory resources is termed as MI.
  • NI: Not Intensive – If the CPU and memory consumption of a host or node does not exceed a specific limit, it is labeled as NI.
  • RR: Required Resource – The resource shortage on a host or node executing WL(s).
  • UR: Updated Resource – If the resource specifications are changed because of underload or overload conditions, we save the new resource allocations as UR.
  • MSH: Most Suitable Host – The best host that can support a particular node according to the WL intensity.
  • CS: Computational Stress – A metric used to label a node as CI, MI, or NI. It is defined as the number of floating-point operations divided by the number of memory accesses on a given node. If the CS for a node executing a WL is greater than 10, we label the node as CI. If CS is less than 0, we label the node as MI. If CS is ~=5, we label it NI.
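
The CS labeling rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and threshold parameters are hypothetical, and any CS value that matches neither the CI nor the MI rule is treated as NI (the definition above only gives ~=5 for NI). With the MI threshold of 0 quoted above, a ratio of non-negative counts never falls below it, so the threshold is left as a parameter.

```python
def computational_stress(flops: float, memory_accesses: float) -> float:
    """CS = floating-point operations / memory accesses for one node."""
    if memory_accesses <= 0:
        raise ValueError("memory_accesses must be positive")
    return flops / memory_accesses


def classify_node(cs: float,
                  ci_threshold: float = 10.0,
                  mi_threshold: float = 0.0) -> str:
    """Label a node as CI, MI, or NI from its CS value.

    Defaults follow the thresholds in the terminology list.
    Assumption: everything between the two thresholds is NI.
    """
    if cs > ci_threshold:
        return "CI"
    if cs < mi_threshold:
        return "MI"
    return "NI"


# Example: 1200 floating-point operations over 100 memory accesses
label = classify_node(computational_stress(1200, 100))
```

Keeping the thresholds as parameters makes it easy to tune them once real measurements are available.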

Scenario Illustration

Figure 1 represents a typical workload executing on a node. A WL is a collection of SGs for an application feature, and each SG is in turn a collection of MSs.

Fig 1: A typical representation of a WL.

Figure 2 represents the number of nodes (VMs) running on blade servers (or hosts) in cloud data centers.

Fig 2: A typical representation of a host executing a number of nodes.

Figure 3 illustrates the concept of live migration of nodes across multiple hosts, depending on the resource requirements and computational stress. Live migration is supported by almost all modern virtual machine platforms.

Fig 3: Live migration of nodes.

iRME Design Objectives

While designing the solution concept for iRME, I had the following two objectives in mind.

  • Scalability
  • Fault Tolerance

Characteristics of Proposed Solution

  • iRME has a 2-tier software architecture
  • There are two Resource Controllers. Both have different functionalities based on the domains they cover/manage.
    • NativeController (Coverage: Node)
    • CosmoController (Coverage: Hosts)

Fig 4: Architecture for iRME.

Key Sub-problems Addressed by iRME Controllers

  1. Profile the resources (CPU, memory, and disks) of nodes executing SGs for a defined time period to understand resource utilization (NativeController)
  2. Detect overloaded or underloaded nodes and resize them accordingly (NativeController)
  3. Select nodes to migrate among the hosts (CosmoController)
  4. Place nodes on MSH (CosmoController)

iRME Stages: How Does It Work?

Initial Steps

  • Allocate node resources (or CR) and assign them the SGs based on:
    1. Test Results, functional points provided by the service, and characteristics of the service
    2. Host characteristics (Sum of node resources < host resources)
    3. MS from SG are deployed based on anti-affinity rules
  • Calculate the available resources (or SR) for the hosts
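
As a rough sketch of the initial bookkeeping, the snippet below checks the host constraint (sum of node resources < host resources) and derives SR per resource. The function name, the dictionary layout, and the units in the example are assumptions made for illustration only.

```python
RESOURCES = ("cpu", "memory", "disk")


def spare_resources(host_capacity: dict, node_allocations: list) -> dict:
    """Return SR per resource: host capacity minus the sum of node CRs.

    Raises ValueError if the nodes do not fit strictly inside the host,
    mirroring the rule 'sum of node resources < host resources'.
    """
    spare = {}
    for resource in RESOURCES:
        allocated = sum(node[resource] for node in node_allocations)
        if allocated >= host_capacity[resource]:
            raise ValueError(
                f"{resource}: node allocations ({allocated}) must stay "
                f"below host capacity ({host_capacity[resource]})"
            )
        spare[resource] = host_capacity[resource] - allocated
    return spare


# Hypothetical host with two nodes (cores, GB of memory, GB of disk)
host = {"cpu": 32, "memory": 128, "disk": 1000}
nodes = [
    {"cpu": 8, "memory": 32, "disk": 200},
    {"cpu": 4, "memory": 16, "disk": 100},
]
sr = spare_resources(host, nodes)
```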

NativeController

Function: Resize the nodes based on utilizations and send the information to CosmoController

Input:

  1. CR
  2. SR

Steps

  • Measure node utilization (CPU, memory, and disk) at 1-minute granularity.
  • Record FLOPS and memory accesses and calculate Computational Stress (CS).
  • Based on Computational Stress (CS) calculated using FLOPS/Memory Accesses, label the nodes as:
    • Compute Intensive (CI)
    • Memory Intensive (MI)
    • Not Intensive (NI)
  • Record the number of active MSs from each SG running on a node.
  • Based on utilization patterns for each 30-minute slot, resize the nodes using the following rules:
    • Check: If resource (CPU, memory, disk) utilization in a slot is high (over 90% for 20% of data points or over 95% for 10% of data points).
      • Calculate 10% of the current resource allocations (i.e., RR) and check permissions.
        1. If RR < SR
          • Allocate RR for the node, (say UR).
          • Recalculate SR and send UR information for the particular node to CosmoController.
          • Update CR with UR.
        2. If RR > SR
          • Send RR information to CosmoController to take necessary actions.
        3. Send the node resource allocations to CosmoController.
    • Check: If resource (CPU, memory, disk) utilization in a slot is low
      1. If less than 60% for 95% of data points
        • Reduce CR by 20%
      2. If less than 40% for 95% of data points
        • Reduce CR by 40%
      3. Recalculate SR and send node resource allocations to CosmoController.
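
The resize rules above can be sketched for a single resource as follows. This is an illustration, not a definitive implementation: the function names are hypothetical, "for 20% of data points" is read as "at least 20% of the samples in the slot", the 40% rule is checked before the 60% rule because it is the stricter of the two, and freed resources are assumed to return to SR.

```python
def count_over(samples, limit):
    """Number of utilization samples (in percent) above the limit."""
    return sum(1 for u in samples if u > limit)


def count_below(samples, limit):
    """Number of utilization samples (in percent) below the limit."""
    return sum(1 for u in samples if u < limit)


def resize(cr, sr, samples):
    """Apply the slot rules to one resource of one node.

    cr: current allocation, sr: spare resource on the host,
    samples: utilization percentages for the slot (e.g. 30 one-minute points).
    Returns (new_cr, new_sr, rr_to_escalate); rr_to_escalate is non-zero
    only when the shortage must be passed on to CosmoController.
    """
    n = len(samples)
    if n == 0:
        raise ValueError("at least one utilization sample is required")

    # High utilization: over 90% for 20% of points, or over 95% for 10%.
    overloaded = (count_over(samples, 90) * 100 >= 20 * n
                  or count_over(samples, 95) * 100 >= 10 * n)
    if overloaded:
        rr = cr * 10 / 100            # RR is 10% of the current allocation
        if rr < sr:
            return cr + rr, sr - rr, 0.0   # grant locally; new CR is the UR
        return cr, sr, rr                  # not enough SR: escalate RR

    # Low utilization: check the stricter 40% rule first.
    if count_below(samples, 40) * 100 >= 95 * n:
        freed = cr * 40 / 100
    elif count_below(samples, 60) * 100 >= 95 * n:
        freed = cr * 20 / 100
    else:
        return cr, sr, 0.0                 # utilization is in the normal band
    return cr - freed, sr + freed, 0.0


# Six of thirty samples above 90%: the node grows by 10% from local SR.
grown = resize(100, 50, [92] * 6 + [50] * 24)
```

In every case the resulting allocation would then be reported to CosmoController, as described in the steps above.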

CosmoController

Function: Receive information from NativeController and perform node migrations

Input:

  1. RR information for a node if greater than SR on a host (optional).
  2. UR information for a node corresponding to the host.

Steps

  • Compile CR, RR (optional) for all nodes deployed on each host and SR on each host.
  • For a node with RR
    1. Find Most Suitable Host (MSH).
      • For all hosts with SR greater than RR of the node
        • Calculate SR-RR for all hosts.
      • MSH is the one with minimum SR-RR.
    2. Migrate the node to MSH.
    3. Remove RR information for the node and update the node CR and the SR of both the source and destination hosts.
  • Migrate the nodes with minimum CR to the hosts with maximum SRs.
  • Shut down any host with no nodes (optional).
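
The MSH selection step can be sketched as a best-fit search. The function name and the host identifiers are hypothetical, and a single scalar stands in for the resource in question. The sketch follows the steps above literally: it compares each host's SR against the node's RR only, and among the qualifying hosts it picks the one with the smallest SR-RR.

```python
def most_suitable_host(rr, host_spare):
    """Return the host whose SR exceeds RR by the smallest margin.

    rr: required resource of the node to migrate.
    host_spare: mapping of host id -> SR on that host.
    Returns None when no host has SR greater than RR.
    """
    # Keep only hosts with SR > RR and remember their leftover SR - RR.
    leftover = {host: sr - rr for host, sr in host_spare.items() if sr > rr}
    if not leftover:
        return None
    # Best fit: the minimum leftover wastes the least spare capacity.
    return min(leftover, key=leftover.get)


# Hypothetical hosts and their spare resources
msh = most_suitable_host(4, {"host-a": 10, "host-b": 5, "host-c": 3})
```

Choosing the tightest fit keeps larger blocks of SR free on other hosts for nodes with bigger shortages.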

–END


I hope you find this post useful. If you find any errors or feel any need for improvement, let me know in your comments below.

Signing off for today. Stay tuned and I will see you in my next post! Happy learning.
