Learning to Control the Cloud

Abstract: With the growth of the cloud industry in recent years, the energy consumption of the underlying infrastructure has become a major concern. Energy-efficient resource management and control in the cloud is one part of the solution; the other is reducing the energy consumption of the hardware itself. Resource management in the cloud is typically done with relatively simple methods, using either local controllers or human operators, but as the complexity of the system increases, so does the need for more intelligent and automated controllers. The cloud is a complex environment in which many individual consumers share large pools of resources, scaling and moving their applications to satisfy their own objectives and requirements, while the cloud provider manages the underlying infrastructure to make efficient use of the hardware. This creates a dynamic environment with a highly variable load, and maintaining efficient resource usage while keeping the quality of service at an acceptable level is a difficult task under such unpredictable conditions. Both the consumers scaling their resources and the providers managing their infrastructure could benefit from intelligent automation. Control strategies that take a larger context into account could allow for more informed decisions, and thus better control. A larger context, however, makes the problem space more complex, and manually designing a controller becomes increasingly difficult. With the abundance of data available in many cloud systems, a data-driven approach is a natural choice. Reinforcement learning is a type of machine learning well suited to sequential decision making over time, and it has been shown to learn complex control strategies in many different domains. We explore the benefits and challenges of applying reinforcement learning to control different cloud systems according to complex objectives, and the usability concerns that show up in practice.

Starting off, we explore the combined control of cooling systems and load balancing in a datacenter. Cooling is a major energy consumer in datacenters, giving us a natural objective for optimization, and load balancing affects the heat distribution in the datacenter, and thus the cooling. In a simple simulated environment, we apply reinforcement learning to control a mix of discrete and continuous control variables over both cooling and load balancing, with the objective of reducing energy consumption while adhering to temperature thresholds for the servers.
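The dissertation does not prescribe a specific implementation for this setup; as a rough illustration only, a Gym-style environment with a mixed discrete/continuous action space could be sketched as below. All names, dimensions, and reward terms are hypothetical and chosen purely to show the shape of the problem.

```python
# Hypothetical sketch: a Gym-style environment pairing a discrete
# load-balancing decision with a continuous cooling setpoint.
import numpy as np
import gymnasium as gym
from gymnasium import spaces

N_SERVERS = 4
TEMP_LIMIT = 75.0  # assumed per-server temperature threshold (degrees C)

class DatacenterEnv(gym.Env):
    """Toy datacenter: place incoming load on a server and set cooling power."""

    def __init__(self):
        # Observation: per-server temperature and utilization.
        self.observation_space = spaces.Box(
            low=0.0, high=100.0, shape=(2 * N_SERVERS,), dtype=np.float32)
        # Action: which server receives the next batch of jobs (discrete)
        # plus a cooling setpoint in [0, 1] (continuous).
        self.action_space = spaces.Dict({
            "placement": spaces.Discrete(N_SERVERS),
            "cooling": spaces.Box(low=0.0, high=1.0, shape=(1,), dtype=np.float32),
        })

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.temp = np.full(N_SERVERS, 35.0, dtype=np.float32)
        self.util = np.zeros(N_SERVERS, dtype=np.float32)
        return self._obs(), {}

    def step(self, action):
        target, cooling = action["placement"], float(action["cooling"][0])
        self.util *= 0.9                                   # jobs drain over time
        self.util[target] = min(self.util[target] + 20.0, 100.0)
        # Heat rises with utilization and falls with cooling effort.
        self.temp = np.maximum(self.temp + 0.1 * self.util - 5.0 * cooling, 20.0)
        energy = self.util.sum() / 100.0 + 3.0 * cooling   # cooling dominates
        violation = np.maximum(self.temp - TEMP_LIMIT, 0.0).sum()
        reward = -energy - 10.0 * violation                # save energy, respect thresholds
        return self._obs(), reward, False, False, {}

    def _obs(self):
        return np.concatenate([self.temp, self.util]).astype(np.float32)
```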
In this setting, we find that the learned controller is able to use the cooling system efficiently, improving on a baseline implemented with standard methods. Scaling this up and adding a more realistic air-flow simulation, we find that the gain from perfect load placement is so small that it amounts to noise compared with other factors in the cooling system. Instead, we focus on controlling the cooling system with the larger observational context, showing that it outperforms existing standard methods while also adapting to changes in the system.

We then look at the problem of scaling a web service in a cloud environment, where the service is built from many interconnected microservices. These are typically scaled using local reactive controllers, but a proactive controller should improve performance. By providing a reinforcement learning agent with a view over all the services, it implicitly learns how different jobs traverse the system and uses this to proactively scale services, keeping fewer resources in reserve while still meeting response-time requirements.

Moving on from model-free control, we turn to using an existing fluid model of a microservice to create a controller. The fluid model is used to simulate trajectories for a load-balancing controller, and by defining arbitrary loss functions over the trajectory, we can optimize the parameters of the controller using automatic differentiation. The resulting controller behaves well, though we only take a single gradient step to ensure stable updates, since the accuracy of the fluid model degrades as the system moves away from the training data. We then show how an imperfect model can be extended with neural networks to capture unmodelled dynamics. For the fluid model, the increased accuracy of the extended model allows for more gradient steps and thus faster policy convergence.

While we find that reinforcement learning can indeed be used to create policies that improve on standard control methods, several usability concerns arise when applying these methods to real systems. The main issue is the instability of the whole process, from exploration during training driving the system into bad states, to opaque function approximators making it difficult to ensure that the controller behaves as expected when deployed. While we discuss several methods to mitigate these issues, what actually works is highly dependent on the specific system and the requirements on the controller.
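To make the differentiable-trajectory idea above more concrete, the following is a purely illustrative miniature, not the dissertation's fluid model: a toy queueing "fluid" is rolled out under a parametric load-balancing split, scored with a loss over the trajectory, and updated with a single gradient step via automatic differentiation (here JAX). The dynamics, loss, and parameter names are all invented for the sketch.

```python
# Hypothetical sketch of differentiable trajectory optimization (JAX).
# Two server queues drained at fixed rates, fed by a load balancer whose
# traffic split is the learnable controller parameter.
import jax
import jax.numpy as jnp

ARRIVAL_RATE = 8.0                      # assumed job arrival rate
SERVICE_RATES = jnp.array([5.0, 4.0])   # assumed per-server service rates
DT, STEPS = 0.1, 100

def rollout(theta, q0):
    """Simulate queue lengths under a softmax load-balancing split."""
    split = jax.nn.softmax(theta)       # fraction of traffic sent to each server

    def step(q, _):
        inflow = ARRIVAL_RATE * split
        outflow = SERVICE_RATES * (q > 0)        # drain only non-empty queues
        q_next = jnp.maximum(q + DT * (inflow - outflow), 0.0)
        return q_next, q_next

    _, trajectory = jax.lax.scan(step, q0, None, length=STEPS)
    return trajectory

def loss(theta, q0):
    """Arbitrary loss over the trajectory: mean queue length, a proxy for latency."""
    return rollout(theta, q0).mean()

theta = jnp.zeros(2)                    # controller parameters (logits of the split)
q0 = jnp.array([2.0, 2.0])              # initial queue lengths

# A single gradient step, mirroring the cautious one-step updates described above.
grad = jax.grad(loss)(theta, q0)
theta = theta - 0.5 * grad
```

Extending such a model with neural-network terms for unmodelled dynamics would, in the same spirit, mean adding learned components inside `step` and fitting them to measured trajectories before optimizing the controller further.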
