Contributions to Performance Modeling and Management of Data Centers

University dissertation from Stockholm : KTH Royal Institute of Technology

Abstract: Over the last decade, Internet-based services, such as electronic mail, music-on-demand, and social-network services, have changed the ways we communicate and access information. Usually, the key functionality of such a service is in backend components, which are located in a data center, a facility for hosting computing systems and related equipment. This thesis focuses on two fundamental problems related to the management, dimensioning, and provisioning of such backend components.

The first problem centers around resource allocation for a large-scale cloud environment. Data centers have become very large; they often contain hundreds of thousands of machines and applications. In such a data center, resource allocation cannot be efficiently achieved through a traditional management system that is centralized in nature. Therefore, a more scalable solution is needed. To address this problem, we have developed and evaluated a scalable and generic protocol for resource allocation. The protocol is generic in the sense that it can be instantiated for different management objectives through objective functions. The protocol jointly allocates CPU, memory, and network resources to applications that are hosted by the cloud. We prove that the protocol converges to a solution if an objective function satisfies a certain property. We perform a simulation study of the protocol for realistic scenarios. Simulation results suggest that the quality of the allocation is independent of the system size, up to 100,000 machines and applications, for the management objectives considered.

The second problem is related to performance modeling of a distributed key-value store. The specific distributed key-value store we focus on in this thesis is the Spotify storage system. Understanding the performance of the Spotify storage system is essential for achieving a key quality of service objective, namely that the playback latency of a song is sufficiently low.
To address this problem, we have developed and evaluated models for predicting the performance of a distributed key-value store for a lightly loaded system. First, we developed a model that allows us to predict the response time distribution of requests. Second, we modeled the capacity of the distributed key-value store for two different object allocation policies. We evaluate the models by comparing model predictions with measurements from two different environments: our lab testbed and a Spotify operational environment. We found that the models are accurate in the sense that the prediction error, i.e., the difference between the model predictions and the measurements from the real systems, is at most 11%.
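
As a rough illustration of the accuracy criterion above, the following sketch computes a worst-case prediction error over a set of model predictions and measurements. The abstract only defines the error as the difference between prediction and measurement; normalizing by the measured value to obtain a percentage is an assumption here, as are the function names and the sample values, which are hypothetical and not taken from the thesis.

```python
def relative_prediction_error(predicted, measured):
    """Return |predicted - measured| / |measured| as a fraction.

    Normalizing by the measurement is an assumption; the abstract
    does not state how the percentage error is computed.
    """
    if measured == 0:
        raise ValueError("measured value must be non-zero")
    return abs(predicted - measured) / abs(measured)


def max_relative_error(pairs):
    """Return the largest relative error over (predicted, measured) pairs."""
    return max(relative_prediction_error(p, m) for p, m in pairs)


# Hypothetical (predicted, measured) pairs, for illustration only.
samples = [(9.5, 10.0), (21.0, 20.0), (33.0, 30.0)]
worst = max_relative_error(samples)
print(f"worst-case prediction error: {worst:.1%}")
```

Under this reading, a model would meet the bound reported in the abstract if the worst-case value stays at or below 0.11 across all compared scenarios.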