Building Distributed Systems for High-Stress Environments using Reversibility and Phase-Awareness

University dissertation from Stockholm: KTH Royal Institute of Technology

Abstract: Large-scale applications for mobile devices and the Internet of Things live in stressful real-world environments: they face both continuous faults and bursts of high fault rates. Typical faults are node crashes, network partitions, and communication delays. In this thesis, we propose a principled approach to building applications that survive in such environments by using the concepts of Reversibility and Phase. A system is Reversible if the set of operations it provides depends on its current stress, and not on the history of the stress. By stress we mean all the potential perturbing effects of the environment on the system, including both faults and other nonfunctional properties such as communication delay and bandwidth. Reversibility generalizes standard fault tolerance with nested fault models: when the stress pushes the fault rate outside one model, it is still inside the scope of the next model. As stress is a global condition that cannot easily be measured by individual nodes, we propose the concept of Phase in order to approximate, at each node, the set of operations the system currently makes available. Phase is a per-node property and can be determined with no additional distributed computation. We present two case studies. First, we present a transactional key-value store built on a structured overlay network and explain how to make it Reversible. Second, we present a distributed collaborative graphic editor built on top of the key-value store and explain how to make it Phase-Aware, i.e., it optimizes its behavior according to a real-time observation of phase at each node using a Phase API. These case studies show the usefulness of Reversibility and Phase-Awareness for building large-scale Internet applications.
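
The definition of Reversibility and the idea of nested fault models can be illustrated with a minimal sketch. Nothing below is taken from the dissertation: the representation of stress as a number between 0 and 1, the three thresholds, the operation names, and the function available_operations are all hypothetical, chosen only to show that the set of operations is a function of the current stress and not of its history.

```python
# Hypothetical nested fault models, ordered from the innermost (lowest
# stress) to the outermost. Each entry is (upper stress bound, operations
# offered). The bounds and operation names are illustrative assumptions.
FAULT_MODELS = [
    (0.1, frozenset({"read", "write", "transaction"})),
    (0.5, frozenset({"read", "write"})),
    (1.0, frozenset({"read"})),
]


def available_operations(stress: float) -> frozenset:
    """Return the operations of the innermost model that contains `stress`.

    The result depends only on the current stress value. No history is
    stored, which is the property the abstract calls Reversibility.
    """
    for bound, operations in FAULT_MODELS:
        if stress <= bound:
            return operations
    # Stress beyond every model in this sketch: nothing is offered.
    return frozenset()


if __name__ == "__main__":
    # A burst of stress followed by recovery: once the stress returns to
    # its earlier level, the same operations are offered again.
    for stress in [0.05, 0.8, 0.3, 0.05]:
        print(stress, sorted(available_operations(stress)))
```

In this sketch, the first and last stress values in the sequence yield the same set of operations even though a burst occurred in between, and a stress level that leaves the innermost model is still covered by the next one.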
