Coding, Computing, and Communication in Distributed Storage Systems

University dissertation from KTH Royal Institute of Technology

Abstract: Conventional studies in communication networks mostly focus on securely and reliably transmitting data from a source node (or multiple source nodes) to multiple destinations. A more general problem appears when the destination nodes are interested in obtaining functions of the data available in distributed source nodes. To obtain a function, transmitting all the data to a destination node and then computing the function there might be inefficient. To exploit the network resources efficiently, this general problem calls for distributed computing in combination with coding and communication. The problem has applications in distributed systems, e.g., in wireless sensor networks, in distributed storage systems, and in distributed computing systems. Following this general problem formulation, we study the optimal and secure recovery of lost data in storage nodes and the reconstruction of a version of a file in distributed storage systems. The study is significant because new trends in communications, including big data, the Internet of Things, low latency, and high-reliability communications, challenge the existing centralized data storage systems. Distributed storage systems can address those issues by distributing thousands of storage nodes (possibly around the globe) and thereby bringing data closer to the users. Yet, distributing the storage nodes brings new challenges. In these distributed systems, where storage nodes are connected through links and servers, communication plays a major role in performance. In addition, part of the network may fail, or multiple versions of a file may exist because of communication failures or delays. Moreover, an intruder can overhear the communications between storage nodes and obtain some information about the stored data. Therefore, there are challenges in reliability, security, availability, and consistency.
To increase reliability, systems need to store redundant data in storage nodes and employ error control codes. To maintain reliability in a dynamic environment where storage nodes can fail, the system should have an autonomous repair process. Namely, it should regenerate the failed nodes with the help of other storage nodes. The repair process demands bandwidth, energy, or, in general, transmission costs. We propose novel techniques to reduce the repair cost in distributed storage systems. First, we propose surviving-node cooperation in repair, meaning that surviving nodes can combine their received data with their own stored data and then transmit the result toward the new node. In addition, we study the repair problem in multi-hop networks and consider the cost of transmitting data between storage nodes. While the classical repair model assumes that direct links are available between the new node and the surviving nodes, we consider that such links may be unavailable, either because of failure or because of their cost. We formulate an optimization problem to minimize the repair cost and compare two systems, namely with and without surviving-node cooperation. Second, we study the repair problem where the links between storage nodes are lossy, e.g., due to server congestion, load balancing, or an unreliable physical layer (wireless links). We model the lossy links as packet erasure channels and then derive the fundamental bandwidth-storage tradeoff in packet erasure networks. In addition, we propose dedicated-for-repair storage nodes to reduce the repair bandwidth. Third, we generalize the repair model by proposing the concept of partial repair, in which storage nodes may lose only parts of their stored data. In partial repair, the lost data is recovered by exchanging data between storage nodes and using the data still available in the storage nodes as side information.
For efficient partial repair, we propose two-layer coding in distributed storage systems and derive the optimal bandwidth for partial repair. Fourth, we study security in distributed storage systems, focusing on security in partial repair. In particular, we propose codes that make partial repair secure in the sense of both strong and weak information-theoretic security definitions. Finally, we study consistency in distributed storage systems. Consistency means that distinct users obtain the latest version of a file in a system that stores multiple versions of that file. Given the probability that a storage node receives a version and the constraint on the node storage space, we aim to find the optimal encoding of multiple versions of a file. The optimal encoding maximizes the probability that a read client connecting to a number of storage nodes obtains the latest version of the file or a version close to it.