Deterministic, Explainable and Resource-Efficient Stream Processing for Cyber-Physical Systems

Abstract: We are undeniably living in the era of big data, where people and machines generate information at an unprecedented rate. While processing such data can provide immense value, it can prove especially challenging because of the data's Volume, Variety, and Velocity. Velocity can be particularly important in environments that need to respond to incoming data in near real-time, such as cyber-physical systems. In such cases, the batch processing paradigm, which requires all data to be persistently stored and available, might not be appropriate. Instead, it can be desirable to perform stream processing, where unbounded datasets of streaming data are processed in an online manner, generating results quickly and thus significantly benefiting applications with strict latency requirements. However, it can be challenging for stream processing to provide the same guarantees and ease of use as traditional batch processing systems. This thesis studies ways to alleviate this by introducing techniques that make stream processing more predictable, explainable, and resource-efficient. In the first part of the thesis, we study determinism, which can guarantee predictable and reproducible results in stream processing, regardless of the characteristics of the runtime system. We present Viper, a module for stream processing frameworks that provides determinism with minimal performance impact. In the second part, we study fine-grained data provenance, which links each streaming result with the inputs that led to its generation. Fine-grained data provenance can make stream processing easier to understand and debug. Additionally, it can reduce storage and transmission costs by making it possible to maintain only the essential input data. We propose GeneaLog, a framework that provides fine-grained data provenance in stream processing with minimal overhead. In the third part of the thesis, we explore scheduling and its use in stream processing for controlling resource allocation and achieving specific performance goals. We develop Haren, a framework that can be integrated into stream processing frameworks to provide custom thread scheduling capabilities. We study Haren's efficiency and the facilities it offers for controlling the resource allocation of a streaming system. We evaluate all three proposed frameworks with relevant real-world streaming use cases and illustrate their efficiency and ease of use.
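To make the determinism studied in the first part more concrete, the sketch below shows one common way a streaming operator with multiple inputs can produce reproducible output: consuming tuples in event-timestamp order rather than arrival order. This is only an illustration of the general idea, not Viper's actual API; the Tuple record and nextDeterministic helper are hypothetical names introduced here.

import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

// Minimal sketch of deterministic merging (hypothetical names, not Viper's API):
// tuples from multiple input streams are consumed in event-timestamp order
// rather than arrival order, so the downstream operator sees the same
// sequence on every run, regardless of thread or network timing.
public class DeterministicMerge {

  // Hypothetical tuple carrying an event timestamp and a payload.
  record Tuple(long timestamp, String payload) {}

  // Return the next tuple to process, or null if some input is empty
  // (picking anyway could make the order depend on arrival timing).
  static Tuple nextDeterministic(List<Queue<Tuple>> inputs) {
    Queue<Tuple> chosen = null;
    for (Queue<Tuple> q : inputs) {
      if (q.isEmpty()) {
        return null;
      }
      if (chosen == null || q.peek().timestamp() < chosen.peek().timestamp()) {
        chosen = q;
      }
    }
    return chosen.poll();
  }

  public static void main(String[] args) {
    Queue<Tuple> left = new ArrayDeque<>(List.of(new Tuple(1, "l1"), new Tuple(4, "l2")));
    Queue<Tuple> right = new ArrayDeque<>(List.of(new Tuple(2, "r1"), new Tuple(3, "r2")));
    Tuple next;
    while ((next = nextDeterministic(List.of(left, right))) != null) {
      System.out.println(next); // timestamps 1, 2, 3 on every run
    }
  }
}

Waiting until every input has a tuple trades a little latency for a fully reproducible order; a production module would typically bound that wait, for example with punctuations or watermarks.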
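The second part's notion of fine-grained data provenance can likewise be illustrated with a small example: each derived tuple keeps references to the source tuples that contributed to it. The types and method below are hypothetical and do not reflect GeneaLog's actual mechanism.

import java.util.ArrayList;
import java.util.List;

// Minimal sketch of fine-grained provenance (hypothetical types, not
// GeneaLog's mechanism): every derived tuple keeps references to the source
// tuples that contributed to it, so a result can be traced back to, and
// explained by, its exact inputs.
public class ProvenanceSketch {

  // Hypothetical source and derived tuple types.
  record Reading(long id, double temperature) {}
  record Alert(double avgTemperature, List<Reading> contributors) {}

  // Aggregate a window of readings and record which inputs contributed;
  // a real system would keep compact pointers rather than full copies.
  static Alert averageWithProvenance(List<Reading> window) {
    double sum = 0;
    for (Reading r : window) {
      sum += r.temperature();
    }
    return new Alert(sum / window.size(), new ArrayList<>(window));
  }

  public static void main(String[] args) {
    List<Reading> window = List.of(new Reading(1, 20.5), new Reading(2, 31.0));
    Alert alert = averageWithProvenance(window);
    System.out.println("avg = " + alert.avgTemperature());
    // Only the contributing readings need to be kept or transmitted.
    alert.contributors().forEach(r -> System.out.println("  from " + r));
  }
}

Because each result carries its own lineage, only the inputs actually referenced by a result need to be stored or transmitted, which is how provenance can reduce storage and transmission costs.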
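Finally, the scheduling idea from the third part can be pictured as a pluggable policy deciding which operator a worker thread runs next; the sketch below uses a longest-input-queue-first policy. The Operator interface, FixedOperator class, and method names are illustrative assumptions and do not reflect Haren's actual interfaces.

import java.util.Comparator;
import java.util.List;

// Minimal sketch of policy-driven operator scheduling (hypothetical
// interface, not Haren's API): a worker thread repeatedly asks a
// user-defined policy which operator to run next, so changing the policy
// changes how CPU time is allocated without touching operator logic.
public class SchedulingSketch {

  // Hypothetical handle exposing an operator's backlog and one unit of work.
  interface Operator {
    String name();
    int pendingTuples();
    void runOneBatch();
  }

  // The policy is just a Comparator; the operator ranked highest runs next.
  static Operator pickNext(List<Operator> operators, Comparator<Operator> policy) {
    return operators.stream().max(policy).orElseThrow();
  }

  // Simplified worker loop: pick an operator, run one batch, re-evaluate.
  static void workerLoop(List<Operator> operators, Comparator<Operator> policy, int steps) {
    for (int i = 0; i < steps; i++) {
      Operator next = pickNext(operators, policy);
      System.out.println("running " + next.name());
      next.runOneBatch();
    }
  }

  public static void main(String[] args) {
    // Two toy operators with fixed backlogs, scheduled longest-queue-first.
    Operator parse = new FixedOperator("parse", 10);
    Operator aggregate = new FixedOperator("aggregate", 3);
    Comparator<Operator> longestQueueFirst = Comparator.comparingInt(Operator::pendingTuples);
    workerLoop(List.of(parse, aggregate), longestQueueFirst, 5);
  }

  // Toy operator whose backlog shrinks by one each time it runs.
  static final class FixedOperator implements Operator {
    private final String name;
    private int pending;
    FixedOperator(String name, int pending) { this.name = name; this.pending = pending; }
    public String name() { return name; }
    public int pendingTuples() { return pending; }
    public void runOneBatch() { pending = Math.max(0, pending - 1); }
  }
}

Swapping the Comparator, for example for a latency- or priority-based ranking, changes the resource allocation goal without modifying the operators themselves.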

  CLICK HERE TO DOWNLOAD THE WHOLE DISSERTATION. (in PDF format)