Asynchronous First-Order Algorithms for Large-Scale Optimization: Analysis and Implementation

University dissertation from Stockholm: KTH Royal Institute of Technology

Abstract: Developments in communication and data-storage technologies have made large-scale data collection more accessible than ever. Transforming this data into insights or decisions typically involves solving numerical optimization problems. As data volumes increase, the optimization problems grow so large that they can no longer be solved on a single computer. This has created a strong interest in developing optimization algorithms that can be executed efficiently on multiple computing nodes in parallel. One way to achieve efficiency in parallel computations is to allow for asynchrony among nodes, which corresponds to making the nodes spend less time coordinating with each other and more time computing, possibly based on delayed information. However, asynchrony runs the risk of making otherwise convergent algorithms diverge, and the convergence analysis of asynchronous algorithms is generally harder. In this thesis, we develop theory and tools to help understand and implement asynchronous optimization algorithms under time-varying, bounded information delays.

In the first part, we analyze the convergence of different asynchronous optimization algorithms. We first propose a new approach for minimizing the average of a large number of smooth component functions. The algorithm uses delayed partial gradient information, and it covers delayed incremental gradient and delayed coordinate descent algorithms as special cases. We show that when the total loss function is strongly convex and the component functions have Lipschitz-continuous gradients, the algorithm converges at a linear rate. The step size of the algorithm can be selected without knowing the bound on the delay and still guarantee convergence to within a predefined level of suboptimality. We then analyze two variants of incremental gradient descent algorithms for regularized optimization problems.
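To make the delayed-gradient idea concrete, the following is a minimal Python sketch (an illustration under simplifying assumptions, not code from the thesis): gradient descent on a strongly convex quadratic in which each update reads an iterate that may be up to a bounded number of steps stale. The problem, the delay model, and the step-size choice are all hypothetical stand-ins for the general setting analyzed in the thesis.

```python
import numpy as np

# Illustrative only: f(x) = 0.5 * x^T A x - b^T x with a diagonal SPD Hessian,
# so f is strongly convex with Lipschitz-continuous gradient. Each iteration
# uses a gradient evaluated at an iterate up to `max_delay` steps old.
rng = np.random.default_rng(0)
n = 5
A = np.diag(rng.uniform(1.0, 4.0, n))      # SPD Hessian -> strongly convex
b = rng.standard_normal(n)
x_star = np.linalg.solve(A, b)             # true optimum

max_delay = 3
# Conservative step size shrinking with the delay bound (hypothetical choice).
step = 1.0 / (np.max(np.diag(A)) * (max_delay + 1))

history = [np.zeros(n)]                    # iterate history for delayed reads
x = history[0].copy()
for k in range(2000):
    tau = rng.integers(0, min(max_delay, k) + 1)  # time-varying bounded delay
    x_delayed = history[-1 - tau]                 # read a stale iterate
    grad = A @ x_delayed - b                      # gradient at the stale point
    x = x - step * grad
    history.append(x.copy())

print(np.linalg.norm(x - x_star))          # small residual despite the delays
```

Despite reading stale iterates, the iterates still contract toward the optimum because the step size is kept small relative to the delay bound and the gradient's Lipschitz constant.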
In the first variant, asynchronous mini-batching, we consider solving regularized stochastic optimization problems with smooth loss functions. We show that the algorithm with time-varying step sizes achieves the best-known convergence rates under synchronous operation when (i) the feasible set is compact, or (ii) the regularization function is strongly convex and the feasible set is closed and convex. This means that the delays have an asymptotically negligible effect on the convergence, and we can expect speedups when using asynchronous computations. In the second variant, proximal incremental aggregated gradient, we show that when the objective function is strongly convex, the algorithm with a constant step size, chosen based on the maximum delay bound and the problem parameters, converges globally and linearly to the true optimum.

In the second part, we first present POLO, an open-source C++ library that focuses on algorithm development. We use the policy-based design approach to decompose the proximal gradient algorithm family into its essential policies. This lets us handle a combinatorially increasing number of design choices with only linearly many tools, and it generates highly efficient code with a small footprint. Together with its sister library in Julia, POLO.jl, our software framework helps optimization and machine-learning researchers quickly prototype their ideas, benchmark them against the state of the art, and ultimately deploy the algorithms on different computing platforms in just a few lines of code. Then, using the utilities of our software framework, we build a new, "serverless" executor for parallel Alternating Direction Method of Multipliers (ADMM) iterations. We use Amazon Web Services' Lambda functions as the computing nodes, and we observe speedups with up to 256 workers and efficiencies above 70% with up to 64 workers.
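As a rough illustration of the policy-based decomposition idea, the sketch below assembles a proximal gradient method from interchangeable step-size and prox policies. This is a hypothetical Python analogue; the function names and interfaces are invented for illustration and are not POLO's C++ API.

```python
import numpy as np

# Hypothetical policy-based assembly of the iteration
#   x <- prox(x - step * grad(x))
# where the step-size rule and the proximal operator are swappable "policies".

def constant_step(k):
    return 0.1                                   # step-size policy

def prox_l1(v, step, lam=0.05):
    # Prox policy for the l1 regularizer: soft-thresholding.
    return np.sign(v) * np.maximum(np.abs(v) - step * lam, 0.0)

def proximal_gradient(grad, x0, step_policy, prox_policy, iters=500):
    # Generic driver: any combination of policies yields a concrete algorithm.
    x = x0
    for k in range(iters):
        x = prox_policy(x - step_policy(k) * grad(x), step_policy(k))
    return x

# Lasso-type toy problem: minimize 0.5*||x - c||^2 + 0.05*||x||_1.
c = np.array([1.0, -0.02, 0.5])
x = proximal_gradient(lambda v: v - c, np.zeros(3), constant_step, prox_l1)
print(x)   # soft-thresholded c: entries shrunk toward zero, tiny ones set to 0
```

The point of the decomposition is that adding one new step-size rule or one new prox operator enriches every algorithm combination, which is how linearly many components can cover combinatorially many design choices.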
These preliminary results suggest that serverless runtimes, together with their availability and elasticity, are promising candidates for scaling the performance of distributed optimization algorithms.
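The parallel ADMM iterations mentioned above can be sketched in simplified form. The example below runs a consensus ADMM loop on a single machine with quadratic local objectives; it is an illustration under assumed data and objectives, whereas in the thesis each worker's update would execute on an AWS Lambda function.

```python
import numpy as np

# Consensus ADMM sketch: worker i holds f_i(x) = 0.5*||x - a_i||^2, and all
# workers must agree on a consensus variable z. The x- and u-updates are
# independent per worker (parallelizable); only the z-update coordinates.
rng = np.random.default_rng(1)
N, d, rho = 8, 3, 1.0
a = rng.standard_normal((N, d))            # per-worker data (hypothetical)
x = np.zeros((N, d))                       # local primal variables
u = np.zeros((N, d))                       # scaled dual variables
z = np.zeros(d)                            # consensus variable

for _ in range(100):
    # Local x-updates (run in parallel across workers); closed form for
    # quadratic f_i: argmin f_i(x_i) + (rho/2)*||x_i - z + u_i||^2.
    x = (a + rho * (z - u)) / (1.0 + rho)
    # Global averaging step: the only coordination point.
    z = np.mean(x + u, axis=0)
    # Dual updates, again independent per worker.
    u = u + x - z

print(np.linalg.norm(z - a.mean(axis=0)))  # z approaches the average of a_i
```

Because only the averaging step requires coordination, each x- and u-update can be dispatched to a separate short-lived worker, which is what makes the iteration a natural fit for a serverless runtime.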