About this Digital Document
Many important problems in machine learning (ML) and data science are formulated as optimization problems and solved using optimization algorithms. Given the scale of modern datasets and the ill-conditioned nature of real-world problems, new, more powerful algorithms are needed. The goal of this thesis is to develop efficient and scalable optimization algorithms, with corresponding convergence guarantees, for the ill-conditioned problems that often arise in large-scale ML. The thesis contains three main parts comprising five chapters. Chapter 1 presents novel accelerated gradient methods with optimality certificates; Chapters 2 and 3 introduce new quasi-Newton algorithms; and Chapters 4 and 5 develop efficient and scalable distributed optimization algorithms.

The focus of Chapter 1 is the question of how to construct an appropriate sequence of lower bounds for strongly convex smooth functions and for strongly convex composite functions. We propose several first-order methods for minimizing strongly convex functions in both the smooth and composite settings. We show that all of the presented algorithms converge linearly, with the accelerated variants enjoying the optimal linear rate of convergence.

Chapter 2 is concerned with two sampled quasi-Newton methods: sampled LBFGS and sampled LSR1. In contrast to the classical variants of these methods, which sequentially build (inverse) Hessian approximations as the optimization progresses, our proposed methods sample points randomly around the current iterate to produce these approximations. As a result, the approximations make use of more reliable (recent and local) information and do not depend on past information that could be significantly stale.

Chapter 3 describes a new optimization algorithm that bridges the gap between first- and second-order methods by computing a search direction that applies a second-order-type update in one subspace, coupled with a scaled steepest-descent step in the orthogonal complement subspace. In this way, partial curvature information is incorporated to help with ill-conditioning, while the algorithm still scales to the large problem dimensions often encountered in machine learning applications.

In Chapter 4, a novel approach is proposed in which the sample size is gradually increased in order to quickly obtain a solution whose empirical loss is within a satisfactory statistical accuracy. The proposed method is multistage: the solution of one stage serves as a warm start for the next stage, which uses a larger sample that includes all samples from the previous stage. The multistage algorithm reduces the number of passes over the data needed to reach the approximate statistical accuracy of the full training set. Moreover, the algorithm is naturally suited to distributed implementation and exhibits strong scaling, meaning that additional computing nodes are expected to yield further acceleration.

Chapter 5 presents a scalable distributed implementation of the Sampled Limited-memory Symmetric Rank-1 (S-LSR1) algorithm. First, we show that a naive distributed implementation of S-LSR1 requires multiple rounds of expensive communication at every iteration and is therefore inefficient. We then propose DS-LSR1, a communication-efficient variant that: (i) drastically reduces the amount of data communicated at every iteration, (ii) has favorable work-load balancing across nodes, and (iii) is matrix-free and inverse-free.
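To make the chapter summaries above more concrete, a few illustrative sketches follow. For Chapter 1, the sketch below shows the classical accelerated gradient method for a mu-strongly convex, L-smooth function, which attains the optimal linear rate referenced in the chapter; it is not the thesis's new lower-bound-based methods, and the names grad_f, mu, L, and the quadratic example are illustrative assumptions.

```python
import numpy as np

def accelerated_gradient(grad_f, x0, mu, L, n_iters=100):
    """Classical accelerated gradient method for a mu-strongly convex,
    L-smooth function; converges linearly at the optimal rate, with the
    error contracting roughly by (1 - sqrt(mu/L)) per iteration."""
    kappa = L / mu                                            # condition number
    beta = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)    # constant momentum weight
    x, y = x0.copy(), x0.copy()
    for _ in range(n_iters):
        x_next = y - grad_f(y) / L            # gradient step from the extrapolated point
        y = x_next + beta * (x_next - x)      # momentum (extrapolation) step
        x = x_next
    return x

# Example: an ill-conditioned quadratic f(x) = 0.5 * x^T A x with mu = 1, L = 100.
A = np.diag([1.0, 100.0])
x_star = accelerated_gradient(lambda x: A @ x, np.array([1.0, 1.0]), mu=1.0, L=100.0)
```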
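For Chapter 2, the following is a minimal sketch of the sampling idea behind the sampled quasi-Newton methods: curvature pairs are built from points drawn around the current iterate rather than from past iterates, and a standard L-BFGS two-loop recursion then applies the implicit inverse-Hessian approximation to the gradient. The sampling radius, the positive-curvature filter, and the function names are illustrative assumptions, not the exact rules used in the thesis.

```python
import numpy as np

def sampled_lbfgs_direction(grad_f, x, m=10, radius=1e-2, rng=np.random.default_rng(0)):
    """Sketch of a sampled L-BFGS step: build m curvature pairs from points
    sampled around x, then apply the two-loop recursion to -grad_f(x)."""
    g = grad_f(x)
    S, Y = [], []
    for _ in range(m):
        s = radius * rng.standard_normal(x.shape)   # random displacement around x
        y = grad_f(x + s) - g                       # local curvature information
        if s @ y > 1e-10:                           # keep only positive-curvature pairs
            S.append(s)
            Y.append(y)
    if not S:
        return -g                                   # fall back to steepest descent

    # Standard L-BFGS two-loop recursion using the sampled (s, y) pairs.
    q = g.copy()
    rhos = [1.0 / (s @ y) for s, y in zip(S, Y)]
    alphas = []
    for s, y, rho in zip(reversed(S), reversed(Y), reversed(rhos)):
        a = rho * (s @ q)
        alphas.append(a)
        q -= a * y
    gamma = (S[-1] @ Y[-1]) / (Y[-1] @ Y[-1])       # initial inverse-Hessian scaling
    r = gamma * q
    for (s, y, rho), a in zip(zip(S, Y, rhos), reversed(alphas)):
        b = rho * (y @ r)
        r += (a - b) * s
    return -r                                        # quasi-Newton search direction
```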
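For Chapter 3, the sketch below illustrates one way to combine a second-order-type step in a subspace with scaled steepest descent in its orthogonal complement. The choice of the subspace basis V, the Hessian-vector-product interface hess_vec, and the scaling alpha are assumptions made for illustration; the thesis's algorithm selects and scales these quantities by its own rules.

```python
import numpy as np

def subspace_direction(grad, hess_vec, V, alpha=1e-1):
    """Sketch of a two-subspace search direction: a Newton-type step inside
    span(V), formed with only k Hessian-vector products, plus a scaled
    steepest-descent step in the orthogonal complement of span(V).

    grad     : gradient at the current iterate, shape (n,)
    hess_vec : function v -> H @ v (Hessian-vector product); H is never formed
    V        : (n, k) matrix with orthonormal columns spanning the curvature subspace
    alpha    : step scaling used in the complement subspace
    """
    HV = np.column_stack([hess_vec(V[:, j]) for j in range(V.shape[1])])
    H_small = V.T @ HV                     # k x k reduced Hessian
    g_small = V.T @ grad                   # gradient projected onto span(V)

    p_subspace = -V @ np.linalg.solve(H_small, g_small)  # second-order step in span(V)
    g_perp = grad - V @ g_small                          # gradient component outside span(V)
    p_complement = -alpha * g_perp                       # scaled steepest descent in the complement
    return p_subspace + p_complement
```

Because only the k-by-k reduced Hessian is ever factorized, the cost of the curvature step stays small even when the ambient dimension n is very large, which is the scalability point made in the chapter.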
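For Chapter 4, here is a minimal sketch of the multistage, growing-sample-size idea, using plain gradient descent on a least-squares loss purely as a stand-in inner solver. The loss, the inner solver, the growth factor, and the fixed inner-iteration budget are illustrative assumptions; the thesis ties the per-stage accuracy to the statistical accuracy of each sample.

```python
import numpy as np

def multistage_erm(X, y, n0=128, growth=2, inner_steps=50, lr=0.1, rng=np.random.default_rng(0)):
    """Sketch of the multistage idea: solve ERM on a small sample, then
    warm-start the next stage on a larger sample that contains the previous
    one, until the full training set is used."""
    n, d = X.shape
    perm = rng.permutation(n)              # fix an ordering; each stage uses a prefix of it
    w = np.zeros(d)                        # starting point for the first stage
    m = n0
    while True:
        Xs, ys = X[perm[:m]], y[perm[:m]]  # current stage's sample (contains all earlier samples)
        for _ in range(inner_steps):       # stand-in inner solver: gradient descent on least squares
            w -= lr * Xs.T @ (Xs @ w - ys) / m
        if m >= n:
            return w                       # full training set reached
        m = min(growth * m, n)             # next stage: larger sample, warm-started at current w
```

Because early stages touch only a small fraction of the data, the total number of passes over the full dataset needed to reach its statistical accuracy is reduced, which is the effect described in the chapter.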
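For Chapter 5, the sketch below is not the DS-LSR1 communication scheme; it only illustrates why a limited-memory SR1 approximation can be applied matrix-free using the standard compact representation, so that only k-dimensional vectors and k-by-k matrices ever appear. The choice B_0 = gamma*I and the absence of the usual SR1 pair-skipping safeguard are simplifying assumptions.

```python
import numpy as np

def lsr1_matvec(v, S, Y, gamma=1.0):
    """Matrix-free product B @ v for the compact limited-memory SR1
    approximation B = gamma*I + Psi M^{-1} Psi^T, where Psi = Y - gamma*S
    and M = D + L + L^T - gamma * S^T S, with D the diagonal and L the
    strictly lower-triangular part of S^T Y.  The n x n matrix B is never
    formed; only (n, k) and (k, k) quantities appear.

    S, Y : (n, k) matrices whose columns are the k curvature pairs.
    """
    Psi = Y - gamma * S                     # n x k
    SY = S.T @ Y                            # k x k
    D = np.diag(np.diag(SY))
    L = np.tril(SY, k=-1)
    M = D + L + L.T - gamma * (S.T @ S)     # small k x k middle matrix
    return gamma * v + Psi @ np.linalg.solve(M, Psi.T @ v)
```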