About this Digital Document
This dissertation consists of two parts and five chapters. In Chapter 1, we present the fundamental concepts on which the works are based and provide an overview of the dissertation. In Part I, we develop two novel adaptive algorithms for machine learning problems, derive their theoretical convergence guarantees, conduct numerical experiments on binary classification problems, and explore the empirical merits of adopting both methods in particular settings. In Part II, we develop deep neural networks for engineering application problems, analyze the challenges in training deep neural networks for these problems, propose second-order methods for training the networks, and conduct numerical experiments to evaluate the performance of the proposed methods.

In Chapter 2, we present a first-order adaptive and implicit stochastic recursive gradient method for solving Empirical Risk Minimization (ERM), or Finite-Sum, problems in machine learning. The algorithm retains the outer-inner loop structure of the classic stochastic recursive gradient method yet is fully adaptive and requires no hyper-parameter tuning. It takes an implicit approach to estimate the local geometry and dynamically determine the step size at every iteration. When the Finite-Sum is strongly convex and the component functions are convex, the algorithm is guaranteed to converge linearly, and it can converge faster than the classic stochastic recursive gradient method when the local geometry permits. The algorithm is easy to implement and computationally efficient. In practice, its empirical performance is competitive with state-of-the-art first-order methods that use fine-tuned hyper-parameters; despite being tuning-free, it consistently outperforms its fine-tuned classic counterparts on an extensive set of problems and cases.

In Chapter 3, we present a second-order doubly adaptive scaled algorithm for solving ERM or Finite-Sum problems in machine learning. The algorithm uses an efficient Hessian diagonal estimate to scale the gradient when updating the iterates, and it requires only Hessian-vector products, which can be computed very efficiently. It is fully adaptive, with the possible exception of the bias-correction and truncation hyper-parameters, and its empirical performance is shown to be insensitive to the values chosen for both. For convex and strongly convex Finite-Sum problems, the algorithm is guaranteed to converge with an adaptive step size in the deterministic regime. As an extension, we explore a preconditioned Trust Region Newton-CG method in which the algorithm preconditions the sub-problem solved by the Conjugate Gradient method; the empirical study demonstrates its merits in practice.
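To make the Hessian-vector-product idea above concrete, the following is a minimal, hypothetical sketch of a Hutchinson-style diagonal estimator written in PyTorch; the function name estimate_hessian_diagonal and the parameter num_samples are our own illustrative choices, and the sketch is not the algorithm developed in Chapter 3.

    import torch

    def estimate_hessian_diagonal(loss, params, num_samples=10):
        # Hutchinson-style estimate of diag(H): average z * (H z) over random
        # Rademacher vectors z, using only Hessian-vector products.
        grads = torch.autograd.grad(loss, params, create_graph=True)
        diag = [torch.zeros_like(p) for p in params]
        for _ in range(num_samples):
            zs = [torch.randint_like(p, 2) * 2 - 1 for p in params]  # entries are +/-1
            hvps = torch.autograd.grad(grads, params, grad_outputs=zs, retain_graph=True)
            for d, z, hv in zip(diag, zs, hvps):
                d.add_(z * hv / num_samples)  # E[z * (H z)] equals diag(H)
        return diag

A scaled-gradient step would then divide each gradient coordinate by a truncated absolute value of the corresponding diagonal estimate, which is where the truncation hyper-parameter mentioned above enters.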
In Chapter 4, we present a graph-embedding periodic system neural network framework to predict the potential energy surface (PES) of periodic molecular systems. The PES is a function of various inputs, such as atomic coordinates and atomic types, and our networks achieve the desired predictive accuracy using the coordinates alone. We discuss in detail the challenges of developing and training a neural network to predict the PES, and these challenges motivate the use of second-order methods; the methods chosen are Trust Region Newton-CG and the stochastic variant of the algorithm proposed in Chapter 3. In the empirical study, the second-order methods handle the training challenges better and deliver results competitive with state-of-the-art first-order methods. As an extension of the framework, the chapter explores the potential benefit of introducing forces, which can be modeled as the negative gradient of the PES with respect to the atomic coordinates. The empirical study on a synthetic dataset demonstrates that incorporating forces can help achieve the desired predictive accuracy when the networks are trained with second-order methods.

In Chapter 5, we present a finite-difference neural network framework for solving linear partial differential equations. The framework is built on finite-difference approximations of partial derivatives and learns the underlying governing partial differential equations from trajectory data. Its critical components are learning blocks, which are essentially residual networks that capture the behavior of the underlying dynamical systems from the trajectory data. With this design, the target solutions of the partial differential equations can, remarkably, be learned with a small number of parameters. We also propose to train the networks with the Trust Region Newton-CG method and compare it with a strong first-order baseline; the empirical study shows the superior performance of training the proposed networks with the chosen second-order method.
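As a purely illustrative companion to the finite-difference viewpoint, the following sketch shows a residual learning block that advances a one-dimensional field by one time step with central-difference stencils; the class name, the assumed model du/dt = a * u_x + b * u_xx, and the periodic boundary handling are our own assumptions rather than the networks designed in Chapter 5.

    import torch
    import torch.nn as nn

    class FiniteDifferenceBlock(nn.Module):
        # Residual update u_new = u + dt * (a * u_x + b * u_xx), where u_x and u_xx
        # come from central differences and the boundary is periodic via torch.roll.
        def __init__(self, dx, dt):
            super().__init__()
            self.dx, self.dt = dx, dt
            self.a = nn.Parameter(torch.zeros(1))  # learnable advection coefficient
            self.b = nn.Parameter(torch.zeros(1))  # learnable diffusion coefficient

        def forward(self, u):  # u has shape (batch, grid_points)
            u_x = (torch.roll(u, -1, dims=-1) - torch.roll(u, 1, dims=-1)) / (2 * self.dx)
            u_xx = (torch.roll(u, -1, dims=-1) - 2 * u + torch.roll(u, 1, dims=-1)) / self.dx ** 2
            return u + self.dt * (self.a * u_x + self.b * u_xx)

Even this toy block has only two trainable parameters, echoing the observation above that target solutions can be learned with a small number of parameters; stacking such blocks in time and fitting them to trajectory snapshots is one way such coefficients could be recovered from data.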