Applied optimization
This file covers practical and numerical approaches to optimization problems. For the mathematical theory, see Optimization(Multivariable).
Numerical Optimization Methods
Gradient Descent
Gradient descent is an iterative first-order optimization algorithm for finding a local minimum of a differentiable function.
Basic Algorithm
- Start with an initial guess $\mathbf{x}_0$
- Update the position iteratively: $\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha_k \nabla f(\mathbf{x}_k)$
- Terminate when $\|\nabla f(\mathbf{x}_k)\| < \epsilon$ or after a maximum number of iterations
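A minimal NumPy sketch of this loop; the quadratic test function, fixed step size, and tolerance are illustrative choices, not part of the method:

```python
import numpy as np

def gradient_descent(grad, x0, alpha=0.1, tol=1e-6, max_iter=10000):
    """Fixed-step gradient descent on a function given its gradient."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:      # gradient-norm termination test
            break
        x = x - alpha * g                # step against the gradient
    return x

# Illustrative example: f(x, y) = (x - 1)^2 + 4y^2 has gradient (2(x - 1), 8y)
x_min = gradient_descent(lambda x: np.array([2 * (x[0] - 1), 8 * x[1]]),
                         x0=[5.0, 5.0])
print(x_min)   # approaches [1, 0]
```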
Types of Gradient Descent
- Batch Gradient Descent: Uses the entire dataset to compute the gradient in each iteration
- Stochastic Gradient Descent (SGD): Uses a single randomly chosen sample in each iteration
- Mini-batch Gradient Descent: Uses a small random subset of data in each iteration
Learning Rate Strategies
- Fixed Learning Rate: $\alpha_k = \alpha$ is constant
- Time-based Decay: $\alpha_k = \frac{\alpha_0}{1 + kt}$, where $t$ is the decay rate
- Exponential Decay: $\alpha_k = \alpha_0 e^{-kt}$
- Line Search: Find $\alpha_k$ that approximately minimizes $f(\mathbf{x}_k - \alpha_k \nabla f(\mathbf{x}_k))$
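As one concrete instance of the line-search idea, here is a sketch of Armijo backtracking, a common inexact line search; the constants `rho` and `c` are conventional defaults, not tuned values:

```python
import numpy as np

def backtracking_alpha(f, grad, x, alpha0=1.0, rho=0.5, c=1e-4):
    """Shrink alpha until the Armijo sufficient-decrease condition holds."""
    g = grad(x)
    alpha = alpha0
    # Accept alpha once f decreases by at least c * alpha * ||g||^2
    while f(x - alpha * g) > f(x) - c * alpha * (g @ g):
        alpha *= rho
    return alpha
```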
Advantages
- Simple to implement and understand
- Low memory requirements
- Works well for many problems
Disadvantages
- Slow convergence for ill-conditioned problems
- Can get stuck in shallow local minima
- Performance depends heavily on the choice of learning rate
Newton’s Method
Newton’s method is a second-order optimization algorithm that uses both gradient and Hessian information.
Algorithm
- Start with an initial guess $\mathbf{x}_0$
- Update: $\mathbf{x}_{k+1} = \mathbf{x}_k - [\mathbf{H}_f(\mathbf{x}_k)]^{-1} \nabla f(\mathbf{x}_k)$
- Terminate when $\|\nabla f(\mathbf{x}_k)\| < \epsilon$
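A sketch in the same style; solving the linear system $\mathbf{H}\mathbf{p} = -\nabla f$ is preferable to forming the inverse explicitly:

```python
import numpy as np

def newton_method(grad, hess, x0, tol=1e-8, max_iter=100):
    """Newton's method; solves H p = -g each iteration instead of inverting H."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x + np.linalg.solve(hess(x), -g)   # Newton step p = -H^{-1} g
    return x
```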
Advantages
- Quadratic convergence rate near the solution
- Often requires fewer iterations than gradient descent
Disadvantages
- Requires computing and inverting the Hessian matrix, which is computationally expensive
- Not guaranteed to converge for non-convex functions
- Can diverge if the initial guess is far from the solution
Quasi-Newton Methods
Quasi-Newton methods approximate the Hessian or its inverse using gradient information, avoiding the costly computation of second derivatives.
BFGS (Broyden-Fletcher-Goldfarb-Shanno) Algorithm
- Start with initial guess $\mathbf{x}_0$ and Hessian approximation $\mathbf{B}_0$ (usually the identity matrix)
- Compute the search direction $\mathbf{p}_k = -\mathbf{B}_k^{-1} \nabla f(\mathbf{x}_k)$
- Perform line search to find $\alpha_k$
- Update: $\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k \mathbf{p}_k$
- Compute $\mathbf{s}_k = \mathbf{x}_{k+1} - \mathbf{x}_k$ and $\mathbf{y}_k = \nabla f(\mathbf{x}_{k+1}) - \nabla f(\mathbf{x}_k)$
- Update the Hessian approximation: $\mathbf{B}_{k+1} = \mathbf{B}_k + \frac{\mathbf{y}_k\mathbf{y}_k^T}{\mathbf{y}_k^T\mathbf{s}_k} - \frac{\mathbf{B}_k\mathbf{s}_k\mathbf{s}_k^T\mathbf{B}_k}{\mathbf{s}_k^T\mathbf{B}_k\mathbf{s}_k}$
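A compact sketch of these steps, with simple Armijo backtracking standing in for the Wolfe line search that BFGS normally requires to guarantee $\mathbf{y}_k^T\mathbf{s}_k > 0$; the curvature guard below skips the update otherwise:

```python
import numpy as np

def bfgs(f, grad, x0, tol=1e-6, max_iter=200):
    """BFGS sketch following the B_k update formula above."""
    x = np.asarray(x0, dtype=float)
    B = np.eye(x.size)                          # B_0 = identity
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        p = np.linalg.solve(B, -g)              # p_k = -B_k^{-1} grad f(x_k)
        alpha = 1.0                             # Armijo backtracking along p
        while f(x + alpha * p) > f(x) + 1e-4 * alpha * (g @ p):
            alpha *= 0.5
        s = alpha * p                           # s_k
        y = grad(x + s) - g                     # y_k
        x = x + s
        if y @ s > 1e-12:                       # skip update if curvature fails
            Bs = B @ s
            B += np.outer(y, y) / (y @ s) - np.outer(Bs, Bs) / (s @ Bs)
    return x
```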
L-BFGS (Limited-memory BFGS)
- Stores only the $m$ most recent vector pairs $\{\mathbf{s}_i, \mathbf{y}_i\}$ instead of the full matrix
- Computes matrix-vector products implicitly
- Suitable for large-scale problems
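In practice one usually calls a library implementation; a usage sketch with SciPy, where the history size $m$ corresponds to the `maxcor` option:

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

# L-BFGS-B on a 50-dimensional Rosenbrock function; maxcor is the number
# of stored (s, y) pairs (the "m" above).
res = minimize(rosen, x0=np.zeros(50), jac=rosen_der,
               method="L-BFGS-B", options={"maxcor": 10})
print(res.x[:3], res.nit)
```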
Advantages
- Faster convergence than gradient descent
- No need to compute second derivatives
- L-BFGS has low memory requirements
Disadvantages
- More complex to implement than gradient descent
- Converges superlinearly rather than quadratically near the solution, so not quite as fast as Newton’s method there
- May not converge for non-smooth functions
Conjugate Gradient Method
The conjugate gradient method is designed to solve linear systems $\mathbf{A}\mathbf{x} = \mathbf{b}$ with symmetric positive definite $\mathbf{A}$, or equivalently to minimize the quadratic $\frac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x} - \mathbf{b}^T\mathbf{x}$, but it can be extended to non-quadratic functions.
Algorithm for Quadratic Functions
- Start with initial guess $\mathbf{x}_0$ and compute $\mathbf{r}_0 = \mathbf{b} - \mathbf{A}\mathbf{x}_0$, $\mathbf{p}_0 = \mathbf{r}_0$
- For $k = 0, 1, 2, \ldots$ until convergence:
a. Compute $\alpha_k = \frac{\mathbf{r}_k^T\mathbf{r}_k}{\mathbf{p}_k^T\mathbf{A}\mathbf{p}_k}$
b. Update $\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k\mathbf{p}_k$
c. Compute the new residual $\mathbf{r}_{k+1} = \mathbf{r}_k - \alpha_k\mathbf{A}\mathbf{p}_k$
d. Compute $\beta_k = \frac{\mathbf{r}_{k+1}^T\mathbf{r}_{k+1}}{\mathbf{r}_k^T\mathbf{r}_k}$
e. Update the search direction $\mathbf{p}_{k+1} = \mathbf{r}_{k+1} + \beta_k\mathbf{p}_k$
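A direct transcription of steps a-e; the small symmetric positive definite test system is made up for illustration:

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10):
    """CG for A x = b with symmetric positive definite A (steps a-e above)."""
    x = np.zeros_like(b, dtype=float)
    r = b - A @ x                     # initial residual
    p = r.copy()                      # initial search direction
    rs = r @ r
    for _ in range(len(b)):           # at most n steps in exact arithmetic
        Ap = A @ p
        alpha = rs / (p @ Ap)         # step a
        x = x + alpha * p             # step b
        r = r - alpha * Ap            # step c
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p     # steps d-e: beta_k = rs_new / rs
        rs = rs_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])   # SPD test matrix
print(conjugate_gradient(A, np.array([1.0, 2.0])))
```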
Nonlinear Conjugate Gradient
For general nonlinear functions, the residual is replaced by the negative gradient, a line search determines $\alpha_k$, and $\beta_k$ is computed by a formula such as Fletcher-Reeves or Polak-Ribière.
Advantages
- More efficient than gradient descent
- Does not require storing or inverting matrices
- Guaranteed to converge in at most $n$ steps for $n$-dimensional quadratic functions (in exact arithmetic)
Disadvantages
- More complex than gradient descent
- Less effective for highly non-quadratic functions
- Numerical stability issues can occur
Trust Region Methods
Trust region methods define a region around the current point within which a model (usually quadratic) is trusted to be a good approximation of the objective function.
Algorithm
- Start with initial guess $\mathbf{x}_0$ and trust region radius $\Delta_0$
- In each iteration k:
a. Construct a quadratic model: $m_k(\mathbf{p}) = f(\mathbf{x}_k) + \nabla f(\mathbf{x}_k)^T\mathbf{p} + \frac{1}{2}\mathbf{p}^T\mathbf{B}_k\mathbf{p}$
b. Find an approximate solution $\mathbf{p}_k$ to the subproblem: $\min_{\mathbf{p}} m_k(\mathbf{p})$ subject to $\|\mathbf{p}\| \leq \Delta_k$
c. Compute the ratio $\rho_k = \frac{f(\mathbf{x}_k) - f(\mathbf{x}_k + \mathbf{p}_k)}{m_k(\mathbf{0}) - m_k(\mathbf{p}_k)}$
d. Update $\mathbf{x}_{k+1}$ and $\Delta_{k+1}$ based on $\rho_k$:
- If $\rho_k$ is close to 1, increase the trust region radius
- If $\rho_k$ is small or negative, decrease the trust region radius
- If $\rho_k$ is sufficiently positive, set $\mathbf{x}_{k+1} = \mathbf{x}_k + \mathbf{p}_k$; otherwise $\mathbf{x}_{k+1} = \mathbf{x}_k$
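A usage sketch with SciPy's trust-region Newton-CG method on the built-in Rosenbrock test function:

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der, rosen_hess

# Trust-region Newton-CG; SciPy also ships 'trust-exact' and 'trust-krylov'
# variants that solve the subproblem differently.
res = minimize(rosen, x0=np.array([-1.2, 1.0]), jac=rosen_der,
               hess=rosen_hess, method="trust-ncg")
print(res.x)   # approximately [1, 1]
```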
Advantages
- More robust than line search methods
- Can handle non-convex functions better
- Naturally adapts step size based on model accuracy
Disadvantages
- Subproblem solving can be computationally expensive
- More complex to implement
- Usually requires more function evaluations per iteration
Evolutionary Algorithms
Evolutionary algorithms are population-based metaheuristic optimization algorithms inspired by biological evolution.
Genetic Algorithm
- Initialize a population of candidate solutions
- Evaluate the fitness of each candidate
- Repeat until termination:
a. Select parents based on fitness
b. Recombine pairs of parents to form new offspring
c. Mutate offspring with small probability
d. Evaluate fitness of offspring
e. Select individuals for the next generation
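A toy real-valued GA sketch using tournament selection, blend crossover, and Gaussian mutation; the population size, rates, and search range are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def genetic_algorithm(fitness, dim, pop_size=50, gens=200, mut_rate=0.1):
    """Toy GA maximizing `fitness` over real vectors."""
    pop = rng.uniform(-5, 5, (pop_size, dim))        # initial population
    for _ in range(gens):
        fit = np.apply_along_axis(fitness, 1, pop)
        children = []
        for _ in range(pop_size):
            # Tournament selection: each parent is the fitter of a random pair
            i, j = rng.integers(pop_size, size=2)
            pa = pop[i] if fit[i] > fit[j] else pop[j]
            i, j = rng.integers(pop_size, size=2)
            pb = pop[i] if fit[i] > fit[j] else pop[j]
            w = rng.random()
            child = w * pa + (1 - w) * pb            # blend crossover
            if rng.random() < mut_rate:
                child = child + rng.normal(0, 0.5, dim)   # Gaussian mutation
            children.append(child)
        pop = np.array(children)
    fit = np.apply_along_axis(fitness, 1, pop)
    return pop[np.argmax(fit)]

# Minimize sum(x^2) by maximizing its negative
best = genetic_algorithm(lambda x: -np.sum(x**2), dim=3)
print(best)   # approaches the origin
```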
Particle Swarm Optimization (PSO)
- Initialize particles with random positions and velocities
- Evaluate the objective function for each particle
- Update personal best for each particle
- Update global best among all particles
- Update velocity and position of each particle based on personal and global best
- Repeat the evaluation and update steps until convergence
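A minimal PSO sketch; the inertia weight `w` and acceleration coefficients `c1`, `c2` below are common textbook defaults, not tuned values:

```python
import numpy as np

rng = np.random.default_rng(0)

def pso(f, dim, n_particles=30, iters=200, w=0.7, c1=1.5, c2=1.5):
    """Minimal PSO minimizing f: inertia w, cognitive c1, social c2."""
    x = rng.uniform(-5, 5, (n_particles, dim))      # positions
    v = np.zeros_like(x)                            # velocities
    pbest = x.copy()                                # personal bests
    pbest_val = np.array([f(p) for p in x])
    gbest = pbest[np.argmin(pbest_val)].copy()      # global best
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = x + v
        vals = np.array([f(p) for p in x])
        improved = vals < pbest_val                 # update personal bests
        pbest[improved], pbest_val[improved] = x[improved], vals[improved]
        gbest = pbest[np.argmin(pbest_val)].copy()  # update global best
    return gbest

best = pso(lambda p: np.sum(p**2), dim=3)   # should approach the origin
print(best)
```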
Advantages
- Can handle non-differentiable, discontinuous, and noisy functions
- Good for global optimization (less likely to get stuck in local optima)
- Easily parallelizable
Disadvantages
- No guarantee of finding the global optimum
- Slower convergence compared to gradient-based methods
- Requires careful parameter tuning
Convex Optimization
Convex optimization deals with minimizing convex functions over convex sets, which guarantees that any local minimum is also a global minimum.
Convex Sets and Functions
Convex Set
A set $C$ is convex if for any two points $\mathbf{x}, \mathbf{y} \in C$ and any $\theta \in [0,1]$, the point $\theta \mathbf{x} + (1-\theta) \mathbf{y} \in C$.
Convex Function
A function $f$ is convex if its domain is a convex set, and for all $\mathbf{x}, \mathbf{y}$ in the domain and all $\theta \in [0,1]$: $f(\theta \mathbf{x} + (1-\theta) \mathbf{y}) \leq \theta f(\mathbf{x}) + (1-\theta) f(\mathbf{y})$
Strictly Convex Function
A function is strictly convex if the above inequality is strict whenever $\mathbf{x} \neq \mathbf{y}$ and $\theta \in (0,1)$.
Classes of Convex Optimization Problems
Linear Programming (LP)
Minimize a linear objective function subject to linear equality and inequality constraints: \(\min_{\mathbf{x}} \mathbf{c}^T\mathbf{x} \quad \text{subject to} \quad \mathbf{A}\mathbf{x} \leq \mathbf{b}, \, \mathbf{E}\mathbf{x} = \mathbf{d}\)
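A usage sketch with SciPy's `linprog` on a made-up two-variable LP:

```python
from scipy.optimize import linprog

# Toy LP: minimize -x0 - 2*x1  subject to  x0 + x1 <= 4,  x0 <= 3,  x >= 0
res = linprog(c=[-1, -2], A_ub=[[1, 1], [1, 0]], b_ub=[4, 3],
              bounds=[(0, None), (0, None)])
print(res.x, res.fun)   # optimum at x = [0, 4] with objective -8
```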
Quadratic Programming (QP)
Minimize a quadratic objective function subject to linear constraints: \(\min_{\mathbf{x}} \frac{1}{2}\mathbf{x}^T\mathbf{Q}\mathbf{x} + \mathbf{c}^T\mathbf{x} \quad \text{subject to} \quad \mathbf{A}\mathbf{x} \leq \mathbf{b}, \, \mathbf{E}\mathbf{x} = \mathbf{d}\) where $\mathbf{Q}$ is positive semidefinite.
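A sketch using the cvxpy modeling package (an assumption; any convex QP solver would do), with a made-up positive definite $\mathbf{Q}$:

```python
import numpy as np
import cvxpy as cp

Q = np.array([[2.0, 0.5],
              [0.5, 1.0]])            # positive definite, hence a convex QP
c = np.array([-1.0, -1.0])
x = cp.Variable(2)
prob = cp.Problem(cp.Minimize(0.5 * cp.quad_form(x, Q) + c @ x),
                  [x >= 0, cp.sum(x) == 1])
prob.solve()
print(x.value)
```

The same modeling style extends directly to the SOCP and SDP classes below.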
Second-Order Cone Programming (SOCP)
Minimize a linear function subject to second-order cone constraints: \(\min_{\mathbf{x}} \mathbf{c}^T\mathbf{x} \quad \text{subject to} \quad \|\mathbf{A}_i\mathbf{x} + \mathbf{b}_i\|_2 \leq \mathbf{c}_i^T\mathbf{x} + d_i, \, \mathbf{F}\mathbf{x} = \mathbf{g}\)
Semidefinite Programming (SDP)
Minimize a linear function subject to the constraint that a matrix is positive semidefinite: \(\min_{\mathbf{x}} \mathbf{c}^T\mathbf{x} \quad \text{subject to} \quad \mathbf{F}_0 + \sum_{i=1}^{n} x_i\mathbf{F}_i \succeq 0, \, \mathbf{A}\mathbf{x} = \mathbf{b}\)
Convex Optimization Algorithms
Interior-Point Methods
- Convert inequality constraints to equality constraints using slack variables
- Incorporate constraints into the objective using a barrier function
- Solve a sequence of subproblems with decreasing barrier parameter
- Converge to the optimal solution
Cutting-Plane Methods
- Start with a simple relaxation of the feasible set
- Solve the relaxed problem
- Add constraints (cuts) to exclude the current solution if it’s not feasible for the original problem
- Repeat until finding a feasible and optimal solution
Subgradient Methods
For non-differentiable convex functions, use subgradients instead of gradients:
- Start with initial guess $\mathbf{x}_0$
- Update: $\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha_k \mathbf{g}_k$, where $\mathbf{g}_k$ is a subgradient of $f$ at $\mathbf{x}_k$
- Use diminishing step sizes: $\alpha_k \to 0$ as $k \to \infty$, but $\sum_{k=0}^{\infty} \alpha_k = \infty$
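A sketch on a deliberately non-differentiable toy objective; note the best-iterate bookkeeping, since individual subgradient steps need not decrease $f$:

```python
import numpy as np

a = np.array([1.0, -2.0, 3.0])
f = lambda x: np.sum(np.abs(x - a))          # non-differentiable at x = a
subgrad = lambda x: np.sign(x - a)           # a valid subgradient everywhere

x = np.zeros(3)
best, best_val = x.copy(), f(x)
for k in range(5000):
    x = x - (1.0 / (k + 1)) * subgrad(x)     # alpha_k -> 0, sum alpha_k = inf
    if f(x) < best_val:
        best, best_val = x.copy(), f(x)      # track the best iterate
print(best)                                  # approaches a = [1, -2, 3]
```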
Constrained Optimization Algorithms
Penalty Methods
Penalty methods transform constrained optimization problems into unconstrained problems by adding penalty terms for constraint violations.
External Penalty Method
- Replace the original constrained problem with: \(\min_{\mathbf{x}} f(\mathbf{x}) + \mu_k \sum_i P(g_i(\mathbf{x})) + \mu_k \sum_j Q(h_j(\mathbf{x}))\) where $P$ and $Q$ are penalty functions for inequality and equality constraints
- Solve a sequence of unconstrained problems with increasing penalty parameter $\mu_k$
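A sketch of the external penalty loop on a toy equality-constrained problem, reusing SciPy's unconstrained `minimize` for each subproblem:

```python
import numpy as np
from scipy.optimize import minimize

# Toy problem: minimize f(x) = x0^2 + x1^2 subject to h(x) = x0 + x1 - 1 = 0
f = lambda x: x[0]**2 + x[1]**2
h = lambda x: x[0] + x[1] - 1.0

x = np.zeros(2)
for mu in [1.0, 10.0, 100.0, 1000.0]:                  # increasing penalty
    x = minimize(lambda z: f(z) + mu * h(z)**2, x).x   # warm-started subproblem
print(x)   # approaches the true solution [0.5, 0.5] as mu grows
```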
Internal Penalty (Barrier) Method
For inequality constraints $g_i(\mathbf{x}) \leq 0$:
- Replace the original problem with: \(\min_{\mathbf{x}} f(\mathbf{x}) - \mu_k \sum_i \ln(-g_i(\mathbf{x}))\)
- Solve a sequence of unconstrained problems with decreasing barrier parameter $\mu_k$
Augmented Lagrangian Methods
Augmented Lagrangian methods combine Lagrange multipliers with penalty functions:
Algorithm
- Start with initial guess $\mathbf{x}_0$, multipliers $\lambda_0$, $\mu_0$, and penalty parameter $\rho_0$
- For $k = 0, 1, 2, \ldots$:
a. Approximately minimize the augmented Lagrangian: \(L_A(\mathbf{x}, \lambda_k, \mu_k; \rho_k) = f(\mathbf{x}) + \sum_i \lambda_{k,i} g_i(\mathbf{x}) + \sum_j \mu_{k,j} h_j(\mathbf{x}) + \frac{\rho_k}{2} \left( \sum_i \max(0, g_i(\mathbf{x}))^2 + \sum_j h_j(\mathbf{x})^2 \right)\)
b. Update the Lagrange multipliers: \(\lambda_{k+1,i} = \lambda_{k,i} + \rho_k \max(0, g_i(\mathbf{x}_{k+1}))\) and \(\mu_{k+1,j} = \mu_{k,j} + \rho_k h_j(\mathbf{x}_{k+1})\)
c. Update the penalty parameter $\rho_{k+1}$ based on constraint violation
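An equality-constrained sketch of this scheme on the same toy problem as the penalty example; unlike the pure penalty method, the multiplier update lets $\rho$ stay moderate:

```python
import numpy as np
from scipy.optimize import minimize

f = lambda x: x[0]**2 + x[1]**2
h = lambda x: x[0] + x[1] - 1.0          # equality constraint h(x) = 0

x, mu, rho = np.zeros(2), 0.0, 10.0
for _ in range(20):
    # a. approximately minimize the augmented Lagrangian in x
    x = minimize(lambda z: f(z) + mu * h(z) + 0.5 * rho * h(z)**2, x).x
    mu = mu + rho * h(x)                 # b. multiplier update
print(x, mu)   # x -> [0.5, 0.5]; mu -> -1, the true Lagrange multiplier
```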
Sequential Quadratic Programming (SQP)
SQP solves a sequence of quadratic programming subproblems that approximate the original problem.
Algorithm
- Start with initial guess $\mathbf{x}_0$, Lagrange multiplier estimates $\lambda_0$, and Hessian approximation $\mathbf{B}_0$
- For $k = 0, 1, 2, \ldots$:
a. Formulate and solve the QP subproblem: \(\min_{\mathbf{p}} \nabla f(\mathbf{x}_k)^T \mathbf{p} + \frac{1}{2} \mathbf{p}^T \mathbf{B}_k \mathbf{p}\) subject to \(\nabla g_i(\mathbf{x}_k)^T \mathbf{p} + g_i(\mathbf{x}_k) \leq 0\) and \(\nabla h_j(\mathbf{x}_k)^T \mathbf{p} + h_j(\mathbf{x}_k) = 0\)
b. Perform a line search to find the step size $\alpha_k$
c. Update: $\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k \mathbf{p}_k$
d. Update the Lagrange multipliers $\lambda_{k+1}$ from the QP solution
e. Update the Hessian approximation $\mathbf{B}_{k+1}$ using BFGS or a similar method
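A usage sketch with SciPy's SLSQP, a sequential least-squares QP method in this family; note SciPy's convention that inequality constraints are written as $g(\mathbf{x}) \geq 0$:

```python
import numpy as np
from scipy.optimize import minimize

cons = [{"type": "ineq", "fun": lambda x: 1 - x[0]**2 - x[1]**2},  # unit disk
        {"type": "eq",   "fun": lambda x: x[0] - x[1]}]            # x0 = x1
res = minimize(lambda x: (x[0] - 2)**2 + (x[1] - 1)**2,
               x0=np.zeros(2), method="SLSQP", constraints=cons)
print(res.x)   # approximately [0.707, 0.707], on the disk boundary
```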
Active Set Methods
Active set methods maintain a guess of the active inequality constraints at the solution.
Algorithm
- Start with a feasible point $\mathbf{x}_0$ and an initial working set $\mathcal{W}_0$ of active constraints
- For $k = 0, 1, 2, \ldots$:
a. Solve the equality-constrained subproblem using constraints in $\mathcal{W}_k$
b. Compute Lagrange multipliers $\lambda_i$ for constraints in $\mathcal{W}_k$
c. If all $\lambda_i \geq 0$, check for additional violated constraints:
- If none, $\mathbf{x}_k$ is optimal
- If some constraint is violated, add the most violated one to $\mathcal{W}_k$
d. If some $\lambda_i < 0$, remove the constraint with the most negative $\lambda_i$ from $\mathcal{W}_k$
e. Update $\mathbf{x}_{k+1}$ and $\mathcal{W}_{k+1}$
Interior-Point Methods for Constrained Optimization
Interior-point methods traverse the interior of the feasible region toward the optimal solution.
Primal-Dual Interior-Point Method
- Convert inequality constraints to equality constraints using slack variables
- Form the barrier problem with logarithmic barrier terms
- Solve the KKT conditions for the barrier problem using Newton’s method
- Gradually reduce the barrier parameter to converge to the solution
Applications
Machine Learning and Data Science
Linear Regression
- Ordinary Least Squares (OLS): $\min_{\mathbf{w}} \|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2$
- Ridge Regression: $\min_{\mathbf{w}} \|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 + \lambda\|\mathbf{w}\|_2^2$
- Lasso Regression: $\min_{\mathbf{w}} \|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 + \lambda\|\mathbf{w}\|_1$
- Elastic Net: $\min_{\mathbf{w}} \|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 + \lambda_1\|\mathbf{w}\|_1 + \lambda_2\|\mathbf{w}\|_2^2$
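Ridge regression is one of the few cases above with a closed-form solution, $\mathbf{w} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$; a sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                    # synthetic design matrix
w_true = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=100)      # noisy targets

lam = 0.1
# Solve (X^T X + lam I) w = X^T y rather than forming the inverse
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
print(w_ridge)   # close to w_true for small noise and small lam
```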
Classification
- Logistic Regression: $\min_{\mathbf{w}} \sum_{i=1}^{n} \log(1 + \exp(-y_i \mathbf{w}^T\mathbf{x}_i)) + \lambda\|\mathbf{w}\|_2^2$
- Support Vector Machines (SVM): $\min_{\mathbf{w},b} \frac{1}{2}\|\mathbf{w}\|_2^2 + C\sum_{i=1}^{n} \max(0, 1-y_i(\mathbf{w}^T\mathbf{x}_i + b))$
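The logistic objective above is smooth and convex, so any of the earlier unconstrained methods apply; a sketch that hands it to SciPy's L-BFGS-B on synthetic $\pm 1$ labels:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.sign(X @ np.array([2.0, -1.0, 0.5]))      # synthetic +/-1 labels

lam = 0.1
def loss(w):
    # Regularized logistic loss; logaddexp(0, t) = log(1 + exp(t)), stably
    return np.sum(np.logaddexp(0.0, -y * (X @ w))) + lam * (w @ w)

w_hat = minimize(loss, np.zeros(3), method="L-BFGS-B").x
print(w_hat)   # roughly proportional to the generating weights
```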
Neural Network Training
- Stochastic Gradient Descent (SGD)
- Adam, RMSprop, AdaGrad (adaptive learning rate methods)
- L-BFGS for smaller networks
Operations Research
Resource Allocation
- Maximize utility or profit subject to resource constraints
- Applications in production planning, portfolio optimization, and scheduling
Network Flow Problems
- Shortest path, maximum flow, minimum cost flow
- Applications in transportation, logistics, and telecommunications
Integer Programming
- Problems where some or all variables must be integers
- Applications in facility location, vehicle routing, and crew scheduling
Control Systems
Model Predictive Control (MPC)
- Optimization-based control strategy that predicts system behavior over a horizon
- Solves an optimization problem at each time step to determine control inputs
Linear Quadratic Regulator (LQR)
- Optimal control for linear systems with quadratic cost
- Minimizes a weighted sum of state deviations and control effort
H-infinity Control
- Robust control approach that minimizes the H-infinity norm of the closed-loop transfer function
- Provides guarantees of stability and performance despite uncertainty
Computer Vision and Graphics
Image Registration
- Aligning images by finding the optimal transformation parameters
- Applications in medical imaging, remote sensing, and augmented reality
Structure from Motion
- Reconstructing 3D structure from 2D image sequences
- Involves optimization of camera poses and 3D point positions
Image Denoising and Reconstruction
- Total Variation (TV) minimization: $\min_{\mathbf{u}} \|\nabla \mathbf{u}\|_1 + \lambda\|\mathbf{u} - \mathbf{f}\|_2^2$
- Applications in medical imaging, photography, and computer vision
Economics and Finance
Portfolio Optimization
- Mean-Variance Optimization: $\min_{\mathbf{w}} \mathbf{w}^T \mathbf{\Sigma} \mathbf{w} - \lambda \mathbf{w}^T \mathbf{\mu}$ subject to $\mathbf{1}^T\mathbf{w} = 1$
- Black-Litterman model: Incorporates both market equilibrium and investor views
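A sketch of the mean-variance problem in cvxpy; $\mathbf{\Sigma}$, $\mathbf{\mu}$, the risk-tolerance $\lambda$, and the long-only constraint are all made-up illustrative inputs:

```python
import numpy as np
import cvxpy as cp

mu = np.array([0.08, 0.12, 0.10])               # made-up expected returns
Sigma = np.array([[0.10, 0.02, 0.01],           # made-up covariance matrix
                  [0.02, 0.12, 0.03],
                  [0.01, 0.03, 0.09]])
lam = 2.0
w = cp.Variable(3)
prob = cp.Problem(cp.Minimize(cp.quad_form(w, Sigma) - lam * mu @ w),
                  [cp.sum(w) == 1, w >= 0])     # budget + long-only
prob.solve()
print(w.value)
```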
Option Pricing and Risk Management
- Calibration of financial models to market data
- Value-at-Risk (VaR) optimization for portfolio risk management
Game Theory Applications
- Finding Nash equilibria in economic games
- Market clearing prices in auction mechanisms