Optimization problems can be divided into minimization and maximization problems, and any maximization problem can be transformed into an equivalent minimization problem by negating the objective. Therefore, in what follows we focus on minimization. Gradient descent is an iterative optimization algorithm for finding a local minimum of a function: it gradually approaches a minimum point by moving in the direction opposite to the function's gradient.
Equivalent Problems
First, let us prove some lemmas about monotonicity.
If $g$ is a strictly increasing function, then
$$g(u_1) < g(u_2) \iff u_1 < u_2.$$
That is, the output strictly increases if and only if the input strictly increases.
($\Leftarrow$) is immediate from the definition of strictly increasing. ($\Rightarrow$) By contradiction: if $u_1 \ge u_2$, then $g(u_1) \ge g(u_2)$, which is a contradiction. ∎
If $g$ is a strictly monotonic function, then $g$ is injective. That is,
$$g(u_1) = g(u_2) \implies u_1 = u_2.$$
We need to prove injectivity. Take $g$ strictly increasing on its domain as an example (the decreasing case is symmetric). Let $g(u_1) = g(u_2)$. Suppose $u_1 < u_2$; then by strict monotonicity $g(u_1) < g(u_2)$, which is a contradiction. Similarly $u_1 > u_2$ is impossible; therefore $u_1 = u_2$. ∎
By the lemmas, for a strictly increasing $g$ we have
$$f(x_1) \le f(x_2) \iff g(f(x_1)) \le g(f(x_2)).$$
Thus for some point $x^*$,
$$f(x^*) = \min_x f(x) \iff g(f(x^*)) = \min_x g(f(x)).$$
This proves that a minimum point of $f$ is also a minimum point of $g \circ f$, and a minimum point of $g \circ f$ is also a minimum point of $f$. ∎
Based on this, the optimization problem
$$\min_{x \in \Omega} f(x)$$
always has the same solutions as $\min_{x \in \Omega} g(f(x))$ if $g$ is a strictly increasing function, or as $\max_{x \in \Omega} g(f(x))$ if $g$ is a strictly decreasing function.
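As a quick numerical illustration (the quadratic $f$ and the choice $g = \log$ below are illustrative, not taken from the discussion above):

```python
import math

def f(x):
    # A simple function with a unique minimum at x = 2.
    return (x - 2.0) ** 2 + 1.0

def argmin_on_grid(h, lo=-5.0, hi=5.0, steps=10001):
    # Brute-force minimizer over a uniform grid.
    xs = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return min(xs, key=h)

x_f = argmin_on_grid(f)
# g = log is strictly increasing on (0, inf), and f > 0 everywhere.
x_gf = argmin_on_grid(lambda x: math.log(f(x)))

# Both problems are minimized at the same point (up to grid resolution).
print(x_f, x_gf)
```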
Consider an interesting matrix optimization problem that demonstrates the equivalence of different objective functions under monotonic transformations.
Let $A = (a_{ij}) \in \mathbb{R}^{n \times n}$ be a non-negative matrix with column sum constraints: $\sum_{i=1}^n a_{ij} = c$ for all $j$ (where $c > 0$ is a constant).
Define two objective functions:
$$f_1(A) = \frac{\sum_i a_{ii}}{\sum_{i \ne j} a_{ij}} \quad \text{(diagonal elements / off-diagonal elements)}$$
$$f_2(A) = \frac{\sum_i a_{ii}}{\sum_{i,j} a_{ij}} \quad \text{(diagonal elements / all elements)}$$
Due to the column sum constraint, the total sum of all elements is:
$$\sum_{i,j} a_{ij} = \sum_{j=1}^n \left( \sum_{i=1}^n a_{ij} \right) = nc.$$
Therefore:
$$f_2(A) = \frac{\sum_i a_{ii}}{nc}.$$
The sum of off-diagonal elements is:
$$\sum_{i \ne j} a_{ij} = nc - \sum_i a_{ii}.$$
So:
$$f_1(A) = \frac{\sum_i a_{ii}}{nc - \sum_i a_{ii}}.$$
Key Observation: Let $s = f_2(A)$, then:
$$f_1(A) = \frac{s}{1 - s}.$$
These two functions have a monotonic relationship:
$$\frac{d}{ds}\left(\frac{s}{1-s}\right) = \frac{1}{(1-s)^2} > 0.$$
Since $\frac{s}{1-s}$ is a strictly increasing function of $s$ (in the range $0 \le s < 1$), we have:
The optimization problems $\max_A f_1(A)$ and $\max_A f_2(A)$ are equivalent and have the same optimal solutions.
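This equivalence is easy to check numerically (a sketch using NumPy; the matrix size $n$ and column sum $c$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, c = 5, 3.0

# Random non-negative matrix whose columns each sum to c.
A = rng.random((n, n))
A = A / A.sum(axis=0) * c

diag = np.trace(A)
total = A.sum()            # equals n * c by the column-sum constraint
off_diag = total - diag

f1 = diag / off_diag       # diagonal / off-diagonal
f2 = diag / total          # diagonal / all elements

# f1 = f2 / (1 - f2), a strictly increasing transform of f2 on [0, 1).
print(f1, f2 / (1 - f2))
```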
Unconstrained Case
$\Omega = \mathbb{R}^n$ is the $n$-dimensional Euclidean space, and $f: \mathbb{R}^n \to \mathbb{R}$ is a differentiable function. Finding the minimum point and minimum value of $f$ is the optimization problem
$$\min_{x \in \mathbb{R}^n} f(x).$$
The iterative equation is
$$x_{k+1} = x_k - \eta \nabla f(x_k),$$
where the step size $\eta$ is a positive number that determines the update magnitude at each iteration, and $\nabla f(x_k)$ is the gradient of the function $f$ at the point $x_k$. The algorithm's goal is to make $x_k \to x^*$ as $k \to \infty$, where $\nabla f(x^*) = 0$; that is to say, $x^*$ is a minimum point. The pseudocode is as follows:
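The update rule can be sketched in Python as follows (the gradient-norm stopping rule, the parameter defaults, and the example objective $\|x\|^2$ are illustrative assumptions):

```python
import numpy as np

def gradient_descent(grad, x0, eta=0.1, tol=1e-10, max_iter=100000):
    """Minimize a differentiable function by moving against its gradient.

    grad: function returning the gradient at a point
    x0:   initial point (array-like)
    eta:  step size
    tol:  stop when the gradient norm falls below this threshold
    """
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - eta * g   # move opposite to the gradient
    return x, k

# Example: f(x) = ||x||^2, whose gradient is 2x and whose minimum is the origin.
x_star, iters = gradient_descent(lambda x: 2 * x, [1.0, 1.0], eta=0.1)
print(x_star, iters)
```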
Let us look at an example in two dimensions, $x = (x_1, x_2) \in \mathbb{R}^2$. First, analytical methods show that the objective satisfies $f(x) \ge 0$ for all $x$, with equality if and only if $x = 0$. Therefore, the minimum value is achieved at the origin. Next, we use gradient descent to solve the same problem numerically.
We fix an initial point, a step size, and a stopping condition. After 212 iterations, the function value reaches 9.925765507684842e-21. The trajectory left by each iteration in the feasible region is shown in the figure below. The black solid lines in the figure are the contour lines of the function $f$, and the arrows indicate the gradient field. The coloring indicates the magnitude of the function value, with darker colors representing larger values. The red points mark the position after each iteration, and the connecting lines form the iteration trajectory. It can be seen that each iteration moves opposite to the gradient direction with a step length proportional to the gradient magnitude, and the iteration points gradually approach the origin, the theoretical minimum point.
Increasing the step size speeds up convergence: with a larger step size, the function value reaches 7.503260807194337e-21 after only 44 iterations, and with a still larger one it reaches 5.929230630780102e-21 after 35 iterations. If the step size is increased too far, however, the iteration diverges. Evidently the convergence speed depends on the step size, and an overly large step size causes divergence; therefore, we need to choose an appropriate step size to ensure convergence.
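This step-size behavior can be made precise on a one-dimensional quadratic (an illustrative example, not the document's own objective): for $f(x) = x^2$ the update is $x \leftarrow (1 - 2\eta)x$, which converges if and only if $|1 - 2\eta| < 1$, i.e. $0 < \eta < 1$:

```python
def run(eta, x0=1.0, iters=50):
    # Gradient descent on f(x) = x^2 with gradient f'(x) = 2x.
    x = x0
    for _ in range(iters):
        x = x - eta * 2 * x   # contraction factor (1 - 2*eta)
    return x

print(abs(run(0.1)))   # converges: |1 - 0.2| = 0.8 < 1
print(abs(run(0.4)))   # converges faster: |1 - 0.8| = 0.2
print(abs(run(1.1)))   # diverges: |1 - 2.2| = 1.2 > 1
```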
Constrained Case
When dealing with constrained optimization problems, we need to find the minimum of a function subject to constraints. The general form is:
$$\min_{x} f(x) \quad \text{subject to} \quad x \in C,$$
where $C \subseteq \mathbb{R}^n$ is the feasible set determined by the constraints.
For constrained problems, we cannot simply move in the negative gradient direction as this may violate the constraints. Instead, we need to project the gradient onto the feasible region or use penalty methods.
Projected Gradient Method
The projected gradient method modifies standard gradient descent by projecting each iterate onto the feasible set $C$:
$$x_{k+1} = P_C\big(x_k - \eta \nabla f(x_k)\big),$$
where $P_C$ denotes the Euclidean projection operator onto the constraint set $C$, i.e. $P_C(y) = \arg\min_{x \in C} \|x - y\|$.
The pseudocode for the projected gradient method is:
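A Python sketch of this procedure (the stopping rule on successive iterates, the parameter defaults, and the example constraint $x_1 + x_2 = 1$ are illustrative assumptions):

```python
import numpy as np

def projected_gradient(grad, project, x0, eta=0.1, tol=1e-12, max_iter=100000):
    """Gradient step followed by projection back onto the feasible set C.

    project: maps any point to its nearest point in C
    """
    x = project(np.asarray(x0, dtype=float))  # start from a feasible point
    for _ in range(max_iter):
        x_new = project(x - eta * grad(x))    # descend, then restore feasibility
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x

# Illustration: minimize ||x||^2 over the line x1 + x2 = 1 (a made-up constraint).
a, b = np.array([1.0, 1.0]), 1.0
project = lambda y: y - (a @ y - b) / (a @ a) * a   # orthogonal projection onto the line
x_star = projected_gradient(lambda x: 2 * x, project, [2.0, -1.0])
print(x_star)
```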
Example: Linear Constraint
Consider the optimization problem of minimizing the same objective $f$ as in the unconstrained case, now subject to a linear equality constraint $a^\top x = b$. The feasible set is a line in the plane. The unconstrained minimum (the origin) does not satisfy the constraint, so we expect the constrained optimum to lie on the constraint line. The gradient of $f$ is the same as in the unconstrained case. For the linear constraint $a^\top x = b$, the orthogonal projection onto the line is:
$$P_C(y) = y - \frac{a^\top y - b}{\|a\|^2}\, a.$$
Algorithm implementation:
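One possible implementation of the projection step (a sketch; the coefficients $a$, $b$ and the sample point below are placeholders, since the example's actual constraint values are not reproduced in this text):

```python
import numpy as np

def project_onto_line(x, a, b):
    """Orthogonally project x onto the hyperplane {y : a @ y == b}."""
    a = np.asarray(a, dtype=float)
    x = np.asarray(x, dtype=float)
    return x - (a @ x - b) / (a @ a) * a

# The projected point satisfies the constraint and is the nearest such point.
p = project_onto_line([3.0, 0.0], [1.0, 1.0], 1.0)
print(p)
```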
We demonstrate the algorithm starting from the same initial point as in the unconstrained case. This point lies off the constraint line, so the algorithm first projects it onto the constraint. After 1000 iterations, it converges to the constrained optimal point, where the function value is 0.75.
The visualization below shows the optimization trajectory and the projection process. The black line represents the constraint.
For all iterations, the visualization shows (with later iterations becoming more transparent):
- Red arrows: gradient steps from current point to unconstrained update
- Orange dotted lines: projection steps from the unconstrained update back to the constraint
This clearly demonstrates how the projected gradient method alternates between taking gradient steps and projecting back onto the feasible set. The gradient arrows show both the direction and the magnitude of each descent step, while the projection steps ensure feasibility. The path successfully reaches the constrained optimal point.
Penalty Method
Another approach for handling constraints is the penalty method, where we convert the constrained problem into an unconstrained one by adding penalty terms. For an equality constraint $h(x) = 0$, a common quadratic form is:
$$\min_x \; f(x) + \mu \,\|h(x)\|^2,$$
where $\mu > 0$ is the penalty parameter. As $\mu \to \infty$, the solution of the penalized problem approaches the solution of the original constrained problem.
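A quadratic-penalty sketch in Python (the objective, constraint, penalty schedule, and step-size rule below are all illustrative assumptions, not taken from the text):

```python
import numpy as np

def penalty_method(grad_f, h, grad_h, x0, mus=(1.0, 10.0, 100.0, 1000.0),
                   iters=5000):
    """Minimize f(x) + mu * h(x)^2 over an increasing sequence of penalties mu.

    Each penalized subproblem is solved approximately by gradient descent,
    warm-started from the previous solution.
    """
    x = np.asarray(x0, dtype=float)
    for mu in mus:
        # Step size chosen small enough to keep this quadratic illustration stable.
        eta = 1.0 / (2.0 + 4.0 * mu)
        for _ in range(iters):
            # Gradient of the penalized objective f(x) + mu * h(x)^2.
            g = grad_f(x) + 2.0 * mu * h(x) * grad_h(x)
            x = x - eta * g
    return x

# Illustration: min ||x||^2 subject to x1 + x2 - 1 = 0; the true optimum is (0.5, 0.5).
x_star = penalty_method(lambda x: 2 * x,
                        lambda x: x[0] + x[1] - 1.0,
                        lambda x: np.array([1.0, 1.0]),
                        [0.0, 0.0])
print(x_star)
```

Note that each finite $\mu$ yields a slightly infeasible solution; only as $\mu$ grows does the iterate approach the constrained optimum.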
Convergence
Next, we rigorously analyze the convergence of gradient descent.
Convex Optimization
If the optimization problem is convex, then gradient descent with a suitably chosen step size is guaranteed to find the global minimum. The definition of a convex function is: for any $x_1, x_2$ and any $\lambda \in [0, 1]$, we have
$$f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2).$$
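As a quick sanity check of this inequality on a concrete convex function (the choice $f(x) = x^2$ is illustrative):

```python
import random

def f(x):
    return x * x  # a standard example of a convex function

random.seed(0)
for _ in range(1000):
    x1, x2 = random.uniform(-10, 10), random.uniform(-10, 10)
    lam = random.random()
    # Convexity: f(lam*x1 + (1-lam)*x2) <= lam*f(x1) + (1-lam)*f(x2)
    lhs = f(lam * x1 + (1 - lam) * x2)
    rhs = lam * f(x1) + (1 - lam) * f(x2)
    assert lhs <= rhs + 1e-9  # small cushion for floating-point error
print("convexity inequality holds on all sampled points")
```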