Optimization problems can be divided into minimization and maximization problems, and a maximization problem can always be transformed into an equivalent minimization problem. Therefore, in what follows we focus on minimization. Gradient descent is an iterative optimization algorithm for finding a local minimum of a function: it approaches a minimum point by repeatedly moving in the direction opposite to the function's gradient.

Equivalent Problems

First, let us prove some lemmas about monotonicity.

Proposition 1.
(Order Preservation of Strictly Monotonic Functions)

If $f$ is a strictly increasing function, then

$$x_1 < x_2 \iff f(x_1) < f(x_2)$$
(1)

That is, the output strictly increases if and only if the input strictly increases.

Proof.
($\Rightarrow$) This is the definition of strictly increasing: $x_1 < x_2 \implies f(x_1) < f(x_2)$.
($\Leftarrow$) By contradiction: if $x_1 \ge x_2$, then $f(x_1) \ge f(x_2)$, which contradicts $f(x_1) < f(x_2)$.
Proposition 2.
(Strictly Monotonic Function Injection)

If 𝑓 is a strictly monotonic function, then 𝑓 is injective. That is,

$$x_1 = x_2 \iff f(x_1) = f(x_2)$$
(2)
Proof.
($\Rightarrow$) This follows from the definition of a function: equal inputs give equal outputs.
($\Leftarrow$) We need to prove injectivity. Take $f$ strictly increasing on $X$ as an example. Let $x_1, x_2 \in X$ with $f(x_1) = f(x_2)$. Suppose $x_1 < x_2$; then by strict monotonicity, $x_1 < x_2 \implies f(x_1) < f(x_2)$, which is a contradiction. Similarly $x_1 > x_2$ is impossible, therefore $x_1 = x_2$.
Proposition 3.
(Strictly Monotonic Functions Preserve Extreme Points)
Let $Y \subseteq \mathbb{R}$, let $f: X \to Y$ be an arbitrary function, and let $g: Y \to \mathbb{R}$ be a strictly increasing function. Then $g \circ f$ and $f$ have the same extreme points. If instead $g$ is strictly decreasing, the extreme points are reversed: minimum points of $f$ become maximum points of $g \circ f$, and vice versa.
Proof.

By the preceding propositions, we know $y_1 \le y_2 \iff g(y_1) \le g(y_2)$. Thus for a point $x^* \in X$, any $\delta > 0$, and any $x \in U(x^*, \delta)$,

$$f(x^*) \le f(x) \iff g(f(x^*)) \le g(f(x))$$
(3)

This proves that a minimum point of $f$ is also a minimum point of $g \circ f$, and a minimum point of $g \circ f$ is also a minimum point of $f$.

Based on this, for the optimization problem

$$\arg\min_{x} f(x)$$
(4)

it always has the same solution set as $\arg\min_x g(f(x))$ when $g$ is a strictly increasing function, and as $\arg\max_x g(f(x))$ when $g$ is a strictly decreasing function.
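As a quick numerical illustration (a minimal sketch, assuming NumPy; the function $f(x) = (x-2)^2 + 1$ is chosen here purely for illustration), a strictly increasing transform such as $\log$ leaves the argmin unchanged, while a strictly decreasing transform such as $y \mapsto -y$ turns the argmin into an argmax:

```python
import numpy as np

# f is positive everywhere, so log(f) is well defined.
f = lambda x: (x - 2.0) ** 2 + 1.0

xs = np.linspace(-5.0, 5.0, 10001)   # dense grid over the domain
fx = f(xs)

i_min = np.argmin(fx)                # argmin of f
i_log = np.argmin(np.log(fx))        # argmin of g∘f with g = log (increasing)
i_neg = np.argmax(-fx)               # argmax of g∘f with g = -id (decreasing)

assert i_min == i_log == i_neg       # all three pick the same grid point
print(xs[i_min])                     # ≈ 2.0
```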

Example 1.

Consider an interesting matrix optimization problem that demonstrates the equivalence of different objective functions under monotonic transformations.

Let $\boldsymbol{X} = (x_{ij})_{n \times n}$ be a non-negative matrix with column-sum constraints: $\sum_{i=1}^{n} x_{ij} = c$ for all $j$ (where $c$ is a constant).

Define two objective functions:

$$f_1(\boldsymbol{X}) = \frac{\sum_{i=1}^{n} x_{ii}}{\sum_{i \ne j} x_{ij}}$$
(5)

(diagonal elements / off-diagonal elements)

$$f_2(\boldsymbol{X}) = \frac{\sum_{i=1}^{n} x_{ii}}{\sum_{i,j=1}^{n} x_{ij}}$$
(6)

(diagonal elements / all elements)

Due to the column-sum constraint, the total sum of all elements is: $\sum_{i,j=1}^{n} x_{ij} = nc$

Therefore: $f_2(\boldsymbol{X}) = \frac{\sum_{i=1}^{n} x_{ii}}{nc}$

The sum of off-diagonal elements is: $\sum_{i \ne j} x_{ij} = nc - \sum_{i=1}^{n} x_{ii}$

So: $f_1(\boldsymbol{X}) = \frac{\sum_{i=1}^{n} x_{ii}}{nc - \sum_{i=1}^{n} x_{ii}}$

Key Observation: Let $s = \sum_{i=1}^{n} x_{ii}$, then:

  • $f_2 = \dfrac{s}{nc}$
  • $f_1 = \dfrac{s}{nc - s}$

These two functions have a monotonic relationship: dividing the numerator and denominator of $f_1$ by $nc$ gives $f_1 = \frac{s/(nc)}{1 - s/(nc)} = \frac{f_2}{1 - f_2}$.

Since $f_1$ is a strictly increasing function of $f_2$ (on the range $f_2 < 1$), we have:

The optimization problems $\max f_1(\boldsymbol{X})$ and $\max f_2(\boldsymbol{X})$ are equivalent and have the same optimal solution.
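The equivalence can also be checked numerically. Below is a minimal sketch (assuming NumPy; the sampler and the choices $n = 4$, $c = 1$ are illustrative): we draw random non-negative matrices with each column rescaled to sum to $c$, and verify that $f_1$ and $f_2$ order them identically.

```python
import numpy as np

rng = np.random.default_rng(0)
n, c = 4, 1.0

def sample_matrix():
    # Random non-negative matrix whose columns each sum to c.
    X = rng.random((n, n))
    return c * X / X.sum(axis=0, keepdims=True)

def f1(X):  # diagonal sum / off-diagonal sum
    s = np.trace(X)
    return s / (X.sum() - s)

def f2(X):  # diagonal sum / total sum
    return np.trace(X) / X.sum()

mats = [sample_matrix() for _ in range(1000)]
v1 = np.array([f1(X) for X in mats])
v2 = np.array([f2(X) for X in mats])

# Same ordering of candidates, hence the same maximizer.
assert np.argmax(v1) == np.argmax(v2)
assert np.array_equal(np.argsort(v1), np.argsort(v2))
```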

Unconstrained Case

$\mathbb{R}^n$ is the $n$-dimensional Euclidean space. Let $\boldsymbol{x} \in \mathbb{R}^n$, and let $f: \mathbb{R}^n \to \mathbb{R}$ be a differentiable function. Finding the minimum point and minimum value of $f$ is the optimization problem

$$\min_{\boldsymbol{x} \in \mathbb{R}^n} f(\boldsymbol{x})$$
(7)

The iterative equation is

$$\boldsymbol{x}_{k+1} = \boldsymbol{x}_k - \alpha \nabla f(\boldsymbol{x}_k)$$
(8)

where the step size $\alpha$ is a positive number that determines the update magnitude at each iteration, and $\nabla f(\boldsymbol{x}_k)$ is the gradient of $f$ at the point $\boldsymbol{x}_k$. The algorithm's goal is to make $\boldsymbol{x}_k \to \boldsymbol{x}^*$ as $k \to \infty$, where $\boldsymbol{x}^* \in \arg\min f(\boldsymbol{x})$; that is, $\boldsymbol{x}^*$ is a minimum point. The pseudocode is as follows:
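Here is a minimal Python rendering of that pseudocode (the gradient-norm stopping rule and the `max_iter` cap are illustrative assumptions, not from the source):

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha, tol=1e-10, max_iter=10_000):
    """Iterate x_{k+1} = x_k - alpha * grad_f(x_k) until the gradient
    is small (an illustrative stopping rule) or max_iter is reached."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:   # convergence test
            return x, k
        x = x - alpha * g             # step against the gradient
    return x, max_iter
```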

Let us look at an example:

$$\min_{(x_1, x_2) \in \mathbb{R}^2} f(x_1, x_2)$$
(9)

where $f(x_1, x_2) = x_1^2 + x_1 x_2 + x_2^2$. First, we know through analytical methods that for all $x_1, x_2 \in \mathbb{R}$,

$$x_1^2 + x_1 x_2 + x_2^2 = (x_1, x_2) \begin{pmatrix} 1 & 1/2 \\ 1/2 & 1 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \ge 0$$
(10)

Equality holds if and only if $x_1 = 0, x_2 = 0$. Therefore, the minimum value $0$ is achieved at the origin. Next, we use gradient descent to solve this.

$$\nabla f\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 2x_1 + x_2 \\ 2x_2 + x_1 \end{pmatrix}$$
(11)
$$\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}_{k+1} = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}_k - \alpha \begin{pmatrix} 2x_1 + x_2 \\ 2x_2 + x_1 \end{pmatrix}_k$$
(12)

We set the initial condition to $\boldsymbol{x}_0 = (1.0, 2.0)^T$, step size $\alpha = 0.1$, and stopping condition $f \le 10^{-20}$. After 212 iterations, the function value reaches 9.925765507684842e-21. The trajectory left by each iteration in the feasible region is shown in the figure below. The black solid lines in the figure are the contour lines of the function $f$, and the arrows indicate the gradient field. The coloring indicates the magnitude of the function value, with darker colors representing larger values. The red points are the positions after each update, and the connecting lines form the iteration trajectory. It can be seen that each iteration moves opposite to the gradient direction with a step proportional to the gradient magnitude, and the iterates gradually approach the origin, the theoretical minimum point.
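This run can be reproduced with a short script (a sketch assuming NumPy; it mirrors the update (12) and the stated stopping rule):

```python
import numpy as np

f = lambda x: x[0]**2 + x[0]*x[1] + x[1]**2
grad = lambda x: np.array([2*x[0] + x[1], 2*x[1] + x[0]])

x = np.array([1.0, 2.0])   # initial condition x_0
alpha = 0.1                # step size
k = 0
while f(x) > 1e-20:        # stopping condition f <= 1e-20
    x = x - alpha * grad(x)
    k += 1
print(k, f(x))             # about 212 iterations, f on the order of 1e-20
```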

When we increase the step size to 0.4, after 44 iterations, the function value reaches 7.503260807194337e-21.

When we increase the step size to 0.5, after 35 iterations, the function value reaches 5.929230630780102e-21.

If the step size is increased beyond $2/3$, the iteration diverges: the Hessian of $f$ is $\begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$, whose largest eigenvalue is $3$, and gradient descent on a quadratic converges only when $\alpha$ is smaller than $2$ divided by the largest eigenvalue, here $2/3 \approx 0.667$.

Figure 1: Gradient Descent Visualization

It is not difficult to see that the convergence speed depends on the step size, and an overly large step size leads to divergence. Therefore, we need to choose an appropriate step size to ensure convergence.

Constrained Case

When dealing with constrained optimization problems, we need to find the minimum of a function 𝑓(𝒙) subject to constraints. The general form is:

$$\min_{\boldsymbol{x} \in \mathbb{R}^n} f(\boldsymbol{x}) \quad \text{subject to} \quad g_i(\boldsymbol{x}) \le 0,\ i = 1, 2, \ldots, m; \qquad h_j(\boldsymbol{x}) = 0,\ j = 1, 2, \ldots, l$$
(13)

For constrained problems, we cannot simply move in the negative gradient direction, as this may violate the constraints. Instead, we need to project the gradient step back onto the feasible region or use penalty methods.

Projected Gradient Method

The projected gradient method modifies standard gradient descent by projecting each iterate onto the feasible set $\mathcal{C}$:

$$\boldsymbol{x}_{k+1} = \Pi_{\mathcal{C}}\left(\boldsymbol{x}_k - \alpha \nabla f(\boldsymbol{x}_k)\right)$$
(14)

where $\Pi_{\mathcal{C}}$ denotes the projection operator onto the constraint set $\mathcal{C}$.

The pseudocode for the projected gradient method is:
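Here is a minimal Python rendering of that pseudocode (the `project` callback, the iterate-change stopping rule, and the iteration cap are illustrative assumptions):

```python
import numpy as np

def projected_gradient_descent(grad_f, project, x0, alpha,
                               tol=1e-10, max_iter=10_000):
    """Take a gradient step, then project the result back onto the
    feasible set C via `project`, as in equation (14)."""
    x = project(np.asarray(x0, dtype=float))  # start from a feasible point
    for k in range(max_iter):
        y = x - alpha * grad_f(x)             # unconstrained gradient step
        x_new = project(y)                    # projection onto C
        if np.linalg.norm(x_new - x) < tol:   # stop when the iterate stalls
            return x_new, k
        x = x_new
    return x, max_iter
```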

Example: Linear Constraint

Consider the optimization problem:

$$\min_{(x_1, x_2) \in \mathbb{R}^2} f(x_1, x_2) \quad \text{subject to} \quad x_2 = 1$$
(15)

where $f(x_1, x_2) = x_1^2 + x_1 x_2 + x_2^2$ (the same as in the unconstrained case).

The feasible set is the line $\mathcal{C} = \{(x_1, x_2) : x_2 = 1\}$. The unconstrained minimum $(0, 0)$ is not feasible, so we expect the constrained optimum to lie on the constraint.

The gradient is the same as in the unconstrained case.

For the linear constraint $x_2 = 1$, the projection operation onto the line is:

$$\Pi_{\mathcal{C}}(\boldsymbol{y}) = \begin{pmatrix} y_1 \\ 1 \end{pmatrix}$$
(16)

Algorithm implementation:

$$\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}_{k+1} = \Pi_{\mathcal{C}}\left(\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}_k - \alpha \begin{pmatrix} 2x_1 + x_2 \\ 2x_2 + x_1 \end{pmatrix}_k\right)$$
(17)

We demonstrate the algorithm starting from the same initial point as the unconstrained case:

We set the initial point to $\boldsymbol{x}_0 = (1.0, 2.0)^T$, which lies off the constraint. The algorithm first projects this point onto the constraint $x_2 = 1$, resulting in $(1.0, 1.0)$. After 1000 iterations, it converges to the constrained optimal point with function value $0.75$.
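A sketch of this run, using the `projected_gradient_descent` helper above (the step size $\alpha = 0.1$ is an assumption; the source does not state it):

```python
import numpy as np

grad = lambda x: np.array([2*x[0] + x[1], 2*x[1] + x[0]])
project = lambda y: np.array([y[0], 1.0])   # Pi_C: snap x2 back to the line x2 = 1

x, k = projected_gradient_descent(grad, project, x0=[1.0, 2.0], alpha=0.1)
print(x, x[0]**2 + x[0]*x[1] + x[1]**2)     # expect x ≈ (-0.5, 1.0), f ≈ 0.75
```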

The visualization below shows the optimization trajectory and the projection process. The black line represents the constraint $x_2 = 1$.

For all iterations, the visualization shows (with later iterations becoming more transparent):

  • Red arrows: gradient steps $-\alpha \nabla f(\boldsymbol{x}_k)$ from the current point to the unconstrained update
  • Orange dotted lines: projection steps from the unconstrained update back to the constraint

This clearly demonstrates how the projected gradient method alternates between taking gradient steps and projecting back onto the feasible set. The gradient arrows show both the direction and magnitude of the descent step, while the projection steps ensure feasibility. The path successfully reaches the constrained optimal point $(-0.5, 1.0)$.

Figure 2: Constrained Gradient Descent Visualization

Penalty Method

Another approach for handling constraints is the penalty method, where we convert the constrained problem into an unconstrained one by adding penalty terms:

$$\min_{\boldsymbol{x} \in \mathbb{R}^n} L(\boldsymbol{x}, \rho) = f(\boldsymbol{x}) + \rho \sum_{i=1}^{m} \max(0, g_i(\boldsymbol{x}))^2 + \rho \sum_{j=1}^{l} h_j(\boldsymbol{x})^2$$
(18)

where $\rho > 0$ is the penalty parameter. As $\rho \to \infty$, the solution of the penalized problem approaches the solution of the original constrained problem.
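As an illustration on the running example $\min f$ subject to $x_2 = 1$ (a sketch, not from the source; $\rho = 100$ and the small step size are assumptions made to keep the stiff penalized problem stable), plain gradient descent on $L$ looks like this:

```python
import numpy as np

rho = 100.0    # penalty parameter; larger rho enforces the constraint more tightly
alpha = 1e-3   # small step size: the penalty term makes the problem ill-conditioned

# L(x) = f(x) + rho * (x2 - 1)^2, penalizing the equality constraint x2 = 1
grad_L = lambda x: np.array([2*x[0] + x[1],
                             2*x[1] + x[0] + 2*rho*(x[1] - 1.0)])

x = np.array([1.0, 2.0])
for _ in range(200_000):
    x = x - alpha * grad_L(x)
print(x)   # approaches (-0.5, 1.0); slightly biased for any finite rho
```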

Convergence

Next, we rigorously analyze the convergence of gradient descent.

Convex Optimization

If the optimization problem is convex, then gradient descent (with a suitably small step size) is guaranteed to find the global minimum, since every local minimum of a convex function is global. The definition of a convex function is: for any $x, y \in \mathbb{R}^n$ and $\lambda \in [0, 1]$, we have

$$f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y)$$
(19)
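For instance, the running example $f(x_1, x_2) = x_1^2 + x_1 x_2 + x_2^2$ is convex: a twice-differentiable function is convex if its Hessian is positive semidefinite everywhere, and here

$$\nabla^2 f = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}, \qquad \lambda_1 = 3 > 0, \quad \lambda_2 = 1 > 0$$

so the Hessian is positive definite, $f$ is strictly convex, and the origin found by gradient descent is indeed the global minimum.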