Optimization problems can be divided into minimization and maximization problems, and a maximization problem can always be transformed into an equivalent minimization problem. Therefore, in what follows we focus on minimization. Gradient descent is an iterative optimization algorithm for finding a local minimum of a function: it approaches a minimum point by repeatedly moving in the direction opposite to the function's gradient.

Equivalent Problems

First, let us prove some lemmas about monotonicity.

Proposition 1.
(Order Preservation of Strictly Monotonic Functions)

If $f$ is a strictly increasing function, then

$$x_1 < x_2 \iff f(x_1) < f(x_2)$$
(1)

That is, the output strictly increases if and only if the input strictly increases.

Proof.
($\Rightarrow$) This is the definition of strictly increasing: $x_1 < x_2 \implies f(x_1) < f(x_2)$.
($\Leftarrow$) By contradiction: if $x_1 \ge x_2$, then $f(x_1) \ge f(x_2)$, which contradicts $f(x_1) < f(x_2)$.
Proposition 2.
(Strictly Monotonic Function Injection)

If 𝑓 is a strictly monotonic function, then 𝑓 is injective. That is,

$$x_1 = x_2 \iff f(x_1) = f(x_2)$$
(2)
Proof.
($\Rightarrow$) This follows from the definition of a function: equal inputs give equal outputs.
($\Leftarrow$) We need to prove injectivity. Take $f$ strictly increasing on $X$ as an example. Let $x_1, x_2 \in X$ with $f(x_1) = f(x_2)$. Suppose $x_1 < x_2$; then by strict monotonicity, $x_1 < x_2 \implies f(x_1) < f(x_2)$, which is a contradiction. Similarly $x_1 > x_2$ is impossible, therefore $x_1 = x_2$.
Proposition 3.
(Strictly Monotonic Functions Preserve Extreme Points)
Let $Y \subseteq \mathbb{R}$, let $f: X \to Y$ be an arbitrary function, and let $g: Y \to \mathbb{R}$ be a strictly increasing function. Then $g \circ f$ and $f$ have the same extreme points. If instead $g$ is strictly decreasing, the extreme points are reversed: minimum points of $f$ become maximum points of $g \circ f$, and vice versa.
Proof.

By the preceding propositions, we know $y_1 \le y_2 \iff g(y_1) \le g(y_2)$. Thus for a point $x^* \in X$, any $\delta > 0$, and any $x \in U(x^*, \delta)$,

$$f(x^*) \le f(x) \iff g(f(x^*)) \le g(f(x))$$
(3)

This proves that a minimum point of $f$ is also a minimum point of $g \circ f$, and a minimum point of $g \circ f$ is also a minimum point of $f$.

Based on this, for the optimization problem

$$\arg\min_{x} f(x)$$
(4)

it always has the same solution set as $\arg\min_x g(f(x))$ when $g$ is a strictly increasing function, and as $\arg\max_x g(f(x))$ when $g$ is a strictly decreasing function.
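As a quick numerical illustration (a minimal sketch, assuming NumPy; the function $f(x) = (x-2)^2 + 1$ is chosen here purely for illustration), a strictly increasing transform such as $\log$ leaves the argmin unchanged, while a strictly decreasing transform such as $y \mapsto -y$ turns the argmin into an argmax:

```python
import numpy as np

# f is positive everywhere, so log(f) is well defined.
f = lambda x: (x - 2.0) ** 2 + 1.0

xs = np.linspace(-5.0, 5.0, 10001)   # dense grid over the domain
fx = f(xs)

i_min = np.argmin(fx)                # argmin of f
i_log = np.argmin(np.log(fx))        # argmin of g∘f with g = log (increasing)
i_neg = np.argmax(-fx)               # argmax of g∘f with g = -id (decreasing)

assert i_min == i_log == i_neg       # all three pick the same grid point
print(xs[i_min])                     # ≈ 2.0
```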

Example 1.

Consider an interesting matrix optimization problem that demonstrates the equivalence of different objective functions under monotonic transformations.

Let $\boldsymbol{X} = (x_{ij})_{n \times n}$ be a non-negative matrix with column-sum constraints: $\sum_{i=1}^{n} x_{ij} = c$ for all $j$ (where $c$ is a constant).

Define two objective functions:

$$f_1(\boldsymbol{X}) = \frac{\sum_{i=1}^{n} x_{ii}}{\sum_{i \ne j} x_{ij}}$$
(5)

(diagonal elements / off-diagonal elements)

$$f_2(\boldsymbol{X}) = \frac{\sum_{i=1}^{n} x_{ii}}{\sum_{i,j=1}^{n} x_{ij}}$$
(6)

(diagonal elements / all elements)

Due to the column-sum constraint, the total sum of all elements is: $\sum_{i,j=1}^{n} x_{ij} = nc$

Therefore: $f_2(\boldsymbol{X}) = \frac{\sum_{i=1}^{n} x_{ii}}{nc}$

The sum of off-diagonal elements is: $\sum_{i \ne j} x_{ij} = nc - \sum_{i=1}^{n} x_{ii}$

So: $f_1(\boldsymbol{X}) = \frac{\sum_{i=1}^{n} x_{ii}}{nc - \sum_{i=1}^{n} x_{ii}}$

Key Observation: Let $s = \sum_{i=1}^{n} x_{ii}$, then:

  • $f_2 = \dfrac{s}{nc}$
  • $f_1 = \dfrac{s}{nc - s}$

These two functions have a monotonic relationship: dividing the numerator and denominator of $f_1$ by $nc$ gives $f_1 = \frac{s/(nc)}{1 - s/(nc)} = \frac{f_2}{1 - f_2}$.

Since $f_1$ is a strictly increasing function of $f_2$ (on the range $f_2 < 1$), we have:

The optimization problems $\max f_1(\boldsymbol{X})$ and $\max f_2(\boldsymbol{X})$ are equivalent and have the same optimal solution.
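The equivalence can also be checked numerically. Below is a minimal sketch (assuming NumPy; the sampler and the choices $n = 4$, $c = 1$ are illustrative): we draw random non-negative matrices with each column rescaled to sum to $c$, and verify that $f_1$ and $f_2$ order them identically.

```python
import numpy as np

rng = np.random.default_rng(0)
n, c = 4, 1.0

def sample_matrix():
    # Random non-negative matrix whose columns each sum to c.
    X = rng.random((n, n))
    return c * X / X.sum(axis=0, keepdims=True)

def f1(X):  # diagonal sum / off-diagonal sum
    s = np.trace(X)
    return s / (X.sum() - s)

def f2(X):  # diagonal sum / total sum
    return np.trace(X) / X.sum()

mats = [sample_matrix() for _ in range(1000)]
v1 = np.array([f1(X) for X in mats])
v2 = np.array([f2(X) for X in mats])

# Same ordering of candidates, hence the same maximizer.
assert np.argmax(v1) == np.argmax(v2)
assert np.array_equal(np.argsort(v1), np.argsort(v2))
```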

Unconstrained Case

$\mathbb{R}^n$ is the $n$-dimensional Euclidean space. Let $\boldsymbol{x} \in \mathbb{R}^n$, and let $f: \mathbb{R}^n \to \mathbb{R}$ be a differentiable function. Finding the minimum point and minimum value of $f$ is the optimization problem

$$\min_{\boldsymbol{x} \in \mathbb{R}^n} f(\boldsymbol{x})$$
(7)

The iterative equation is

$$\boldsymbol{x}_{k+1} = \boldsymbol{x}_k - \alpha \nabla f(\boldsymbol{x}_k)$$
(8)

where the step size $\alpha$ is a positive number that determines the update magnitude at each iteration, and $\nabla f(\boldsymbol{x}_k)$ is the gradient of $f$ at the point $\boldsymbol{x}_k$. The algorithm's goal is to make $\boldsymbol{x}_k \to \boldsymbol{x}^*$ as $k \to \infty$, where $\boldsymbol{x}^* \in \arg\min f(\boldsymbol{x})$; that is, $\boldsymbol{x}^*$ is a minimum point. The pseudocode is as follows:
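Here is a minimal Python rendering of that pseudocode (the gradient-norm stopping rule and the `max_iter` cap are illustrative assumptions, not from the source):

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha, tol=1e-10, max_iter=10_000):
    """Iterate x_{k+1} = x_k - alpha * grad_f(x_k) until the gradient
    is small (an illustrative stopping rule) or max_iter is reached."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:   # convergence test
            return x, k
        x = x - alpha * g             # step against the gradient
    return x, max_iter
```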

Let us look at an example:

$$\min_{(x_1, x_2) \in \mathbb{R}^2} f(x_1, x_2)$$
(9)

where $f(x_1, x_2) = x_1^2 + x_1 x_2 + x_2^2$. First, we know through analytical methods that for all $x_1, x_2 \in \mathbb{R}$,

$$x_1^2 + x_1 x_2 + x_2^2 = (x_1, x_2) \begin{pmatrix} 1 & 1/2 \\ 1/2 & 1 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \ge 0$$
(10)

Equality holds if and only if $x_1 = 0, x_2 = 0$. Therefore, the minimum value $0$ is achieved at the origin. Next, we use gradient descent to solve this.

$$\nabla f\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 2x_1 + x_2 \\ 2x_2 + x_1 \end{pmatrix}$$
(11)
$$\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}_{k+1} = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}_k - \alpha \begin{pmatrix} 2x_1 + x_2 \\ 2x_2 + x_1 \end{pmatrix}_k$$
(12)

We set the initial condition to $\boldsymbol{x}_0 = (1.0, 2.0)^T$, step size $\alpha = 0.1$, and stopping condition $f \le 10^{-20}$. After 212 iterations, the function value reaches 9.925765507684842e-21. The trajectory left by each iteration in the feasible region is shown in the figure below. The black solid lines in the figure are the contour lines of the function $f$, and the arrows indicate the gradient field. The coloring indicates the magnitude of the function value, with darker colors representing larger values. The red points are the positions after each update, and the connecting lines form the iteration trajectory. It can be seen that each iteration moves opposite to the gradient direction with a step proportional to the gradient magnitude, and the iterates gradually approach the origin, the theoretical minimum point.
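This run can be reproduced with a short script (a sketch assuming NumPy; it mirrors the update (12) and the stated stopping rule):

```python
import numpy as np

f = lambda x: x[0]**2 + x[0]*x[1] + x[1]**2
grad = lambda x: np.array([2*x[0] + x[1], 2*x[1] + x[0]])

x = np.array([1.0, 2.0])   # initial condition x_0
alpha = 0.1                # step size
k = 0
while f(x) > 1e-20:        # stopping condition f <= 1e-20
    x = x - alpha * grad(x)
    k += 1
print(k, f(x))             # about 212 iterations, f on the order of 1e-20
```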

When we increase the step size to 0.4, after 44 iterations, the function value reaches 7.503260807194337e-21.

When we increase the step size to 0.5, after 35 iterations, the function value reaches 5.929230630780102e-21.

If the step size is increased beyond $2/3$, the iteration diverges: the Hessian of $f$ is $\begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$, whose largest eigenvalue is $3$, and gradient descent on a quadratic converges only when $\alpha$ is smaller than $2$ divided by the largest eigenvalue, here $2/3 \approx 0.667$.

Figure 1: Gradient Descent Visualization

It is not difficult to see that the convergence speed depends on the step size, and an overly large step size leads to divergence. Therefore, we need to choose an appropriate step size to ensure convergence.

Constrained Case

When dealing with constrained optimization problems, we need to find the minimum of a function 𝑓(𝒙) subject to constraints. The general form is:

$$\min_{\boldsymbol{x} \in \mathbb{R}^n} f(\boldsymbol{x}) \quad \text{subject to} \quad g_i(\boldsymbol{x}) \le 0,\ i = 1, 2, \ldots, m; \qquad h_j(\boldsymbol{x}) = 0,\ j = 1, 2, \ldots, l$$
(13)

For constrained problems, we cannot simply move in the negative gradient direction, as this may violate the constraints. Instead, we need to project the gradient step back onto the feasible region or use penalty methods.

Projected Gradient Method

The projected gradient method modifies standard gradient descent by projecting each iterate onto the feasible set $\mathcal{C}$:

$$\boldsymbol{x}_{k+1} = \Pi_{\mathcal{C}}\left(\boldsymbol{x}_k - \alpha \nabla f(\boldsymbol{x}_k)\right)$$
(14)

where $\Pi_{\mathcal{C}}$ denotes the projection operator onto the constraint set $\mathcal{C}$.

The pseudocode for the projected gradient method is:
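Here is a minimal Python rendering of that pseudocode (the `project` callback, the iterate-change stopping rule, and the iteration cap are illustrative assumptions):

```python
import numpy as np

def projected_gradient_descent(grad_f, project, x0, alpha,
                               tol=1e-10, max_iter=10_000):
    """Take a gradient step, then project the result back onto the
    feasible set C via `project`, as in equation (14)."""
    x = project(np.asarray(x0, dtype=float))  # start from a feasible point
    for k in range(max_iter):
        y = x - alpha * grad_f(x)             # unconstrained gradient step
        x_new = project(y)                    # projection onto C
        if np.linalg.norm(x_new - x) < tol:   # stop when the iterate stalls
            return x_new, k
        x = x_new
    return x, max_iter
```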

Example: Linear Constraint

Consider the optimization problem:

$$\min_{(x_1, x_2) \in \mathbb{R}^2} f(x_1, x_2) \quad \text{subject to} \quad x_2 = 1$$
(15)

where $f(x_1, x_2) = x_1^2 + x_1 x_2 + x_2^2$ (the same as in the unconstrained case).

The feasible set is the line $\mathcal{C} = \{(x_1, x_2) : x_2 = 1\}$. The unconstrained minimum $(0, 0)$ is not feasible, so we expect the constrained optimum to lie on the constraint.

The gradient is the same as in the unconstrained case.

For the linear constraint $x_2 = 1$, the projection operation onto the line is:

$$\Pi_{\mathcal{C}}(\boldsymbol{y}) = \begin{pmatrix} y_1 \\ 1 \end{pmatrix}$$
(16)

Algorithm implementation:

$$\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}_{k+1} = \Pi_{\mathcal{C}}\left(\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}_k - \alpha \begin{pmatrix} 2x_1 + x_2 \\ 2x_2 + x_1 \end{pmatrix}_k\right)$$
(17)

We demonstrate the algorithm starting from the same initial point as the unconstrained case:

We set the initial point to $\boldsymbol{x}_0 = (1.0, 2.0)^T$, which lies off the constraint. The algorithm first projects this point onto the constraint $x_2 = 1$, resulting in $(1.0, 1.0)$. After 1000 iterations, it converges to the constrained optimal point with function value $0.75$.
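A sketch of this run, using the `projected_gradient_descent` helper above (the step size $\alpha = 0.1$ is an assumption; the source does not state it):

```python
import numpy as np

grad = lambda x: np.array([2*x[0] + x[1], 2*x[1] + x[0]])
project = lambda y: np.array([y[0], 1.0])   # Pi_C: snap x2 back to the line x2 = 1

x, k = projected_gradient_descent(grad, project, x0=[1.0, 2.0], alpha=0.1)
print(x, x[0]**2 + x[0]*x[1] + x[1]**2)     # expect x ≈ (-0.5, 1.0), f ≈ 0.75
```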

The visualization below shows the optimization trajectory and the projection process. The black line represents the constraint $x_2 = 1$.

For all iterations, the visualization shows (with later iterations becoming more transparent):

  • Red arrows: gradient steps $-\alpha \nabla f(\boldsymbol{x}_k)$ from the current point to the unconstrained update
  • Orange dotted lines: projection steps from the unconstrained update back to the constraint

This clearly demonstrates how the projected gradient method alternates between taking gradient steps and projecting back onto the feasible set. The gradient arrows show both the direction and magnitude of the descent step, while the projection steps ensure feasibility. The path successfully reaches the constrained optimal point $(-0.5, 1.0)$.

Figure 2: Constrained Gradient Descent Visualization

Penalty Method

Another approach for handling constraints is the penalty method, where we convert the constrained problem into an unconstrained one by adding penalty terms:

$$\min_{\boldsymbol{x} \in \mathbb{R}^n} L(\boldsymbol{x}, \rho) = f(\boldsymbol{x}) + \rho \sum_{i=1}^{m} \max(0, g_i(\boldsymbol{x}))^2 + \rho \sum_{j=1}^{l} h_j(\boldsymbol{x})^2$$
(18)

where $\rho > 0$ is the penalty parameter. As $\rho \to \infty$, the solution of the penalized problem approaches the solution of the original constrained problem.
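As an illustration on the running example $\min f$ subject to $x_2 = 1$ (a sketch, not from the source; $\rho = 100$ and the small step size are assumptions made to keep the stiff penalized problem stable), plain gradient descent on $L$ looks like this:

```python
import numpy as np

rho = 100.0    # penalty parameter; larger rho enforces the constraint more tightly
alpha = 1e-3   # small step size: the penalty term makes the problem ill-conditioned

# L(x) = f(x) + rho * (x2 - 1)^2, penalizing the equality constraint x2 = 1
grad_L = lambda x: np.array([2*x[0] + x[1],
                             2*x[1] + x[0] + 2*rho*(x[1] - 1.0)])

x = np.array([1.0, 2.0])
for _ in range(200_000):
    x = x - alpha * grad_L(x)
print(x)   # approaches (-0.5, 1.0); slightly biased for any finite rho
```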

Convergence

Next, we rigorously analyze the convergence of gradient descent.

Convex Optimization

If the optimization problem is convex, then gradient descent (with a suitably small step size) is guaranteed to find the global minimum, since every local minimum of a convex function is global. The definition of a convex function is: for any $x, y \in \mathbb{R}^n$ and $\lambda \in [0, 1]$, we have

$$f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y)$$
(19)
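For instance, the running example $f(x_1, x_2) = x_1^2 + x_1 x_2 + x_2^2$ is convex: a twice-differentiable function is convex if its Hessian is positive semidefinite everywhere, and here

$$\nabla^2 f = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}, \qquad \lambda_1 = 3 > 0, \quad \lambda_2 = 1 > 0$$

so the Hessian is positive definite, $f$ is strictly convex, and the origin found by gradient descent is indeed the global minimum.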