Understanding Batch Gradient Descent: A Key Optimization Algorithm in Machine Learning
In a previously published article (Understanding Linear Regression - Part 2), we introduced the concept of gradient descent as a fundamental optimization algorithm in machine learning. Gradient descent is an iterative method used to minimize a loss function by adjusting model parameters in the direction that reduces the error. Today, we’ll dive deeper into one of its core variants: Batch Gradient Descent. This algorithm is essential for understanding more advanced optimization techniques and remains a cornerstone in the training of many machine learning models.
What is Batch Gradient Descent?
Batch Gradient Descent is a specific implementation of the gradient descent algorithm. The term "batch" refers to the fact that it processes the entire training dataset (the batch) to compute the gradient of the loss function at each step. In other words, for every iteration, the algorithm calculates the average gradient of the loss function with respect to the model parameters using all available training examples.
This approach ensures that each parameter update is based on a comprehensive view of the entire dataset, leading to a stable and accurate convergence path toward the minimum of the loss function.
How Does Batch Gradient Descent Work?
The mechanics of Batch Gradient Descent can be broken down into the following steps:
Initialize the model parameters: Start with random values or predefined initial guesses for the parameters (e.g., weights in a neural network or coefficients in a linear model).
Compute the gradient: For each parameter, calculate the gradient of the loss function using the entire training dataset. The gradient points in the direction of the steepest increase in the loss function.
Update the parameters: Adjust the parameters by moving them in the direction opposite to the gradient (i.e., the direction of steepest decrease). This adjustment is scaled by a hyperparameter called the learning rate (denoted as η), which controls the step size.
Repeat: Continue this process iteratively until the algorithm converges to a minimum (i.e., when the loss function stops decreasing significantly) or until a predefined stopping criterion is met.
Mathematically, the parameter update rule for Batch Gradient Descent is:
\(\theta := \theta - \eta \, \nabla_\theta J(\theta)\)
where:
θ represents the model parameters,
η is the learning rate,
∇θJ(θ) is the gradient of the loss function J with respect to θ, computed over the entire dataset.
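The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration for linear regression with an MSE loss; the function name and defaults are my own, not from any particular library:

```python
import numpy as np

def batch_gradient_descent(X, y, lr=0.1, n_iters=100):
    """Minimal batch gradient descent for linear regression (MSE loss).

    Illustrative sketch: assumes X has shape (n_samples, n_features)
    and the loss J(theta) = (1/N) * sum((X @ theta - y)**2).
    """
    n_samples, n_features = X.shape
    theta = np.zeros(n_features)                 # step 1: initialize parameters
    for _ in range(n_iters):
        preds = X @ theta                        # predictions on the FULL dataset
        grad = (2 / n_samples) * X.T @ (preds - y)  # step 2: average gradient over all samples
        theta -= lr * grad                       # step 3: move against the gradient
    return theta                                 # step 4: repeat until done
```

Note that the gradient uses every row of X on every iteration; this is exactly what the word "batch" refers to.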
A Simple Example: Linear Regression with Batch Gradient Descent
To make this concept more concrete, let’s consider a simple linear regression problem. Suppose we have a tiny dataset with two points:
(1,1) and (2,2). Our goal is to fit a line of the form y=θx (without an intercept for simplicity) that minimizes the mean squared error (MSE).
The loss function for this problem is:
\(J(\theta) = \frac{1}{2}\left[(\theta \cdot 1 - 1)^2 + (\theta \cdot 2 - 2)^2\right]\)
The gradient of the loss function with respect to θ is:
\(\nabla_\theta J(\theta) = (\theta - 1) + 2(2\theta - 2) = 5\theta - 5\)
Now, let’s apply Batch Gradient Descent with a learning rate of η = 0.1, starting from an initial guess of θ = 0.
Iteration 1:
Compute the gradient:
\(\nabla_\theta J(0) = 5 \cdot 0 - 5 = -5\)
Update θ:
\(\theta_{\text{new}} = 0 - 0.1 \cdot (-5) = 0 + 0.5 = 0.5\)
Iteration 2:
Compute the gradient:
\(\nabla_\theta J(0.5) = 5 \cdot 0.5 - 5 = 2.5 - 5 = -2.5\)
Update θ:
\(\theta_{\text{new}} = 0.5 - 0.1 \cdot (-2.5) = 0.5 + 0.25 = 0.75\)
Iteration 3:
Compute the gradient:
\(\nabla_\theta J(0.75) = 5 \cdot 0.75 - 5 = 3.75 - 5 = -1.25\)
Update θ:
\(\theta_{\text{new}} = 0.75 - 0.1 \cdot (-1.25) = 0.75 + 0.125 = 0.875\)
As we continue this process, θ will gradually approach the optimal value of θ=1, which perfectly fits the data points since both lie on the line y=x.
This example illustrates how Batch Gradient Descent iteratively refines the parameter estimate by considering the entire dataset at each step.
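The three hand-computed iterations above can be checked with a short script. Nothing here is assumed beyond the article's own numbers (gradient 5θ − 5, learning rate 0.1, starting θ = 0):

```python
# Reproduce the three hand-computed iterations of the worked example.
theta, lr = 0.0, 0.1
history = []
for _ in range(3):
    grad = 5 * theta - 5      # gradient for this tiny dataset: 5θ - 5
    theta = theta - lr * grad  # update: θ ← θ - η ∇J(θ)
    history.append(theta)
print(history)  # [0.5, 0.75, 0.875]
```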
Advantages of Batch Gradient Descent
Batch Gradient Descent offers several key benefits:
Convergence to the global minimum (for convex functions): When the loss function is convex (e.g., in linear regression with MSE), Batch Gradient Descent is guaranteed to converge to the global minimum, provided the learning rate is appropriately chosen.
Stable and accurate gradient estimates: Since the gradient is computed using the entire dataset, it provides a precise direction for parameter updates, leading to a smooth convergence path.
Simplicity: The algorithm is straightforward to implement and understand, making it an excellent starting point for learning about optimization in machine learning.
Disadvantages of Batch Gradient Descent
Despite its advantages, Batch Gradient Descent has some notable drawbacks:
Computational intensity for large datasets: Each iteration requires processing the entire dataset, which can be slow and computationally expensive when dealing with large datasets (e.g., millions of samples). The time complexity per iteration is \(O(ND)\), where \(N\) is the number of samples and \(D\) is the number of features.
Memory requirements: The entire dataset must be loaded into memory to compute the gradient, which may not be feasible for very large datasets that exceed available memory.
Potential to get stuck in local minima (for non-convex functions): For non-convex loss functions (e.g., in neural networks), Batch Gradient Descent may converge to a local minimum rather than the global minimum, although this is a general issue with gradient-based optimization.
Learning Rate Selection
The choice of learning rate \(\eta\) is critical in Batch Gradient Descent. A learning rate that is too small will result in slow convergence, while a learning rate that is too large may cause the algorithm to overshoot the minimum or even diverge.
Imagine the loss function as a valley: a small learning rate takes tiny, cautious steps toward the bottom, which is safe but time-consuming. A large learning rate takes bigger leaps, which can speed up progress but risks jumping over the minimum entirely.
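We can see all three behaviors on the article's own one-parameter example (gradient 5θ − 5, optimum θ = 1). The helper below is purely illustrative; the specific learning rates are chosen to demonstrate the effect:

```python
def run(lr, n_iters=10):
    """Run 10 batch-GD steps on the worked example and return the final θ."""
    theta = 0.0
    for _ in range(n_iters):
        theta -= lr * (5 * theta - 5)  # gradient of the example's loss: 5θ - 5
    return theta

print(run(0.01))  # too small: after 10 steps, θ is still far from 1
print(run(0.1))   # well chosen: θ is very close to the optimum 1
print(run(0.5))   # too large: |1 - 5·lr| > 1, the iterates diverge
```

Here each step multiplies the error (θ − 1) by (1 − 5·lr), so any lr above 0.4 makes that factor exceed 1 in magnitude and the algorithm diverges.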
Convergence Criteria
To determine when to stop iterating, common stopping conditions include:
A fixed number of iterations.
When the change in the loss function between iterations falls below a certain threshold.
When the norm of the gradient becomes sufficiently small, indicating proximity to a minimum.
In practice, a combination of these criteria is often used to ensure timely and appropriate convergence.
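A sketch of such a combined stopping rule, pairing an iteration cap with a gradient-norm threshold (function name and tolerances are illustrative assumptions):

```python
import numpy as np

def gd_with_stopping(grad_fn, theta0, lr=0.1, max_iters=1000, grad_tol=1e-6):
    """Batch GD with two stopping criteria: an iteration cap and a
    gradient-norm threshold. Returns the final parameters and the
    iteration index at which the loop stopped."""
    theta = np.asarray(theta0, dtype=float)
    for i in range(max_iters):
        g = grad_fn(theta)
        if np.linalg.norm(g) < grad_tol:  # gradient nearly zero: near a minimum
            break
        theta -= lr * g
    return theta, i
```

On the article's example, `gd_with_stopping(lambda t: 5 * t - 5, [0.0])` stops long before the 1000-iteration cap because the gradient shrinks geometrically.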
Comparison with Other Variants
While Batch Gradient Descent is theoretically sound, its practical limitations for large datasets have led to the development of more efficient variants, such as:
Stochastic Gradient Descent (SGD): Updates parameters using only one randomly selected data point per iteration, which is faster but introduces more noise in the gradient estimates.
Mini-batch Gradient Descent: A compromise between Batch and Stochastic Gradient Descent, where gradients are computed using small subsets (mini-batches) of the dataset, balancing speed and stability.
These variants are particularly useful for training models on large-scale datasets and will be explored in future articles.
Computational Complexity and Memory
As mentioned earlier, the time complexity of Batch Gradient Descent is \(O(ND)\) per iteration, which can be prohibitive for large \(N\).
Additionally, the need to load the entire dataset into memory can be a bottleneck. Although the gradient computation can be parallelized to some extent, synchronization overhead may limit the benefits of parallelization.
Conclusion
Batch Gradient Descent is a foundational optimization algorithm in machine learning, offering a clear and intuitive approach to minimizing loss functions. Its use of the entire dataset for each parameter update ensures stable convergence, making it ideal for smaller datasets or problems where computational resources are not a constraint. However, for large-scale machine learning tasks, its computational and memory demands often necessitate the use of more efficient variants like Stochastic or Mini-batch Gradient Descent.