## How to reach the minimum of a cost function

Today’s blog is all about gradient descent, explained through the example of linear regression. *Gradient descent* is used to find the best fit for a straight line through a cloud of data points. To do so, it minimizes a cost function. But before we go into overdrive, let’s start with a brief recap of linear regression.

*Linear regression: fitting a straight line through a cloud of data points*

As explained, linear regression is the process of fitting a straight line through a cloud of data points. It can be used to (for example) predict the price of used cars (*ypred*) based on the age of those cars (*x*). The fitted straight line is defined by the following equation: *ypred = a·x + b*, where *a* is the slope and *b* is the intercept.

The idea behind linear regression is to find the optimal values for the parameters *a* and *b*. If *a* and *b* are optimized, then we’ll get the best fitting line. But in order to find them, we need an optimization method. Exactly, gradient descent!

How does it work? Well, we typically start off with a random choice of *a* and *b*. Then, we calculate the predictions (e.g. car price) for each *x* (e.g. age of the car). Once we’ve got all the predicted values, we can compare them to what we actually observed. The difference between the observations and predictions is called the *error*. And the error, in turn, is the basis of a so-called cost function. As stated in the title, we are going to minimize that cost function. More on that later.
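As a concrete sketch, here’s what that first step could look like in plain Python. The car ages, prices, and starting parameters are made-up numbers for illustration:

```python
# Hypothetical data: predict car price (in thousands) from age (in years).
x = [1, 2, 3, 4, 5]          # age of the car
y = [20, 17, 15, 11, 9]      # observed price

# Start with a (more or less random) guess for slope a and intercept b.
a, b = -2.0, 20.0

# Predictions for each x, and the error per data point.
y_pred = [a * xi + b for xi in x]
errors = [yi - ypi for yi, ypi in zip(y, y_pred)]
print(errors)
```

These per-point errors are exactly the black vertical lines in the figure below; the cost function boils them down to a single number.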

*The black vertical lines visualize the error between the predictions (red line) and observations (blue dots).*

Of course, we don’t look at the absolute error value for each separate data point; instead, we calculate a measure that tells us something about the mean error. Two measures are typically used: the root-mean-square error (RMSE) and the mean absolute error (MAE).

The *MAE* is simply the mean of the absolute error values between the observations and the predictions.

The *RMSE* is the square root of the mean of the squared error values between observations and predictions.
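Both measures are easy to compute by hand. A minimal sketch in plain Python, using hypothetical observed and predicted values:

```python
import math

# Hypothetical observations and predictions.
y_obs  = [20, 17, 15, 11, 9]
y_pred = [18, 16, 14, 12, 10]
n = len(y_obs)

# MAE: mean of the absolute errors.
mae = sum(abs(o - p) for o, p in zip(y_obs, y_pred)) / n

# RMSE: square root of the mean of the squared errors.
rmse = math.sqrt(sum((o - p) ** 2 for o, p in zip(y_obs, y_pred)) / n)

print(mae, rmse)
```

Note that the RMSE punishes large errors more heavily than the MAE, because the errors are squared before averaging.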

In linear regression, the cost function (which tells us what the error is for each combination of *a* and *b*) can be visualized as a convex (bowl-shaped) function. This makes it easy to find the minimum of the bowl, because there is only one place where the derivative of the function is zero. However, more complicated cost functions (think of some weirdly shaped bowl), typically belonging to more complex models than a linear one, can have local minima (i.e. a minimum, but not the deepest one in the bowl), in which case finding the global minimum (i.e. the deepest point of the bowl) becomes a little more difficult. But no worries, this can be solved!

*Convex cost function (bowl)*

Let’s talk about *gradient descent*. Gradient descent basically is the method used to find the global minimum of a cost function. So let’s discuss its simplest form, using the example of linear regression.

The standard approach of gradient descent is based on calculating *derivatives*. So, imagine a simple bowl. The derivative tells you how steep the slope (also called the gradient) of that bowl is, and whether you’re descending or ascending. If the derivative is negative, you’re descending, and the other way around.

Now, if you want to find a minimum, you should be going down, right?

Which means that we can come up with an algorithm that takes a step and then calculates the derivative at the point where it ended up. If the derivative is negative, the algorithm keeps walking in that direction; if not, it starts walking in the other direction. It can easily be compared to how we walk down a mountain.

The step size is called the *learning rate* and is denoted by α (alpha). At each step, a parameter is updated against the gradient, e.g. *a ← a − α · ∂cost/∂a*.

It’s important to choose the learning rate wisely: if it’s too large, the algorithm will start oscillating around the minimum because you keep stepping over it. If it’s too small, we just waste unnecessary time and resources finding the minimum.
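Putting it all together, the walking-downhill loop for linear regression can be sketched like this. The cost function here is the mean squared error, and the data, starting point, and learning rate are hypothetical choices:

```python
# Gradient-descent sketch for fitting y ≈ a*x + b with a mean-squared-error cost.
x = [1, 2, 3, 4, 5]
y = [20, 17, 15, 11, 9]
n = len(x)

a, b = 0.0, 0.0        # starting point for the parameters
alpha = 0.02           # learning rate: small enough not to overshoot

for _ in range(5000):
    y_pred = [a * xi + b for xi in x]
    # Partial derivatives of the MSE cost with respect to a and b.
    grad_a = (-2 / n) * sum(xi * (yi - ypi) for xi, yi, ypi in zip(x, y, y_pred))
    grad_b = (-2 / n) * sum(yi - ypi for yi, ypi in zip(y, y_pred))
    # Step downhill: move against the gradient.
    a -= alpha * grad_a
    b -= alpha * grad_b

print(a, b)  # converges to the least-squares fit for this data
```

For this toy dataset the loop settles on the same slope and intercept you would get from the closed-form least-squares solution, which is a handy way to sanity-check the implementation.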

So, now you understand why it’s called *gradient descent*. Let’s end with a quick note on more advanced variants of gradient descent, which are needed when you’ve got a more complex cost function. In that case you could use the Adam optimizer, or stochastic gradient descent. The Adam optimizer kind of gives you momentum (or velocity) while walking down the mountain, which you can use to roll over local minima instead of getting stuck in them. Stochastic gradient descent uses a different *sample* of your dataset during each step. This means that the cost function is slightly different for each step, and therefore it’s less likely to get stuck in a local minimum: in one sample the local minimum might be visible, in the next maybe not.
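A minimal sketch of the stochastic flavour, where each step looks at a single randomly picked data point instead of the whole dataset (again with hypothetical data and learning rate):

```python
import random

# Stochastic gradient descent: each step uses one random data point,
# so every step sees a slightly different cost landscape.
random.seed(0)
x = [1, 2, 3, 4, 5]
y = [20, 17, 15, 11, 9]

a, b = 0.0, 0.0
alpha = 0.01

for _ in range(20000):
    i = random.randrange(len(x))       # pick one random data point
    err = y[i] - (a * x[i] + b)        # error on that single point
    # Gradient of the squared error on this one sample.
    a += alpha * 2 * err * x[i]
    b += alpha * 2 * err

print(a, b)  # noisy, but hovers near the full-batch solution
```

The result jitters around the full-batch answer rather than landing on it exactly; that jitter is the very randomness that helps the algorithm escape local minima in harder problems.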

Got it? 😅

This blog was originally published here.