Gradient Descent (Part 1)

Ahmed Imam
Published in Analytics Vidhya · Apr 22, 2020


Learn By Example

What is gradient descent?

  • Gradient descent (GD) is an algorithm applied to another algorithm’s cost function in order to minimize it, using the gradients of that function with respect to its coefficients (i.e. the derivative of the cost function with respect to each coefficient).
  • By running GD we can find the coefficient values that achieve the minimum cost (error); a minimal sketch of the idea follows this list.
  • These coefficients are then used by the machine-learning model when predicting values.
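To make the loop concrete before the worked example, here is a minimal sketch of gradient descent on a made-up one-variable cost, cost(c) = (c − 3)², whose derivative is 2·(c − 3); the toy cost and the starting value are assumptions for illustration only, not part of this article’s example:

# Toy cost: cost(c) = (c - 3)**2, so its gradient is 2*(c - 3).
def cost_gradient(c):
    return 2 * (c - 3)

coefficient = 0.0        # arbitrary starting value (an assumption)
learning_rate = 0.1
while True:
    step = learning_rate * cost_gradient(coefficient)
    if abs(step) <= 0.001:   # the step is tiny, so we are near the minimum
        break
    coefficient = coefficient - step
print(coefficient)           # ends up close to 3, the minimizer of the toy cost

The same three ingredients (a gradient, a learning rate and a stopping threshold on the step size) appear in the worked example below.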

Example:

Suppose that:

We have the following linear system

  • Target Values : y = [1.4, 1.9, 3.2]
  • Inputs : x = [0.5, 2.3, 2.9]
  • 𝑠𝑙𝑜𝑝𝑒 = 𝛽1 = 0.64
  • Our equation for linear regression:

𝑦̂ = 𝛽0 + 𝛽1 · 𝑥

Equation 1.1

Where:

(𝑦̂) is the predicted output

𝑠𝑙𝑜𝑝𝑒 = 𝛽1

𝑖𝑛𝑡𝑒𝑟𝑐𝑒𝑝𝑡 = 𝛽0

  • And the cost function (sum of squared residuals):

SSR = Σ (𝑦 − 𝑦̂)²

Equation 1.2

Where:

(𝑦) is the actual output

Tip: To get the idea of gradient descent, we will first work on the intercept coefficient alone, and later on both coefficients together (intercept & slope). So a fixed value will be used for the slope (𝛽1 = 0.64).

  • For our example, the cost (sum of squared residuals, SSR) in Equation 1.2 can be written as follows (substituting Equation 1.1 for the predicted 𝑦̂ and plugging in the x values and the fixed 𝛽1):

SSR = (1.4 − (𝛽0 + 0.64 · 0.5))² + (1.9 − (𝛽0 + 0.64 · 2.3))² + (3.2 − (𝛽0 + 0.64 · 2.9))²

Equation 1.3
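For instance, plugging 𝛽0 = 0 into Equation 1.3 gives SSR = (1.4 − 0.32)² + (1.9 − 1.472)² + (3.2 − 1.856)² ≈ 1.166 + 0.183 + 1.806 ≈ 3.156, which is one point on the curve plotted in Fig. 1 below.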

  • Draw the relation between “cost” and “intercept”:

Fig. 1: Relation between cost and intercept-coefficient.

  • Now we will minimize the cost function using gradient descent with respect to 𝛽0, by taking the derivative of Equation 1.3 (using the chain rule):

d(SSR)/d𝛽0 = −2(1.4 − (𝛽0 + 0.64 · 0.5)) − 2(1.9 − (𝛽0 + 0.64 · 2.3)) − 2(3.2 − (𝛽0 + 0.64 · 2.9))

Equation 1.4

  • Equation 1.4 gives the slope at any point on the cost curve.
  • Our target is the point where the slope of the cost curve is (nearly) zero, which corresponds to a tangent touching the bottom of the curve, or the point closest to the bottom.
  • At that point, where the cost is minimum, we can read off the corresponding value of 𝛽0.
  • To do that, we start the calculations with a random value of 𝛽0, say 𝛽0 = 0, and a learning rate 𝛼 = 0.1.
  • The learning rate determines the step size gradient descent takes toward the minimum cost (at the curve’s bottom) and the best value of 𝛽0.
  • Also, 𝛼 keeps gradient descent from overshooting the curve’s bottom. It is a hyperparameter with no fixed value, generally between 0.1 and 0.001.
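As a rough illustration of why 𝛼 matters, using this example’s numbers (worked out in the Calculations section below): with 𝛼 = 0.1 the first step from 𝛽0 = 0 is about 0.57, a controlled move toward the minimum near 𝛽0 ≈ 0.95, while with 𝛼 = 1 the same step would be about 5.7, jumping far past the minimum, and the following steps would grow even larger, so the iterations would diverge instead of settling.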

Calculations:

1. Get the slope of the cost curve at the arbitrary starting point where 𝛽0 = 0, as in Fig. 2, by substituting into Equation 1.4 (curve slope = −5.7 at point 1).

Fig. 2 Slope at arbitrary point on cost-curve where (𝛽0=0)
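For the starting point, substituting 𝛽0 = 0 into Equation 1.4 gives −2(1.4 − 0.32) − 2(1.9 − 1.472) − 2(3.2 − 1.856) = −2(1.08) − 2(0.428) − 2(1.344) ≈ −5.7, which is the slope of the tangent drawn in Fig. 2.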

2. Compute the step size gradient descent will use to jump to the next point on the curve:

Step size = learning rate × slope = 𝛼 × d(SSR)/d𝛽0

Equation 1.5

3. Compute the new 𝛽0: new 𝛽0 = old 𝛽0 − step size.

4. Repeat steps 1 to 3 with the value of the new 𝛽0, again and again, until the absolute value of the step is equal to or less than 0.001 (|step| ≤ 0.001); at that point the change in both the step and the new 𝛽0 becomes negligible.

5. At (|step| ≤ 0.001), gradient descent has reached its goal: the best value of the system’s intercept coefficient (𝛽0) with the minimum error (cost).

And hence our target is approached.
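Putting numbers to steps 2 to 4 (rounded): with the slope of −5.7 from Fig. 2, the first step is 0.1 × (−5.7) = −0.57, so the new 𝛽0 = 0 − (−0.57) = 0.57. Repeating from there gives a slope of about −2.28, a step of about −0.23 and 𝛽0 ≈ 0.80; the steps keep shrinking as the iterations creep toward the minimum near 𝛽0 ≈ 0.95, and the loop stops once |step| ≤ 0.001.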

The following code puts these steps into practice:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Import the dataset from a CSV file (assumed to hold the example points
# used above: x = [0.5, 2.3, 2.9], y = [1.4, 1.9, 3.2]):
dataset = pd.read_csv('./Datasets/weights.csv')
x = dataset['weight']
y = dataset['height']

# Relation between the "intercept" coefficient and the squared error:
def sqr_err():
    sqError = []
    incpt = np.arange(0, 2, 0.1)
    slope = 0.64
    for i in incpt:
        y_pred = i + slope * x
        residual = sum((y_pred - y)**2)
        sqError.append(residual)
    plt.plot(incpt, sqError)
    plt.title('Intercept vs Squared_Error')
    plt.xlabel('Intercept')
    plt.ylabel('Squared Error')

The part above plots the cost curve as a function of 𝛽0.

# Draw the Squared_Residual vs. Intercept curve:
sqr_err()

# Initial value of the intercept and the learning rate:
i = 0
learningrate = 0.1

while True:
    # Mark the squared error at the current intercept (a point on the cost curve):
    y_pred = i + 0.64 * x
    cost = sum((y_pred - y)**2)
    plt.plot(i, cost, c='r', marker='o')

    print(f'Intercept= {round(i, 4)}')
    print(f'Squared-Residual: {round(cost, 4)}')

    # Slope of the cost curve at the current intercept (Equation 1.4):
    costSlope = -2*(1.4 - (i + 0.64 * 0.5)) - 2*(1.9 - (i + 0.64 * 2.3)) - 2*(3.2 - (i + 0.64 * 2.9))
    print(f'Cost_Slope: {round(costSlope, 4)}')
    line_x_values = np.arange(0, 1.2, 0.1)

    # Draw the tangent: it passes through (0, tangent_intrcept) and (i, cost)
    # and has slope "costSlope"
    tangent_intrcept = cost - (i * costSlope)
    tangent = tangent_intrcept + costSlope * line_x_values
    plt.plot(line_x_values, tangent)

    stepSize = learningrate * costSlope
    if abs(stepSize) <= 0.001:
        print(f'Absolute value of Step Size is less than (0.001): {round(abs(stepSize), 4)}')
        break

    print(f'Step Size: {round(stepSize, 4)}')
    print('\n')
    # New Intercept = Old Intercept - Step Size
    i = i - stepSize

plt.show()
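As a quick sanity check (an addition, not part of the original post): for a fixed slope, the intercept that minimizes the SSR also has a closed form, namely the mean of the residuals y − 0.64·x. With the example values from the text (the CSV file is assumed to hold the same three points), the loop above should stop within roughly 0.002 of this value, given the |step| ≤ 0.001 threshold:

# Closed-form check of the best intercept for a fixed slope of 0.64.
# The example points from the text are hard-coded here as an assumption.
import numpy as np
y = np.array([1.4, 1.9, 3.2])
x = np.array([0.5, 2.3, 2.9])
best_intercept = np.mean(y - 0.64 * x)
print(best_intercept)   # about 0.9507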

Fig. 3: How gradient descent works to reach the minimum cost

The second part of this tutorial will show how to use gradient descent to obtain both coefficients of the linear system (𝛽0 & 𝛽1) in Equation 1.1, following the same approach toward our target, the minimum cost.

For part 2 press here
