Linear Regression

In a previous post I implemented the Pearson Correlation Coefficient, a measure of how much one variable depends on another. The three sets of bivariate data I used for testing and demonstration are shown again below, along with their corresponding scatterplots. As you can see, these scatterplots now have lines of best fit added, their gradients and heights being calculated using least-squares regression, which is the subject of this article.

Data Set 1

x     y
10    32
20    44
40    68
45    74
60    92
65    98
75    110
80    116

[Scatterplot of Data Set 1 with its line of best fit]

Data Set 2

x     y
10    40
20    40
40    60
45    80
60    90
65    110
75    100
80    130

[Scatterplot of Data Set 2 with its line of best fit]

Data Set 3

x     y
10    100
20    10
40    130
45    90
60    40
65    80
75    180
80    50

[Scatterplot of Data Set 3 with its line of best fit]

y = ax + b

To draw the line of best fit we need its two end points, which are then simply joined up. The values of x at those end points are 0 and 90, so we need a formula of the form

General Form of Linear Equation

y = ax + b

We can then plug in 0 and 90 as the values of x to get the corresponding values for y. a represents the gradient or slope of the line, for example if y increases at the same rate as x then a will be 1, if it increases at twice the rate then a will be 2 etc. (The x and y axes might be to different scales so the apparent gradient might not be the same as the actual gradient.) b is the offset, or the amount the line is shifted up or down. If the scatterplot's x-axis starts at or passes through 0 then the height of the line at that point will be the same as b.
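As a minimal sketch of that step, the snippet below evaluates y = ax + b at the two ends of the x-axis. The function name line_end_points is hypothetical, and the values a = 1.2 and b = 20 are the ones reported for Data Set 1 in the program output later in this article.

```python
def line_end_points(a, b, x_min=0, x_max=90):
    """Return the two (x, y) points needed to draw y = ax + b across the plot.

    Hypothetical helper for illustration; 0 and 90 are the x values
    this article uses for the ends of the line.
    """
    return (x_min, a * x_min + b), (x_max, a * x_max + b)


# a and b for Data Set 1, taken from the program output later in the article
start, end = line_end_points(1.2, 20)
print(start, end)
```

Because x_min is 0, the y value of the first point is simply b, which matches the description of the offset above.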

I have used the equation of the line in the form ax + b, but mx + c is also commonly used. Also, the offset is sometimes put first, ie. b + ax or c + mx.

Interpolation and Extrapolation

Apart from drawing a line to give an immediate and intuitive impression of the relationship between two variables, the equation given by linear or other kinds of regression can also be used to estimate values of y for values of x either within or outside the known range of values.

Estimating values within the current range of the independent variable is known as interpolation, for example in Data Set 1 we have no data for x = 30, but once we have calculated a and b we can use them to calculate an estimated value of y for x = 30.

The related process of estimating values outside the range of known x values (in these examples < 10 and > 80) is known as extrapolation.
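The distinction can be sketched in a few lines of code. The helper name estimate is hypothetical, and a = 1.2 and b = 20 are again the values that the program later in this article reports for Data Set 1.

```python
def estimate(x, a, b, known_x):
    """Estimate y = ax + b and label the estimate by where x falls.

    Hypothetical helper: returns (y, kind), where kind is "interpolation"
    if x lies within the range of known_x and "extrapolation" otherwise.
    """
    y = a * x + b
    if min(known_x) <= x <= max(known_x):
        kind = "interpolation"
    else:
        kind = "extrapolation"
    return y, kind


known_x = [10, 20, 40, 45, 60, 65, 75, 80]   # x values shared by all three data sets

print(estimate(30, 1.2, 20, known_x))    # 30 lies inside the known range
print(estimate(100, 1.2, 20, known_x))   # 100 lies outside the known range
```

The arithmetic is identical in both cases; only the amount of trust the result deserves differs.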

The results of interpolation and extrapolation need to be treated with caution. Even if the known data fits a line exactly, that does not imply that unknown values will also fit, and of course the more imperfect the fit, the less likely unknown values are to be on or near the regression line.

One reason for this is a limited set of data which might not be representative of all possible data. Another is that the range of the independent variable might be small, so that the data fits a straight line within that limited range but actually follows a more complex pattern over a wider range of values.

As an example, imagine temperature data for a short period during spring. It might appear to rise linearly during this period, but of course we know it will level out in summer before dropping off through winter, before repeating an annual cycle. A linear model for such data is clearly useless for extrapolation.
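To make this concrete, here is a rough sketch using invented data: the temperature function below is a made-up annual cycle rather than real measurements, and fit_line is a hypothetical stand-alone least-squares helper. The line is fitted to a spring window only and then asked about a day in late autumn.

```python
import math


def fit_line(xs, ys):
    """Least-squares fit of y = ax + b, returning (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    return a, mean_y - a * mean_x


def temperature(day):
    """Invented annual cycle: coldest on day 15, warmest about half a year later."""
    return 10 - 8 * math.cos(2 * math.pi * (day - 15) / 365)


spring_days = list(range(80, 141))            # roughly late March to late May
spring_temps = [temperature(d) for d in spring_days]
a, b = fit_line(spring_days, spring_temps)

# Within the fitted range the straight line tracks the curve closely...
worst_fit_error = max(abs(a * d + b - temperature(d)) for d in spring_days)

# ...but extrapolating to day 320 (mid November) just keeps climbing.
predicted = a * 320 + b
actual = temperature(320)

print("worst error within spring: %.2f" % worst_fit_error)
print("day 320: predicted %.1f, actual %.1f" % (predicted, actual))
```

The fit looks excellent inside the spring window, which is exactly why it is tempting to trust it further than it deserves.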

Is Regression AI?

The concept of regression discussed here goes back to the early 19th Century, when of course it was calculated by hand using pencil and paper. Therefore the answer to the question "is it artificial intelligence" is obviously "don't be stupid, of course it isn't!"

The term artificial intelligence has been around since 1956, and the idea has existed in a serious form (ie. beyond just sci-fi or fantasy) since the 1930s, when it was Alan Turing's original concept. The idea that for decades computers would be little more than calculators or filing systems would not have impressed him. That is perhaps further back than many people imagine, but still nowhere near as far back as the early 1800s.

AI has had a chequered history, full of false starts, dead ends and disappointments, and has only started to become mainstream and actually useful in the past few years. This is due mainly to the emergence of machine learning, so the terms AI and ML are now sometimes used interchangeably.

The existence of very large amounts of data ("Big Data") plus the computing power to crunch that data using what are actually very simple algorithms has led to this revolution. With enough data and computing power you can derive generalised rules (ie. numbers) from samples which can be used to make inferences or decisions on other, similar, items of data.

Hopefully you can see what I am getting at here: carry out regression on the data you have, then use the results for interpolation and extrapolation about other possible data - almost a definition of Machine Learning.

Writing the Code

For this project we will write the code necessary to carry out linear regression, ie. calculate a and b, and then write a method to create sample lists of data corresponding to that shown above for testing and demonstrating the code.

Create a new folder and then create the following empty files:

  • linearregression.py
  • data.py
  • main.py

You can download the code as a zip from the Downloads page, or clone/download the GitHub repository if you prefer.

Open linearregression.py and type or paste the following.

linearregression.py

class LinearRegression(object):
    """
    Implements linear regression on two lists of numerics

    Simply set independent_data and dependent_data,
    call the calculate method,
    and retrieve a and b
    """


    def __init__(self):
        """
        Not much happening here - just create empty attributes
        """


        self.independent_data = None
        self.dependent_data = None
        self.a = None
        self.b = None

    def calculate(self):
        """
        After calling separate functions to calculate a few intermediate values
        calculate a and b (gradient and offset).
        """


        independent_mean = self.__arithmetic_mean(self.independent_data)
        dependent_mean = self.__arithmetic_mean(self.dependent_data)
        products_mean = self.__mean_of_products(self.independent_data, self.dependent_data)
        independent_variance = self.__variance(self.independent_data)

        self.a = (products_mean - (independent_mean * dependent_mean) ) / independent_variance
        self.b = dependent_mean - (self.a * independent_mean)

    def __arithmetic_mean(self, data):
        """
        The arithmetic mean is what most people refer to as the "average",
        or the total divided by the count
        """


        total = 0

        for d in data:
            total += d

        return total / len(data)

    def __mean_of_products(self, data1, data2):
        """
        This is a type of arithmetic mean, but of the products of corresponding values
        in bivariate data
        """


        total = 0

        for i in range(0, len(data1)):
            total += (data1[i] * data2[i])

        return total / len(data1)

    def __variance(self, data):
        """
        The variance is a measure of how much individual items of data typically vary from the
        mean of all the data.
        The items are squared to eliminate negatives.
        (The square root of the variance is the standard deviation.)
        """


        squares = []

        for d in data:
            squares.append(d**2)

        mean_of_squares = self.__arithmetic_mean(squares)
        mean_of_data = self.__arithmetic_mean(data)
        square_of_mean = mean_of_data**2
        variance = mean_of_squares - square_of_mean

        return variance

The __init__ method just creates a few default values. The calculate method is central to this project as it actually calculates a and b. To do this it needs a few intermediate variables: the means of the two sets of data, the mean of the products of the corresponding data items, and the variance of the independent data. These are all calculated by separate methods. Finally we calculate a and b using the formulae which should be clear from the code.
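Written out, the two formulae that the calculate method implements are as follows. The notation is an assumption chosen for illustration: an overline denotes an arithmetic mean, x is the independent data and y the dependent data. The denominator is the variance of x, computed as the mean of the squares minus the square of the mean.

```latex
a = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\overline{x^2} - \bar{x}^2},
\qquad
b = \bar{y} - a\,\bar{x}
```

The second formula says that the line of best fit always passes through the point given by the two means.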

The arithmetic mean is what most people think of as the average, ie the total divided by the count.

The mean of products is also an arithmetic mean, but of the products of each pair of values.

The variance is, as I have described in the docstring, "a measure of how much individual items of data typically vary from the mean of all the data".
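As a quick sanity check, the "mean of the squares minus the square of the mean" calculation can be compared with the standard library. This sketch assumes the population variance (dividing by the count, not the count minus one), which is what the code above computes and what statistics.pvariance returns.

```python
import statistics

x = [10, 20, 40, 45, 60, 65, 75, 80]   # independent data used in all three data sets

# Mean of the squares minus the square of the mean, as in the __variance method.
mean_of_squares = sum(v ** 2 for v in x) / len(x)
square_of_mean = (sum(x) / len(x)) ** 2
variance = mean_of_squares - square_of_mean

print(variance)
print(statistics.pvariance(x))   # population variance from the standard library
```

The two printed values should agree, apart from any tiny floating point differences.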

Now let's move on to data.py.

data.py

def populatedata(independent, dependent, dataset):
    """
    Populates the lists with one of three datasets suitable
    for demonstrating linear regression code
    """


    del independent[:]
    del dependent[:]

    if dataset == 1:
        independent.extend([10,20,40,45,60,65,75,80])
        dependent.extend([32,44,68,74,92,98,110,116])
        return True

    elif dataset == 2:
        independent.extend([10,20,40,45,60,65,75,80])
        dependent.extend([40,40,60,80,90,110,100,130])
        return True

    elif dataset == 3:
        independent.extend([10,20,40,45,60,65,75,80])
        dependent.extend([100,10,130,90,40,80,180,50])
        return True

    else:
        return False

The populatedata function takes two lists and, after emptying them just in case they are being reused, adds one of the three pairs of datasets listed earlier. It returns True if the dataset number is recognised and False otherwise.

Now we can move on to the main function and put the above code to use.

main.py

import data
import linearregression

def main():
    """
    Demonstrate the LinearRegression class with three sets of test data
    provided by the data module
    """


    print("----------------------\n| code-in-python.com |\n| Linear Regression  |\n----------------------\n")

    independent = []
    dependent = []

    lr = linearregression.LinearRegression()

    for d in range(1, 4):

        if data.populatedata(independent, dependent, d):

            lr.independent_data = independent
            lr.dependent_data = dependent

            lr.calculate()

            print("Dataset %d\n---------" % d)
            print("Independent data: " + (str(lr.independent_data)))
            print("Dependent data:   " + (str(lr.dependent_data)))
            print("y = %gx + %g" % (lr.a, lr.b))

            print("")

        else:
            print("Cannot populate Dataset %d" % d)

#-----------------------------------------------------------------------------
main()

Firstly we create a pair of empty lists and a LinearRegression object. Then in a loop we call populatedata and set the LinearRegression object's data lists to the local lists. Next we call the calculate method and print the results.

That's the code finished so we can run it with the following in Terminal:

Running the program

python3.6 main.py

The program output shows each of the three sets of data, along with the formulae of their respective lines of best fit.

Program Output

----------------------
| code-in-python.com |
| Linear Regression  |
----------------------

Dataset 1
---------
Independent data: [10, 20, 40, 45, 60, 65, 75, 80]
Dependent data:   [32, 44, 68, 74, 92, 98, 110, 116]
y = 1.2x + 20

Dataset 2
---------
Independent data: [10, 20, 40, 45, 60, 65, 75, 80]
Dependent data:   [40, 40, 60, 80, 90, 110, 100, 130]
y = 1.24249x + 19.9022

Dataset 3
---------
Independent data: [10, 20, 40, 45, 60, 65, 75, 80]
Dependent data:   [100, 10, 130, 90, 40, 80, 180, 50]
y = 0.441649x + 63.1936
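The three equations say nothing by themselves about how well each line fits its data. One rough way to put a number on that, offered here as a sketch rather than as part of the project, is the average vertical distance between the points and the line. The function name mean_abs_residual is hypothetical, and the values of a and b are copied from the output above, so they are rounded as printed.

```python
def mean_abs_residual(xs, ys, a, b):
    """Average vertical distance between the data points and the line y = ax + b."""
    return sum(abs(y - (a * x + b)) for x, y in zip(xs, ys)) / len(xs)


x = [10, 20, 40, 45, 60, 65, 75, 80]

# (dependent data, a, b) for each data set; a and b as printed in the output above
fits = {
    1: ([32, 44, 68, 74, 92, 98, 110, 116], 1.2, 20),
    2: ([40, 40, 60, 80, 90, 110, 100, 130], 1.24249, 19.9022),
    3: ([100, 10, 130, 90, 40, 80, 180, 50], 0.441649, 63.1936),
}

residuals = {n: mean_abs_residual(x, y, a, b) for n, (y, a, b) in fits.items()}

for n, r in residuals.items():
    print("Dataset %d: mean absolute residual %.2f" % (n, r))
```

Data Set 1 lies exactly on its line, so its figure is effectively zero, while the figure grows for Data Set 2 and again for Data Set 3, matching the impression given by the scatterplots.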

In the section above on interpolation and extrapolation, I used x = 30 in Data Set 1 as an example of missing data which could be estimated by interpolation. Now we have values of a and b for that data we can use them as follows:

Interpolation Example

y = 1.2 * 30 + 20
= 56
