Normal Distribution

One of the most useful bits of number-crunching you can do is to calculate the probability distribution of a set of data in the earnest hope that it will be a reasonable fit for one of the recognised distributions such as the normal distribution. In this project I will write a Python class to calculate the normal distribution for a given data set.

The Normal Distribution

Most people will have an intuitive understanding of the normal distribution even if they have no formal education in statistics. As an example let's think about people's heights: a few people are very short, rather more are fairly short, a lot are around average height, and then the numbers drop back down as we start to consider heights above average, eventually reaching zero.

If we draw a graph of heights and their respective frequencies we might expect to see something like this. (Note this is just fictitious data I have created for this project.)

normal distribution

The bars show actual data, whereas the red circles show the theoretical normal distribution for this particular data. As you can see there is a very good but not perfect fit, for example the frequency of 66 inches is 21, but the theoretical frequency assuming the data fits a normal distribution is 20.1516.

So how do we calculate, for example, 20.1516 from our data? Short answer: using the formula below, borrowed from Wikipedia. The variables used in this formula are:

  • x - the individual data value, eg. 21
  • μ - the arithmetic mean of the data
  • σ - the standard deviation of the data
normal distribution

The formula looks slightly intimidating but later on I'll implement it in Python to calculate the normal values for each item of data. In fact that is at the core of this whole project - to iterate a set of data and apply the above formula to calculate the values plotted as red circles.

The Plan

Before starting to code let's consider our requirements and how we are going to implement them. Basically what we want to end up with is a list of dictionaries with an item for each different value in the original data, and attributes for the following:

  • The value (eg. height)
  • The actual frequency of that value
  • The actual probability of that value
  • The theoretical frequency according to the normal distribution
  • The theoretical probability according to the normal distribution

For simplicity I am restricting the data to integers, each of which will be counted separately when calculating frequencies. A refinement would be to allow for real values and to group these into class intervals for the purposes of calculating frequencies. A further refinement could be to extend the code to calculate other types of probability distribution.

Coding

Create a new folder and within it create the following empty files. You can download the source code from the Downloads page or clone/download it from Github if you prefer.

  • data.py
  • normaldistribution.py
  • main.py

In the real world we would be getting our data from a database or some kind of file but for the purposes of this project I have written a function to populate an array with hard-coded sample data. In a couple of earlier projects on this site I have used randomly generated sample data but here we need data which roughly follows a normal distribution so have resorted to hard coding it. I have on my list of future projects a program to generate random data following certain probability distributions but for now this will have to suffice.

Open data.py and enter the function body.

data.py

def generate_data():

    """
    Create a list of data roughly following a normal distribution.
    """

    data = [66,64,65,68,69,71,69,67,66,64,66,60,65,68,66,66,67,66,60,
            63,69,64,66,68,65,65,71,65,70,67,66,69,67,68,70,60,66,63,
            67,67,68,68,63,72,69,71,63,69,67,64,67,71,68,62,64,66,68,
            65,65,66,62,67,63,67,64,67,67,64,69,69,64,62,63,63,70,65,
            68,66,67,66,64,62,68,71,63,65,67,66,69,66,67,70,70,66,69,
            66,64,62,64,64,61,67,65,67,68,69,71,65,61,71,69,67,61,66,
            61,66,65,61,65,70,63,65,64,70,67,62,66,68,66,65,68,65,65,
            70,68,68,62,63]

    return data

The generate_data function simply creates and returns a list of hard-coded sample data. Now open normaldistribution.py and enter or paste the following. I'll run through it in detail afterwards.

normaldistribution.py

import math

class NormalDistribution(object):

    """
    Has list attributes for data and the data's probability distribution, which
    is calculated by calling the calculate_prob_dist method. Also attributes for totals.
    """

    def __init__(self):

        """
        Create attributes with default values.
        """

        self.data = []
        self.probability_distribution = []

        self.total_frequency = 0
        self.total_probability = 0.0
        self.total_normal_probability = 0.0
        self.total_normal_frequency = 0.0

    #------------------------------------------------------------------------------------------------

    def calculate_prob_dist(self):

        """
        Iterate the data, adding dictionary to probability distribution list for new values
        or incrementing frequencies of existing ones.
        Then calculates actual probabilities, and normal probabilities and frequencies.
        """

        self.total_frequency = len(self.data)

        data_total = 0
        sum_of_squares = 0

        for item in self.data:

            # try to find current value in probability distribution list
            index = self.__index_of(item, self.probability_distribution)

            if index < 0:

                # if not present add to list
                self.probability_distribution.append({"value": item, "frequency": 1, "probability": 0, "normal_probability": 0, "normal_frequency": 0})

            else:

                # if present increment frequency
                self.probability_distribution[index]["frequency"] += 1

            data_total += item
            sum_of_squares += (item ** 2)

        # calculate mean, variance and standard deviation
        mean = data_total / self.total_frequency
        variance = (sum_of_squares - ((data_total ** 2) / self.total_frequency)) / self.total_frequency
        stddev = variance ** 0.5

        # calculate probabilities of each unique value
        for pd in self.probability_distribution:

            pd["probability"] = pd["frequency"] / len(self.data)

            pd["normal_probability"] = ((1.0 / (stddev * math.sqrt(2.0 * math.pi))) * (pow(math.e, -1.0 * ((pow((pd["value"] - mean), 2.0)) / ( variance * 2.0)))))

            pd["normal_frequency"] = pd["normal_probability"] * len(self.data)

            self.total_probability += pd["probability"]
            self.total_normal_probability += pd["normal_probability"]
            self.total_normal_frequency += pd["normal_frequency"]

        self.probability_distribution.sort(key = lambda k: k['value'])

    #------------------------------------------------------------------------------------------------

    def print_prob_dist(self):

        """
        Print details of each item in the probability distribution and totals.
        """

        print("Value | Probability | Normal Prob | Freq | Normal Freq\n------------------------------------------------------");

        for item in self.probability_distribution:

            print("{:5d} |{:12.4f} |{:12.4f} |{:5d} |{:12.4f}".format(item["value"], item["probability"], item["normal_probability"], item["frequency"], item["normal_frequency"]))

        print("------------------------------------------------------")

        print("      |{:12.4f} |{:12.6f} |{:5.0f} |{:12.4f}".format(self.total_probability, self.total_normal_probability, self.total_frequency, self.total_normal_frequency))

        print("------------------------------------------------------")

    #------------------------------------------------------------------------------------------------

    def __index_of(self, n, probdist):

        """
        Return index of n if already in probdist, or -1 if not.
        """

        for i in range(0, len(probdist)):

            if probdist[i]["value"] == n:

                return i

        return -1

The __init__ function simply sets up a few attributes with default values. The data attribute is to be set by calling code, the rest being calculated by the class.

Now we move on to calculate_prob_dist, the central method of the whole project.

After setting total_frequency and creating a couple of variables for running totals we loop through the data. For each value we need to add a new dictionary to the probability_distribution list with a frequency of 1 if the value is not already in the list, or just increment the relevant frequency if it is. Finding the index, if any, is farmed out to a separate function called __index_of. Also in the loop we increment data_total and sum_of_squares for later use.

We can now calculate the arithmetic mean, variance and standard deviation, before iterating the probability_distribution list to fill in the missing values.

For each discrete value we calculate the following three values:

  • probability - simply the frequency divided by size

  • normal_probability - this uses the equation shown above but repeated here
    normal distribution

  • normal_frequency - this is the normal probability multiplied by the number of data items

We also increment the three totals, total_probability, total_normal_probability and total_normal_frequency.

At the moment the probability_distribution list is in order of the first occurence of each value so finally we sort it. Note the key argument of the sort method is a lambda returning 'value'.

Now let's look at the print_prob_dist function. This prints out a heading and then iterates the probability_distribution list, printing out the contents in a table before printing out the totals.

Finally comes the __index_of function. This iterates the specified list to find the number n and returns its index or, if the loop ends without finding it, it returns -1.

Now the class and function to generate data are finished so we can try them out. Open main.py and type or paste this code.

main.py

import data
import normaldistribution

def main():

    """
    Demonstrate the NormalDistribution class with a dataset
    created using the data module.
    """

    print("-----------------------\n| code-in-python.com  |\n| Normal Distribution |\n-----------------------\n")

    d = data.generate_data()

    nd = normaldistribution.NormalDistribution()

    nd.data = d

    nd.calculate_prob_dist()

    nd.print_prob_dist()

#-----------------------------------------------------

main()

This is all quite straightforward - we just grab a list of sample data, create a NormalDistribution object and set its data attribute. We then call its calculate_prob_dist and print_prob_dist methods.

We have now finished coding so run the program with the following in your terminal.

Running

python3.6 main.py

This is the program output. The normal probability and frequency totals don't quite add up to 1 and 138 respectively due to roundings.

Program Output

-----------------------
| code-in-python.com  |
| Normal Distribution |
-----------------------

Value | Probability | Normal Prob | Freq | Normal Freq
------------------------------------------------------
   60 |      0.0217 |      0.0120 |    3 |      1.6597
   61 |      0.0362 |      0.0255 |    5 |      3.5189
   62 |      0.0507 |      0.0473 |    7 |      6.5242
   63 |      0.0725 |      0.0766 |   10 |     10.5773
   64 |      0.0942 |      0.1087 |   13 |     14.9952
   65 |      0.1232 |      0.1347 |   17 |     18.5894
   66 |      0.1522 |      0.1460 |   21 |     20.1516
   67 |      0.1377 |      0.1384 |   19 |     19.1024
   68 |      0.1087 |      0.1147 |   15 |     15.8342
   69 |      0.0870 |      0.0832 |   12 |     11.4773
   70 |      0.0580 |      0.0527 |    8 |      7.2747
   71 |      0.0507 |      0.0292 |    7 |      4.0320
   72 |      0.0072 |      0.0142 |    1 |      1.9542
------------------------------------------------------
      |      1.0000 |    0.983269 |  138 |    135.6912
------------------------------------------------------