Linear Regression

In a previous post I implemented the Pearson Correlation Coefficient, a measure of how much one variable depends on another. The three sets of bivariate data I used for testing and demonstration are shown again below, along with their corresponding scatterplots. As you can see, these scatterplots now have lines of best fit added, their gradients and heights (y-intercepts) being calculated using least-squares regression, which is the subject of this article.
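As a rough illustration of the idea, here is a minimal sketch of simple least-squares regression in plain Python. The function name `least_squares_fit` and the sample data are hypothetical and are not taken from the full post; the formulas are the standard ones for the gradient and intercept of a line of best fit.

```python
def least_squares_fit(xs, ys):
    """Return (gradient, intercept) of the least-squares line through the points.

    Hypothetical helper for illustration only.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sum of products of deviations, and sum of squared deviations of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical, so the gradient is undefined")

    gradient = sxy / sxx
    # The line of best fit always passes through (mean_x, mean_y)
    intercept = mean_y - gradient * mean_x
    return gradient, intercept


if __name__ == "__main__":
    # Made-up data lying exactly on y = 2x + 1
    gradient, intercept = least_squares_fit([1, 2, 3, 4], [3, 5, 7, 9])
    print(f"y = {gradient}x + {intercept}")
```

With real, noisy data the points will not lie exactly on the line, but the same two values describe the line drawn through each scatterplot.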

Continue reading

Levenshtein Word Distance

A while ago I wrote an implementation of the Soundex Algorithm which attempts to assign the same encoding to words which are pronounced the same but spelled differently. In this post I'll cover the Levenshtein Word Distance algorithm which is a related concept measuring the "cost" of transforming one word into another by totalling the number of letters which need to be inserted, deleted or substituted.

The Levenshtein Word Distance has a fairly obvious use in helping spell checkers decide which words to suggest as alternatives to misspelled words: if the distance between a misspelled word and an actual word is low, then it is likely that the actual word is what the user intended to type. However, it can be used in any situation where strings of characters need to be compared, such as DNA matching.
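To make the "cost" idea concrete, here is a minimal sketch of the distance calculation in Python. It uses the common dynamic-programming approach with every insertion, deletion and substitution costing 1; the function name `levenshtein` is my own choice for illustration and the full post may structure its code differently.

```python
def levenshtein(source, target):
    """Return the number of single-character edits needed to turn source into target."""
    # previous[j] holds the distance between the first i-1 characters of
    # source and the first j characters of target.
    previous = list(range(len(target) + 1))

    for i, source_char in enumerate(source, start=1):
        current = [i]  # turning i characters into an empty string takes i deletions
        for j, target_char in enumerate(target, start=1):
            substitution_cost = 0 if source_char == target_char else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution (or match)
            ))
        previous = current

    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))  # prints 3
```

Only two rows of the table are kept at any time, which is enough when just the final distance is needed.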

Continue reading

Estimating Pi

Pi is an irrational number starting off 3.14159 and then carrying on for an infinite number of digits with no pattern which anybody has ever discovered. It therefore cannot be written out exactly, only estimated to (in principle) any number of digits.

In this project I will code a few of the simpler methods in Python to give a decidedly non-rigorous introduction to what is actually a vast topic. If you want to do something rather more serious than playing with my code you can download the application called y-cruncher used to break the record here.
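As a taste of what a simple method looks like, here is a minimal sketch of the Gregory-Leibniz series, pi = 4 × (1 − 1/3 + 1/5 − 1/7 + ...). Choosing this particular series is my assumption for illustration, as this excerpt does not say which methods the project covers, and the function name `leibniz_pi` is hypothetical.

```python
import math


def leibniz_pi(term_count):
    """Estimate pi by summing term_count terms of the Gregory-Leibniz series."""
    total = 0.0
    for k in range(term_count):
        # Terms alternate in sign and have odd denominators 1, 3, 5, ...
        sign = -1.0 if k % 2 else 1.0
        total += sign / (2 * k + 1)
    return 4.0 * total


if __name__ == "__main__":
    for term_count in (10, 1_000, 100_000):
        estimate = leibniz_pi(term_count)
        print(f"{term_count:>7} terms: {estimate:.10f} (error {abs(estimate - math.pi):.2e})")
```

The series converges very slowly, with the error shrinking roughly in proportion to one over the number of terms, which is why serious record attempts rely on far more sophisticated formulas.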

Continue reading

Pascal’s Triangle

The numbers in the graphic below form the first five rows of Pascal's Triangle, which in this post I will implement in Python. The first row consists of a single number 1. In subsequent rows, each of which has one more number than the previous, values are calculated by adding the two numbers above left and above right. For the first and last values in each row we just take the single value above, so these are always 1.

Pascal's Triangle in its conventional centred layout
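The rule described above can be sketched in a few lines of Python. The function name `pascals_triangle` and the list-of-lists representation are my own assumptions for illustration, not necessarily how the full post builds the triangle.

```python
def pascals_triangle(row_count):
    """Return the first row_count rows of Pascal's Triangle as a list of lists."""
    rows = []
    for row_index in range(row_count):
        # Every row starts and ends with 1, so begin with a row of 1s
        row = [1] * (row_index + 1)
        # Interior values are the sum of the two values above left and above right
        for i in range(1, row_index):
            row[i] = rows[row_index - 1][i - 1] + rows[row_index - 1][i]
        rows.append(row)
    return rows


if __name__ == "__main__":
    triangle = pascals_triangle(5)
    # Width of the last row, used to centre the rows above it
    width = len(" ".join(str(value) for value in triangle[-1]))
    for row in triangle:
        print(" ".join(str(value) for value in row).center(width))
```

Centring each row on the width of the last one gives a rough version of the conventional layout, although it only lines up neatly while the values are single digits.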

Continue reading


This is a relatively short project which will calculate a few simple statistics from a list of numbers. It covers the most basic areas of classical statistics which might seem a bit old-fashioned in an era of big data and machine learning algorithms, but even the most complex of data science investigations are likely to start out with a few simple statistics.
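For a flavour of the kind of output involved, here is a minimal sketch using Python's standard `statistics` module. Which statistics the project actually calculates is not stated in this excerpt, so the selection below and the function name `summarise` are assumptions made purely for illustration.

```python
import statistics


def summarise(values):
    """Return a dictionary of a few basic descriptive statistics."""
    if not values:
        raise ValueError("values must not be empty")
    return {
        "count": len(values),
        "minimum": min(values),
        "maximum": max(values),
        "range": max(values) - min(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        # Population standard deviation, treating the list as the whole population
        "standard_deviation": statistics.pstdev(values),
    }


if __name__ == "__main__":
    # Made-up sample data
    for name, value in summarise([2, 4, 4, 4, 5, 5, 7, 9]).items():
        print(f"{name}: {value}")
```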

Continue reading

One-Dimensional Cellular Automaton

For this post I will write a simple implementation of a 1-dimensional cellular automaton in Python. The concept of cellular automata has existed since the middle of the 20th century and has grown into a vast field with many practical and theoretical applications.

A cellular automaton consists of any number of "cells" arranged in 1, 2, 3 or more dimensions. Each cell has a state associated with it (in the simplest case just on or off), and each cell, and therefore the entire automaton, transitions from one state to the next over time according to a rule or set of rules.

The number of dimensions, the number of possible cell states, and the rules can become arbitrarily large and complex but for this project I will implement the simplest type of one-dimensional cellular automaton, known as an elementary cellular automaton.
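To show how small the core of an elementary cellular automaton can be, here is a minimal sketch of a single generation step. It assumes the usual numbering of rules from 0 to 255, where the new state of a cell is looked up from the states of its left neighbour, itself and its right neighbour. The wrap-around treatment of the two edge cells and the function name `next_generation` are my own assumptions for illustration.

```python
def next_generation(cells, rule_number):
    """Apply an elementary cellular automaton rule (0 to 255) once.

    cells is a list of 0s and 1s; the edges wrap around (an assumption).
    """
    cell_count = len(cells)
    new_cells = []
    for i in range(cell_count):
        left = cells[(i - 1) % cell_count]
        centre = cells[i]
        right = cells[(i + 1) % cell_count]
        # The three states form a number from 0 to 7, which selects
        # one bit of the rule number as the cell's next state.
        neighbourhood = (left << 2) | (centre << 1) | right
        new_cells.append((rule_number >> neighbourhood) & 1)
    return new_cells


if __name__ == "__main__":
    cells = [0] * 15 + [1] + [0] * 15  # a single "on" cell in the middle
    for _ in range(12):
        print("".join("#" if cell else "." for cell in cells))
        cells = next_generation(cells, 30)
```

Changing the rule number passed in the last line is all it takes to explore any of the 256 elementary rules.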

Continue reading

Finding Prime Numbers with the Sieve of Eratosthenes

Prime numbers have been understood at least since the Ancient Greeks, and possibly since the Ancient Egyptians. In modern times their study has intensified greatly due to their usefulness, notably in encryption, and because computers enable far more of them to be found than could ever be done by hand.

The best known (and according to Wikipedia still the most widely used) method for identifying them is the Sieve of Eratosthenes, which I will implement here in Python.
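For orientation, here is a minimal sketch of the sieve: start with every number up to a limit marked as potentially prime, then repeatedly take the next number still marked and cross out all of its multiples. The function name `sieve_of_eratosthenes` and the boolean-list representation are my own choices for illustration and may differ from the full post.

```python
def sieve_of_eratosthenes(limit):
    """Return a list of all prime numbers up to and including limit."""
    if limit < 2:
        return []

    # is_prime[n] stays True until n is found to be a multiple of a smaller prime
    is_prime = [True] * (limit + 1)
    is_prime[0] = is_prime[1] = False

    candidate = 2
    while candidate * candidate <= limit:
        if is_prime[candidate]:
            # Smaller multiples have already been crossed out by smaller primes,
            # so crossing out can start at candidate squared.
            for multiple in range(candidate * candidate, limit + 1, candidate):
                is_prime[multiple] = False
        candidate += 1

    return [number for number, flag in enumerate(is_prime) if flag]


if __name__ == "__main__":
    print(sieve_of_eratosthenes(30))  # prints [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```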

Continue reading

Calculating Great Circle Distances

The shortest distance between two locations on the surface of Earth (or any planet) is known as the Great Circle Distance. Except over relatively short distances it cannot be measured on a map, due to the distortion and flattening necessary to represent a sphere on a flat plane. However, the calculation of the Great Circle Distance from two coordinates is simple, although I suspect generations of midshipmen might not have agreed. In this post I will write a short program in C to calculate the distance from London to various other cities around the world.
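As a quick illustration of the calculation, here is a minimal sketch in Python (the full post uses C; Python is used here only for brevity). It uses the haversine formula and treats Earth as a perfect sphere with a mean radius of 6371 km. Both of those choices, the function name `great_circle_distance_km` and the approximate coordinates are my assumptions, and the post itself may use a different formula.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean radius, treating Earth as a sphere


def great_circle_distance_km(lat1, lon1, lat2, lon2):
    """Return the great circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_phi = phi2 - phi1
    delta_lambda = math.radians(lon2 - lon1)

    # Haversine formula: a is the square of half the chord length between the points
    a = (math.sin(delta_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2)
    central_angle = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))

    return EARTH_RADIUS_KM * central_angle


if __name__ == "__main__":
    # Approximate coordinates, for illustration only
    london = (51.5074, -0.1278)
    new_york = (40.7128, -74.0060)
    print(f"London to New York: {great_circle_distance_km(*london, *new_york):.0f} km")
```

Because Earth is not a perfect sphere, the result is an approximation, but one that is good enough for most everyday purposes.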

Continue reading