I recently posted an article on Zipf's Law and the application of the Zipfian Distribution to word frequencies in a piece of text. A closely related concept is Benford's Law, which describes the distribution of the first digits of many, if not most, sets of numeric data. In fact the two are so closely related that the Benford Distribution can be considered as a special case of the Zipfian Distribution.
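As a quick illustration (a minimal sketch, not the code from the full post), Benford's Law gives the expected probability of a leading digit d as log10(1 + 1/d). The function name `benford_probabilities` is my own choice for this example.

```python
import math


def benford_probabilities():
    """Return the expected probability of each leading digit 1-9 under Benford's Law."""
    # P(d) = log10(1 + 1/d) for d = 1..9
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


for digit, probability in benford_probabilities().items():
    print(f"{digit}: {probability:.3f}")
```

The nine probabilities sum to 1, and the digit 1 is expected to lead roughly 30% of the time.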
Zipf's Law describes a probability distribution where each frequency is the reciprocal of its rank multiplied by the highest frequency. Therefore the second highest frequency is the highest multiplied by 1/2, the third highest is the highest multiplied by 1/3 and so on.
This is best illustrated with a graph.
In this post I will write a project in Python to apply Zipf's Law to what is probably its best-known use, that of analysing word frequencies in a piece of text.
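The rule described above (each frequency is the highest frequency multiplied by 1/rank) can be sketched in a few lines. This is only an illustrative outline under my own assumptions: the function name `zipf_table`, the simple regular expression used to split words, and the sample sentence are all hypothetical and not taken from the full post.

```python
from collections import Counter
import re


def zipf_table(text, top=5):
    """Pair the actual count of each top word with the count Zipf's Law predicts."""
    # Assumption: a "word" is any run of lowercase letters or apostrophes
    words = re.findall(r"[a-z']+", text.lower())
    ranked = Counter(words).most_common(top)
    if not ranked:
        return []
    highest = ranked[0][1]
    # Expected frequency at rank r is the highest frequency multiplied by 1/r
    return [(word, count, highest / rank)
            for rank, (word, count) in enumerate(ranked, start=1)]


sample = "the cat and the dog and the bird saw the cat"
for word, actual, expected in zipf_table(sample):
    print(f"{word:<6} actual={actual} expected={expected:.2f}")
```

A sentence this short will not follow the distribution closely; the fit only becomes meaningful with a reasonably long piece of text.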
There is a baffling selection of reporting software out there with very sophisticated functionality and users can put together reports impressive enough to satisfy any manager or board.
However, many people put pragmatics over aesthetics and will say "can't I just get the data in a spreadsheet?"
In this post I will put together a very simple solution to the problem of exporting data from PostgreSQL to an Excel spreadsheet using psycopg2 for the database access and openpyxl for the spreadsheet creation.
Everyone understands averages, both their meaning and how to calculate them. However, there are situations, particularly when dealing with real-time data, when a conventional average is of little use because it includes old values which are no longer relevant and merely give a misleading impression of the current situation.
The solution to this problem, and the subject of this post, is to use moving averages, i.e. the average of the most recent values rather than all values.
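A minimal sketch of the idea, assuming a fixed window size and a simple (unweighted) average; the function name `moving_averages` and the sample readings are made up for illustration.

```python
from collections import deque


def moving_averages(values, window):
    """Yield the average of the most recent `window` values after each new value."""
    if window < 1:
        raise ValueError("window must be at least 1")
    recent = deque(maxlen=window)  # the oldest value drops off automatically
    for value in values:
        recent.append(value)
        # Until the window fills up, the average covers only the values seen so far
        yield sum(recent) / len(recent)


readings = [10, 12, 11, 30, 31, 29]
print(list(moving_averages(readings, window=3)))
```

With these readings the final moving average reflects only the three latest values, whereas a conventional average would still be pulled down by the early low ones.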
So, your child gets 78% in both physics and history. Both pretty good grades but as the reader of geeky blogs like this you believe the sciences are more important than the humanities and would have preferred your child to do better in physics than history.
However, we are not necessarily comparing like with like here: 78% in one subject is probably not equivalent to 78% in another. Rather than the absolute percentages we need to calculate and compare the Z-Scores, which take into account the mean and the spread (standard deviation) of the entire set of scores.
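To make this concrete, here is a small sketch using the standard definition of a Z-Score: the score minus the mean, divided by the standard deviation. I am assuming the population standard deviation, and the class results below are invented purely for illustration.

```python
from statistics import mean, pstdev


def z_score(value, scores):
    """Return how many standard deviations `value` lies above or below the mean of `scores`."""
    spread = pstdev(scores)  # population standard deviation
    if spread == 0:
        raise ValueError("all scores are identical, so the Z-Score is undefined")
    return (value - mean(scores)) / spread


# Made-up class results: the same 78% sits differently within each set
physics = [60, 65, 70, 78, 55, 62]
history = [75, 80, 78, 85, 72, 78]

print(f"physics: {z_score(78, physics):.2f}")
print(f"history: {z_score(78, history):.2f}")
```

In this invented data 78% is exactly average for history but well above average for physics, so the two identical percentages are not equivalent.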
In a previous post I implemented a very simple and very insecure substitution cypher. It is insecure because each letter in the original text is always encrypted the same way, for example the most common letter “e” might always be encrypted as “h”, so if we find that “h” is the most common letter in the encrypted text then we can assume it represents “e”. This can be carried out for all letters, a process called frequency analysis which is the subject of this post.
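The counting step at the heart of frequency analysis can be sketched as follows. This is a simplified outline rather than the implementation from the full post; the function name `letter_frequencies` and the tiny ciphertext are my own, and a text this short is far too small for reliable analysis.

```python
from collections import Counter


def letter_frequencies(text):
    """Return (letter, count) pairs for the letters in `text`, most common first."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    return Counter(letters).most_common()


# "meet me here" with every letter shifted three places, so "e" becomes "h"
ciphertext = "phhw ph khuh"

for letter, count in letter_frequencies(ciphertext):
    print(letter, count)
```

Here "h" comes out as the most common letter, which is the clue that it probably stands for "e".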
A core principle of relational databases is that a database's schema, or the design of its tables, columns and other objects, is held within the database itself; this means we can retrieve the structure using ordinary SQL queries. In this post I will develop a simple module which uses the psycopg2 DB-API interface to retrieve the tables and columns of a PostgreSQL database.
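As a rough sketch of what "retrieving the structure using ordinary SQL" can look like, the function below queries PostgreSQL's `information_schema.columns` view through any DB-API connection. The function name `get_tables_and_columns` and the default schema `public` are my assumptions, not necessarily what the full post uses.

```python
COLUMNS_QUERY = """
    SELECT table_name, column_name, data_type
    FROM information_schema.columns
    WHERE table_schema = %s
    ORDER BY table_name, ordinal_position
"""


def get_tables_and_columns(connection, schema="public"):
    """Return {table_name: [(column_name, data_type), ...]} for one schema."""
    tables = {}
    cursor = connection.cursor()
    try:
        # %s is the parameter placeholder style used by psycopg2
        cursor.execute(COLUMNS_QUERY, (schema,))
        for table_name, column_name, data_type in cursor.fetchall():
            tables.setdefault(table_name, []).append((column_name, data_type))
    finally:
        cursor.close()
    return tables
```

With psycopg2 installed, the connection would come from something like `psycopg2.connect(dbname="mydb", user="me")`, where the database name and user are placeholders. Note that `information_schema.columns` lists the columns of views as well as tables.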
In a previous post I introduced the psycopg2 Python/PostgreSQL interface and used it to create a database, a few tables and a view. In this post I will demonstrate inserting, updating, deleting and selecting data using the database created in the previous post, as well as showing what happens if we try to violate database constraints.