Correlation is a statistic that measures the degree to which two variables are related, or move together. There are several correlation statistics, but this tutorial uses the most common one, Pearson's correlation coefficient. The calculation is fairly lengthy to carry out step by step, even with the aid of software. Fortunately, it is useful enough that a number of Python libraries implement methods that calculate Pearson's correlation coefficient automatically. This tutorial uses the popular pandas library to demonstrate how to generate correlations and a correlation matrix.
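To give a sense of what the library does for you, here is a minimal sketch that computes Pearson's coefficient step by step and compares it with the pandas `Series.corr` method. The `pearson_r` helper and the `hours` and `score` values are made up for illustration and are not part of the tutorial's notebook.

```python
import math

import pandas as pd


def pearson_r(x, y):
    """Hypothetical helper: Pearson's r computed step by step."""
    n = float(len(x))  # float so division also behaves in Python 2.7
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of the products of the deviations from each mean
    co_deviation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return co_deviation / math.sqrt(ss_x * ss_y)


# Illustrative values only
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 70, 72]

manual = pearson_r(hours, score)
with_pandas = pd.Series(hours).corr(pd.Series(score))  # Pearson is the default

print(manual, with_pandas)
```

The two values should agree up to floating-point rounding, which is the point: the single `corr` call replaces the whole hand-written function.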
As mentioned above, correlation measures the relationship between two variables. The coefficient ranges in value from -1 to 1, with values near these extremes indicating a nearly perfect negative or positive linear relationship. Values close to 0 typically indicate a weak relationship or none at all. The sign indicates the direction of the relationship: positive correlations indicate that the variables tend to move together, while negative correlations indicate that the variables tend to move in opposite directions.
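A small sketch can make the range and sign concrete. The column names and values below are invented for illustration: `up` rises in step with `x`, `down` falls in step with `x`, and `flat` is constructed so that it has no linear relationship with `x`.

```python
import pandas as pd

# Illustrative data only; chosen to hit the extremes and the middle of the range
df = pd.DataFrame({
    "x":    [1, 2, 3, 4, 5],
    "up":   [2, 4, 6, 8, 10],   # moves with x: strong positive correlation
    "down": [10, 8, 6, 4, 2],   # moves against x: strong negative correlation
    "flat": [3, 1, 4, 1, 3],    # symmetric around the middle: no linear trend
})

r_up = df["x"].corr(df["up"])
r_down = df["x"].corr(df["down"])
r_flat = df["x"].corr(df["flat"])

print(r_up, r_down, r_flat)
```

Here `r_up` and `r_down` land at the two ends of the range and `r_flat` lands at essentially zero, apart from floating-point rounding.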
Pandas for Python is a powerful library for manipulating data. It contains many convenience functions for performing complex statistical analyses.
The demonstration covers some basic plotting and pandas DataFrame functionality, including methods for generating descriptive statistics, correlation coefficients, and correlation matrices. It also covers basic data transformation and graphing a multi-faceted scatter plot, or scatter matrix, with matplotlib for comparison.
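As a rough preview of those steps, the sketch below runs them on a small made-up DataFrame. The column names, the values, the log transformation, and the `show_scatter_matrix` helper are all assumptions for illustration and do not reproduce the tutorial's dataset or notebook. The `pandas.plotting` module is assumed to be available, which requires pandas 0.20 or later.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only
df = pd.DataFrame({
    "age":    [23, 35, 41, 52, 60, 29],
    "income": [28000, 52000, 61000, 90000, 83000, 40000],
    "debt":   [12000, 9000, 15000, 4000, 2000, 11000],
})

# Descriptive statistics: count, mean, std, min, quartiles, and max per column
summary = df.describe()

# A basic transformation: add a log-scaled copy of a skewed column
df["log_income"] = np.log(df["income"])

# Correlation between one pair of columns, then the full correlation matrix
r_age_income = df["age"].corr(df["income"])
corr_matrix = df.corr()  # pairwise Pearson correlations by default


def show_scatter_matrix(frame):
    """Hypothetical helper: draw one scatter plot per pair of columns."""
    import matplotlib.pyplot as plt
    from pandas.plotting import scatter_matrix

    scatter_matrix(frame, figsize=(8, 8), diagonal="hist")
    plt.show()


print(summary)
print(corr_matrix)
```

Calling `show_scatter_matrix(df)` draws the grid of scatter plots, which can be read side by side with the numbers in `corr_matrix`.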
This tutorial uses Python 3.6, but the code will work as-is in Python 2.7. If you do not already have a scientific Python platform installed, several open source distributions are available, including Python(x,y), Canopy, and Anaconda.
You can get a copy of the notebook used in the video here.