Python pandas correlation plot

How to Create a Correlation Matrix in Python

The further the correlation coefficient is from zero, the stronger the relationship between two variables.

But in some cases we want to understand the correlation between more than one pair of variables. In these cases we can create a correlation matrix, a square table that shows the correlation coefficients between several pairwise combinations of variables.

This tutorial explains how to create and interpret a correlation matrix in Python.

How to Create a Correlation Matrix in Python

Use the following steps to create a correlation matrix in Python.

Step 1. Create the dataset

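    import pandas as pd

    # values taken from the dataframe printed below
    data = {'assists': [4, 5, 5, 6, 7, 8, 8, 10],
            'rebounds': [12, 14, 13, 7, 8, 8, 9, 13],
            'points': [22, 24, 26, 26, 29, 32, 20, 14]}

    df = pd.DataFrame(data, columns=['assists','rebounds','points'])
    df

    #   assists  rebounds  points
    #0        4        12      22
    #1        5        14      24
    #2        5        13      26
    #3        6         7      26
    #4        7         8      29
    #5        8         8      32
    #6        8         9      20
    #7       10        13      14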

Step 2. Create the correlation matrix

    #create the correlation matrix
    df.corr()

               assists  rebounds    points
    assists   1.000000 -0.244861 -0.329573
    rebounds -0.244861  1.000000 -0.522092
    points   -0.329573 -0.522092  1.000000

    #create the same correlation matrix with coefficients rounded to 3 decimal places
    df.corr().round(3)

              assists  rebounds  points
    assists     1.000    -0.245  -0.330
    rebounds   -0.245     1.000  -0.522
    points     -0.330    -0.522   1.000

Step 3. Interpret the correlation matrix

All correlation coefficients along the diagonal of the table are equal to 1 because each variable is perfectly correlated with itself.

All other correlation coefficients indicate the correlation between different pairwise combinations of variables. For example:

  • The correlation coefficient between assists and rebounds is -0.245.
  • The correlation coefficient between assists and points is -0.330.
  • The correlation coefficient between rebounds and points is -0.522.
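
Reading the pairs off the table by eye works for three variables but gets tedious for larger matrices. The sketch below is one possible way to list each pair once, ordered from the strongest to the weakest relationship. It rebuilds the dataset from Step 1; the variable names `upper`, `pairs`, and `order` are chosen for this example only.

```python
import numpy as np
import pandas as pd

# Same values as the dataset printed in Step 1.
df = pd.DataFrame({
    "assists": [4, 5, 5, 6, 7, 8, 8, 10],
    "rebounds": [12, 14, 13, 7, 8, 8, 9, 13],
    "points": [22, 24, 26, 26, 29, 32, 20, 14],
})

corr = df.corr()

# Boolean mask that is True only above the diagonal, so every pair
# appears once and the self-correlations on the diagonal are skipped.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Turn the masked matrix into a Series indexed by (row, column) pairs
# and drop the masked-out cells.
pairs = corr.where(upper).stack().dropna()

# Sort by absolute value: the sign gives the direction of the
# relationship, the magnitude gives its strength.
order = pairs.abs().sort_values(ascending=False).index
pairs = pairs.reindex(order)

print(pairs.round(3))
```

Sorting by absolute value rather than by raw value keeps strong negative relationships, such as the one between rebounds and points, at the top of the list.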

Step 4. Visualize the correlation matrix (optional)

You can visualize the correlation matrix using the styling options available in pandas:

    corr = df.corr()
    corr.style.background_gradient(cmap='coolwarm')

Correlation matrix in Python

You can also change the cmap argument to produce a correlation matrix with different colors.

    corr = df.corr()
    corr.style.background_gradient(cmap='RdYlGn')

Correlation matrix with matplotlib in Python

    corr = df.corr()
    corr.style.background_gradient(cmap='bwr')

Correlation matrix using Pandas

Note: For a complete list of cmap arguments, see the matplotlib documentation.


How To Plot Correlation Matrix in Pandas Python?

Stack Vidhya

In machine learning projects, statistical analysis is performed on datasets to identify how the variables are related to and depend on one another. To explore the relationships between the variables, you can plot the correlation matrix.

You can compute the correlation matrix of a pandas dataframe using the df.corr() method and then plot the result.

What is a correlation matrix in python?

A correlation matrix is a matrix that shows the correlation values of the variables in the dataset.

When the matrix just displays the correlation numbers, plotting it as an image makes the correlations easier to understand. A picture is worth a thousand words.

If You’re in a Hurry

You can use the code snippet below to plot the correlation matrix in Python.

    corr = df.corr()
    corr.style.background_gradient(cmap='coolwarm')

If You Want to Understand Details, Read on…

In this tutorial, you’ll learn the different methods available to plot correlation matrices in Python.

Sample Dataframe

First, you’ll create a sample dataframe using the iris dataset from sklearn datasets library.

This will be used to plot correlation matrix between the variables.

    import pandas as pd
    from sklearn import datasets

    iris = datasets.load_iris()

    # feature matrix as a dataframe, with the class label added as "target"
    df = pd.DataFrame(iris.data, columns=iris.feature_names)
    df["target"] = iris.target

    df.head()

The dataframe contains four features: sepal length, sepal width, petal length, and petal width. Let’s plot the correlation matrix of these features.

Dataframe Will Look Like

       sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
    0                5.1               3.5                1.4               0.2       0
    1                4.9               3.0                1.4               0.2       0
    2                4.7               3.2                1.3               0.2       0
    3                4.6               3.1                1.5               0.2       0
    4                5.0               3.6                1.4               0.2       0

Finding Correlation Between Two Variables

In this section, you’ll calculate the correlation between the features sepal length and petal length.

Pandas provides a method called corr() to find the correlation between variables. When it is called on one column with another column as its argument, it returns the correlation coefficient between those two variables.

Use the below snippet to find the correlation between two variables sepal length and petal length.

    correlation = df["sepal length (cm)"].corr(df["petal length (cm)"])
    correlation

The correlation between the features sepal length and petal length is around 0.8717. The number is close to 1, which means these two features are highly correlated.

This is how you can find the correlation between two features using the pandas dataframe corr() method.
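
To make the number less of a black box, the sketch below recomputes the coefficient by hand and compares it with corr(). It assumes the default Pearson method and, to stay self-contained, uses only the five sample rows shown earlier. The result therefore differs from the 0.8717 obtained on the full dataset. The names `r_manual` and `r_pandas` are chosen for this example.

```python
import numpy as np
import pandas as pd

# The five sample rows shown in the dataframe preview above.
sepal_length = pd.Series([5.1, 4.9, 4.7, 4.6, 5.0])
petal_length = pd.Series([1.4, 1.4, 1.3, 1.5, 1.4])

# Deviations of each value from its column mean.
dx = sepal_length - sepal_length.mean()
dy = petal_length - petal_length.mean()

# Pearson's r: the summed product of the deviations, divided by the
# square root of the product of the summed squared deviations.
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same quantity as computed by pandas.
r_pandas = sepal_length.corr(petal_length)

print(r_manual, r_pandas)
```

Because the deviations are divided by their own spread, the result always lies between -1 and 1 regardless of the units of the two columns.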

How to Infer Correlation between variables

There are three types of correlation between variables.

Positive Correlation

When two variables in a dataset increase or decrease together, it is known as a positive correlation. A perfect positive correlation is denoted by 1.

For example, the number of cylinders in a vehicle and the power of the vehicle are positively correlated. If the number of cylinders increases, the power also increases. If the number of cylinders decreases, the power of the vehicle also decreases.

Negative Correlation

When one variable increases as the other decreases, or vice versa, it is known as a negative correlation. A perfect negative correlation is denoted by -1.

For example, the number of cylinders in a vehicle and the mileage of the vehicle are negatively correlated. If the number of cylinders increases, the mileage decreases. If the number of cylinders decreases, the mileage increases.

Zero Correlation

If the variables don’t relate to each other, it is known as zero correlation. Zero correlation is denoted by 0.

For example, the color of the vehicle makes zero impact on the mileage. This means color and mileage are not correlated to each other.

Interpreting the number

A number greater than 0 indicates a positive correlation, and the closer it is to 1, the stronger that correlation. A number less than 0 indicates a negative correlation, and the closer it is to -1, the stronger that correlation.

This is how you can infer the correlation between two variables using the numbers.
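
The three cases can be reproduced with a few made-up numbers. The sketch below is only an illustration: the series are synthetic, and the `describe` helper and its `threshold` cutoff are hypothetical choices for this example, not a standard rule.

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])

# Synthetic series built to show each case.
rises_with_x = 2 * x + 1                 # goes up whenever x goes up
falls_with_x = 10 - 3 * x                # goes down whenever x goes up
unrelated = pd.Series([2, 1, 3, 1, 2])   # no linear trend against x


def describe(r, threshold=0.1):
    """Label a coefficient by its sign; `threshold` is an arbitrary cutoff."""
    if abs(r) < threshold:
        return "close to zero"
    return "positive" if r > 0 else "negative"


for name, series in [
    ("rises_with_x", rises_with_x),
    ("falls_with_x", falls_with_x),
    ("unrelated", unrelated),
]:
    r = x.corr(series)
    print(name, round(r, 3), describe(r))
```

Note that a coefficient near zero only rules out a linear relationship. Two variables can still be related in a non-linear way.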

Next, you’ll see how to plot the correlation matrix using the seaborn and matplotlib libraries.

Plotting Correlation Matrix

In this section, you’ll plot the correlation matrix by using the background gradient colors. This internally uses the matplotlib library.

First, find the correlation between each variable available in the dataframe using the corr() method. The corr() method will give a matrix with the correlation values between each variable.

Now, set the background gradient for the correlation data. Then, you’ll see the correlation matrix colored.

    corr = df.corr()
    corr.style.background_gradient(cmap='coolwarm')

The below image shows the correlation matrix.

Correlation Matrix Python

With the coolwarm colormap, red cells mark the higher correlation values and blue cells mark the lower ones. The stronger the color, the closer the value is to the top or bottom of the range.

This is how you can plot the correlation matrix using the pandas dataframe.

Plotting Correlation HeatMap

In this section, you’ll learn how to plot correlation heatmap using the pandas dataframe data.

You can plot the correlation heatmap using the seaborn.heatmap(df.corr()) method.

Use the below snippet to plot the correlation heatmap.

    import seaborn as sns
    import matplotlib.pyplot as plt

    sns.heatmap(df.corr())
    plt.savefig("Plotting_Correlation_HeatMap.jpg")

This will plot the correlation as a heatmap as shown below.

Here, the color bar beside the heatmap shows which colors correspond to high and low correlation values.

Correlation heatmap using seaborn in python

Adding Title and Axes Labels

In this section, you’ll learn how to add title and the axes labels to the correlation heatmap you’re plotting using the seaborn library.

You can add a title and axes labels using heatmap.set(xlabel='X Axis label', ylabel='Y axis label', title='title').

After setting the values, render the figure to see the heat map with the x-axis label, y-axis label, and title.

Use the below snippet to add axes labels and a title to the heatmap.

    import seaborn as sns
    import matplotlib.pyplot as plt

    hm = sns.heatmap(df.corr(), annot = True)
    hm.set(xlabel='\nIRIS Flower Details', ylabel='IRIS Flower Details\t', title = "Correlation matrix of IRIS data\n")

Correlation heatmap with title

Saving the Correlation Heatmap

You have plotted the correlation heatmap. Now, you’ll learn how you can save the heatmap for future reference.

You can save the correlation heatmap using the plt.savefig("filename.png") method.

It supports jpg and png format file exports.
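
A minimal sketch of the saving step is shown here, assuming seaborn and matplotlib are installed. The output file name `correlation_heatmap.png` is a placeholder, and the dataframe is a stand-in built from three columns of the five sample rows shown earlier, so the block runs on its own.

```python
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, works without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Stand-in data: three columns from the five sample rows shown earlier.
df = pd.DataFrame({
    "sepal length (cm)": [5.1, 4.9, 4.7, 4.6, 5.0],
    "sepal width (cm)": [3.5, 3.0, 3.2, 3.1, 3.6],
    "petal length (cm)": [1.4, 1.4, 1.3, 1.5, 1.4],
})

# sns.heatmap returns the matplotlib Axes it drew on.
hm = sns.heatmap(df.corr(), annot=True)

# The file format is inferred from the extension. bbox_inches="tight"
# keeps the long tick labels from being cut off at the image border.
out_path = Path("correlation_heatmap.png")  # placeholder file name
hm.get_figure().savefig(out_path, bbox_inches="tight")

# Close the figure so later plots start from a clean canvas.
plt.close(hm.get_figure())
```

Saving before calling plt.show() or closing the figure matters, because a closed figure can no longer be written out.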


This is how you can save the correlation heatmap.

Plotting Correlation Scatter Plot

In this section, you’ll learn how to plot the correlation scatter plot.

You can plot the correlation scatterplot using the seaborn.regplot() method.

It accepts two features for X-axis and Y-axis and the scatter plot will be plotted for these two variables.

It also supports drawing the linear regression fit line in the scatter plot. You can enable or disable it using the fit_reg parameter. The parameter fit_reg defaults to True, which means the linear regression fit line is plotted unless you turn it off.
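
To see what such a fit line represents, the sketch below computes a least-squares straight line numerically with numpy.polyfit. This is an illustration of the idea rather than seaborn's internal code, and it uses only the five sample rows shown earlier, so the slope is not the one you would get on the full dataset.

```python
import numpy as np
import pandas as pd

# Sepal length and sepal width from the five sample rows shown earlier.
x = pd.Series([5.1, 4.9, 4.7, 4.6, 5.0], name="sepal length (cm)")
y = pd.Series([3.5, 3.0, 3.2, 3.1, 3.6], name="sepal width (cm)")

# A degree-1 polynomial fit is an ordinary least-squares straight line.
# polyfit returns the coefficients from the highest power down.
slope, intercept = np.polyfit(x.to_numpy(), y.to_numpy(), deg=1)

# Points on the fitted line and the vertical distance of each
# observation from it.
fitted = slope * x + intercept
residuals = y - fitted

print(f"fit line: y = {slope:.3f} * x + {intercept:.3f}")
```

The sign of the slope matches the sign of the correlation coefficient between the two columns, which is why the fit line is a convenient visual summary of the correlation.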

With Linear Regression Fit Line

You can use the below snippet to plot the correlation scatterplot between the variables sepal length and sepal width. Here, the parameter fit_reg is not used, so the linear regression fit line is plotted by default.

    import seaborn as sns
    import matplotlib.pyplot as plt

    # use the function regplot to make a scatterplot
    sns.regplot(x=df["sepal length (cm)"], y=df["sepal width (cm)"])
    plt.savefig("Plotting_Correlation_Scatterplot_With_Regression_Fit.jpg")

You can see the correlation scatter plot with the linear regression fit line.

Correlation scatterplot using seaborn in Python

Without Linear Regression Fit Line

You can use the below snippet to plot the correlation scatterplot between the variables sepal length and sepal width. Here, the parameter fit_reg=False is used, so the linear regression fit line is not plotted.

    import seaborn as sns
    import matplotlib.pyplot as plt

    # use the function regplot to make a scatterplot
    sns.regplot(x=df["sepal length (cm)"], y=df["sepal width (cm)"], fit_reg=False)
    plt.savefig("Plotting_Correlation_Scatterplot_Without_Regression_Fit.jpg")

You can see the correlation scatter plot without the linear regression fit line.


This is how you can plot the correlation scatter plot between the two parameters using the seaborn library.

Plot Correlation Between Two Columns Pandas

In this section, you’ll learn how to plot the correlation between two columns in a pandas dataframe.

You can plot the correlation between two columns of a pandas dataframe using the sns.regplot(x=df['column_1'], y=df['column_2']) snippet.

Use the below snippet to plot a correlation scatter plot between two columns in pandas.

    import seaborn as sns

    sns.regplot(x=df["sepal length (cm)"], y=df["petal length (cm)"])

You can see the correlation of the two columns of the dataframe as a scatterplot.

Correlation Scatter Plot using seaborn in pandas


To summarize, you’ve learned what correlation is, how to find the correlation between two variables, and how to plot a correlation matrix, a correlation heatmap, and a correlation scatterplot with and without the linear regression fit line. You’ve also learned how to save the plotted images for future reference.

If you’ve any questions, comment below.

