ANOVA, T-test and other statistical tests with Python
Analysis of the main statistical tests (ANOVA, T-test, MANOVA etc.) and their characteristics, applying them in Python.
Now that we know the criteria for choosing between the various tests, you can use the flowchart below to select the one that’s right for you. Now that the boring part is over, we can look at some code. 😝
P.S. There are also other statistical tests as alternatives to those proposed, which perform the same functions. To avoid making the article too long, they have been omitted.
Python libraries for statistical tests
The most famous and well-supported Python libraries that collect the main statistical tests are:
- Statsmodels: a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests and statistical data exploration.
- Pingouin: an open-source statistical package written in Python 3 and based mostly on Pandas and NumPy.
- Scipy: a Python-based ecosystem of open-source software for mathematics, science, and engineering.
Testing the assumptions
As for the independence assumption, it must be known a priori: there is no way to infer it from the data. For the other two assumptions (normality and homogeneity of variances), we can use Scipy (the data can be downloaded here):
from scipy import stats
import pandas as pd

# import the data
df = pd.read_csv("Iris_Data.csv")
setosa = df[df['species'] == 'Iris-setosa']
versicolor = df[df['species'] == 'Iris-versicolor']

# homogeneity of variances (Levene's test)
stats.levene(setosa['sepal_width'], versicolor['sepal_width'])

# Shapiro-Wilk test for normality
stats.shapiro(setosa['sepal_width'])
stats.shapiro(versicolor['sepal_width'])
Output: LeveneResult(statistic=0.66, pvalue=0.417)
The test is not significant (p-value = 0.417, well above 0.05), meaning that there is homogeneity of variances and we can proceed.
Output: (0.968, 0.204)
Output: (0.974, 0.337)
Neither test for normality was significant, so neither variable violates the assumption. As for independence, we can assume it a priori from how the data were collected. We can proceed as planned.
T-test
To conduct the Independent t-test, we can use the stats.ttest_ind() method:
stats.ttest_ind(setosa['sepal_width'], versicolor['sepal_width'])
Output: Ttest_indResult(statistic=9.282, pvalue=4.362e-15)
The Independent t-test results are significant (p-value very very small)! Therefore, we can reject the null hypothesis in support of the alternative hypothesis.
If you want a non-parametric alternative, replace stats.ttest_ind with stats.mannwhitneyu (the Mann-Whitney U test); note that stats.wilcoxon is meant for paired samples, not independent ones.
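As a quick illustration, here is a minimal sketch of the Mann-Whitney U test for two independent samples. The data are made up for the example: deliberately skewed (exponential) samples, where a t-test's normality assumption would be dubious.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical skewed (non-normal) data, e.g. response times for two groups
group_a = rng.exponential(scale=1.0, size=200)
group_b = rng.exponential(scale=1.8, size=200)

# Mann-Whitney U: non-parametric test for two independent samples
u_stat, p_value = stats.mannwhitneyu(group_a, group_b, alternative='two-sided')
print(f"U = {u_stat:.1f}, p-value = {p_value:.2e}")
```

A significant result here tells us the two distributions differ in location, without requiring normality.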
ANOVA
To apply ANOVA, we rely on Pingouin. We use a dataset included in the library:
import pingouin as pg

# Read an example dataset
df = pg.read_dataset('mixed_anova')
# Run the ANOVA
aov = pg.anova(data=df, dv='Scores', between='Group', detailed=True)
print(aov)
As we can see, the p-value is below the threshold, so there is a significant difference between the groups! However, with more than two groups, ANOVA alone cannot tell us between which groups the difference lies. To find out, you need to apply pairwise t-tests, which can be done with the method pingouin.pairwise_ttests.
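If Pingouin is not available, the same idea can be sketched by hand with Scipy: run a t-test for each pair of groups and apply a Bonferroni correction to the significance threshold. The three groups below are made up for illustration (only group 'C' has a genuinely shifted mean):

```python
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Made-up scores for three groups; only group 'C' has a shifted mean
groups = {
    'A': rng.normal(5.0, 1.0, 60),
    'B': rng.normal(5.1, 1.0, 60),
    'C': rng.normal(6.0, 1.0, 60),
}

pairs = list(combinations(groups, 2))
alpha = 0.05 / len(pairs)  # Bonferroni correction for multiple comparisons

results = {}
for g1, g2 in pairs:
    t_stat, p_val = stats.ttest_ind(groups[g1], groups[g2])
    results[(g1, g2)] = p_val
    verdict = "significant" if p_val < alpha else "not significant"
    print(f"{g1} vs {g2}: t = {t_stat:.2f}, p = {p_val:.4f} ({verdict})")
```

The Bonferroni correction keeps the family-wise error rate under control; Pingouin's pairwise_ttests offers this and other corrections via its padjust parameter.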
In case you cannot ensure independence (e.g. repeated measurements on the same subjects), use repeated measures ANOVA:
pg.rm_anova(data=df, dv='Scores', within='Time', subject='Subject', detailed=True)
MANOVA
In this example, we go back to the initial dataset, using the width and length columns as dependent variables and the species column as the independent variable.
MANOVA is currently implemented only in the Statsmodels library. One of its main features is that it uses R-style formulas to pass parameters to models.
from statsmodels.multivariate.manova import MANOVA

maov = MANOVA.from_formula('sepal_length + sepal_width + '
                           'petal_length + petal_width ~ species', data=df)
print(maov.mv_test())
The p-value to consider in this case is that of Wilks' lambda for the independent variable (species). As we can see, it too is significant.
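To pull that p-value out programmatically rather than read it off the printed table, the sketch below builds a small made-up iris-like dataset (the column names, group means, and sample sizes are invented) and reads Wilks' lambda from the mv_test() results, assuming Statsmodels' MultivariateTestResults layout:

```python
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

rng = np.random.default_rng(42)
n = 50  # observations per species

# Made-up iris-like data: two measurements, three clearly separated species
df = pd.DataFrame({
    'sepal_length': np.concatenate([rng.normal(m, 0.4, n) for m in (5.0, 5.9, 6.6)]),
    'sepal_width': np.concatenate([rng.normal(m, 0.4, n) for m in (3.4, 2.8, 3.0)]),
    'species': ['setosa'] * n + ['versicolor'] * n + ['virginica'] * n,
})

maov = MANOVA.from_formula('sepal_length + sepal_width ~ species', data=df)
res = maov.mv_test()

# Per-term statistics live in res.results; 'stat' is a DataFrame whose
# rows include "Wilks' lambda" and whose columns include 'Pr > F'
wilks_p = res.results['species']['stat'].loc["Wilks' lambda", 'Pr > F']
print(f"Wilks' lambda p-value for species: {wilks_p:.2e}")
```

With group means this far apart, the test is (unsurprisingly) highly significant.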
We can consider this short guide to statistical tests finished. I hope it helped to clarify the concepts and avoided unnecessary headaches. 😄
Introduction
Imagine you are working on exploratory data analysis or deciding on variables for feature engineering. However, you are not sure whether there are significant differences among the groups in the variables, either to confirm particular hypotheses or to justify using the variables for modeling. Worse still, you may grapple with which tests are appropriate. Sound familiar?
Indeed, the main driver for this article is an effort to deepen understanding of, and note down for future reference, some of the common statistical methods and their application contexts. For each method, the following are outlined:
- a brief introduction of the statistical test
- worked example using dummy data
- interpretation of the results
- discussion on variations in use case (if any) & implementations.
Which type of statistical test to use is influenced by the variables and their measurement levels. Variables can be either:
- Numerical (measurements can be discrete or continuous; interval or ratio)
- Categorical (measurements are discrete; either nominal or ordinal)
  - Nominal — no hierarchical sequence in the measurement levels.
  - Ordinal — having a specific sequence in measurement levels.
Thinking about the independent variables (predictors) and dependent variable (target), the statistical tests may be arranged as such:
Examples & Code Implementations
# Import libraries
import pandas as pd
import numpy as np
import random
import scipy.stats as stats
import statsmodels.api as sm
import statsmodels.stats.contingency_tables as ct
from statsmodels.graphics.gofplots import qqplot
from statsmodels.stats.weightstats import ztest
import matplotlib.pyplot as plt
plt.style.use('default')
plt.style.use('seaborn-darkgrid')
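With the imports in place, here is a minimal sketch of one of the simplest tests they provide: Statsmodels' one-sample z-test. The sample and the hypothesized mean of 2.0 are invented for illustration.

```python
import numpy as np
from statsmodels.stats.weightstats import ztest

rng = np.random.default_rng(0)

# Made-up sample: 200 page-load times (seconds), drawn with a true mean of 2.3
sample = rng.normal(loc=2.3, scale=0.5, size=200)

# H0: the population mean equals 2.0
z_stat, p_value = ztest(sample, value=2.0)
print(f"z = {z_stat:.2f}, p-value = {p_value:.2e}")
```

Since the sample was generated with a mean well above 2.0 and n = 200 is large, the test rejects the null hypothesis.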