scipy.stats.kstest#
Performs the (one-sample or two-sample) Kolmogorov-Smirnov test for goodness of fit.
The one-sample test compares the underlying distribution F(x) of a sample against a given distribution G(x). The two-sample test compares the underlying distributions of two independent samples. Both tests are valid only for continuous distributions.
Parameters : rvs str, array_like, or callable
If an array, it should be a 1-D array of observations of random variables. If a callable, it should be a function to generate random variables; it is required to have a keyword argument size. If a string, it should be the name of a distribution in scipy.stats, which will be used to generate random variables.
cdf str, array_like or callable
If array_like, it should be a 1-D array of observations of random variables, and the two-sample test is performed (and rvs must be array_like). If a callable, that callable is used to calculate the cdf. If a string, it should be the name of a distribution in scipy.stats, which will be used as the cdf function.
args tuple, sequence, optional
Distribution parameters, used if rvs or cdf are strings or callables.
N int, optional
Sample size if rvs is string or callable. Default is 20.
alternative {‘two-sided’, ‘less’, ‘greater’}, optional
Defines the null and alternative hypotheses. Default is ‘two-sided’. Please see explanations in the Notes below.
method {‘auto’, ‘exact’, ‘approx’, ‘asymp’}, optional
Defines the distribution used for calculating the p-value. The following options are available (default is ‘auto’):
- ‘auto’ : selects one of the other options.
- ‘exact’ : uses the exact distribution of test statistic.
- ‘approx’ : approximates the two-sided probability with twice the one-sided probability.
- ‘asymp’ : uses asymptotic distribution of test statistic.
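For illustration, here is a minimal sketch of a one-sample call driven entirely by these parameters; the distribution names, args values, and N below are arbitrary choices, not defaults:

>>> from scipy import stats
>>> # rvs and cdf given as distribution names; args is forwarded to both,
>>> # N sets the generated sample size, and method selects the p-value computation.
>>> res = stats.kstest('norm', 'norm', args=(0, 1), N=100,
...                    alternative='two-sided', method='auto')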
Returns : res KstestResult
An object containing attributes:
statistic float
KS test statistic, either D+, D-, or D (the maximum of the two).
pvalue float
One-tailed or two-tailed p-value.
statistic_location float
In a one-sample test, this is the value of rvs corresponding with the KS statistic; i.e., the distance between the empirical distribution function and the hypothesized cumulative distribution function is measured at this observation.
In a two-sample test, this is the value from rvs or cdf corresponding with the KS statistic; i.e., the distance between the empirical distribution functions is measured at this observation.
statistic_sign int
In a one-sample test, this is +1 if the KS statistic is the maximum positive difference between the empirical distribution function and the hypothesized cumulative distribution function (D+); it is -1 if the KS statistic is the maximum negative difference (D-).
In a two-sample test, this is +1 if the empirical distribution function of rvs exceeds the empirical distribution function of cdf at statistic_location, otherwise -1.
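The attributes can be read directly off the result object; a small sketch (output omitted since it depends on the random draw):

>>> import numpy as np
>>> from scipy import stats
>>> rng = np.random.default_rng()
>>> res = stats.kstest(stats.norm.rvs(size=50, random_state=rng), stats.norm.cdf)
>>> D, p = res.statistic, res.pvalue
>>> where, sign = res.statistic_location, res.statistic_sign  # observation and direction of D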
There are three options for the null and corresponding alternative hypothesis that can be selected using the alternative parameter.
- two-sided: The null hypothesis is that the two distributions are identical, F(x)=G(x) for all x; the alternative is that they are not identical.
- less: The null hypothesis is that F(x) >= G(x) for all x; the alternative is that F(x) < G(x) for at least one x.
- greater: The null hypothesis is that F(x) <= G(x) for all x; the alternative is that F(x) > G(x) for at least one x.
Note that the alternative hypotheses describe the CDFs of the underlying distributions, not the observed values. For example, suppose x1 ~ F and x2 ~ G. If F(x) > G(x) for all x, the values in x1 tend to be less than those in x2.
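As a hedged illustration of this point (the shift of -1 and the sample sizes below are arbitrary): if x1 is drawn from a normal distribution centered at -1 and x2 from the standard normal, then F(x) > G(x) for all x, the values in x1 tend to be smaller, and alternative='greater' is the appropriate choice for detecting the difference.

>>> import numpy as np
>>> from scipy import stats
>>> rng = np.random.default_rng()
>>> x1 = stats.norm.rvs(size=200, loc=-1, random_state=rng)  # F(x) >= G(x) for all x
>>> x2 = stats.norm.rvs(size=200, loc=0, random_state=rng)
>>> smaller = x1.mean() < x2.mean()  # True with overwhelming probability: x1 values tend to be smaller
>>> res = stats.kstest(x1, x2, alternative='greater')  # res.pvalue is expected to be small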
Suppose we wish to test the null hypothesis that a sample is distributed according to the standard normal. We choose a confidence level of 95%; that is, we will reject the null hypothesis in favor of the alternative if the p-value is less than 0.05.
When testing uniformly distributed data, we would expect the null hypothesis to be rejected.
>>> import numpy as np
>>> from scipy import stats
>>> rng = np.random.default_rng()
>>> stats.kstest(stats.uniform.rvs(size=100, random_state=rng),
...              stats.norm.cdf)
KstestResult(statistic=0.5001899973268688, pvalue=1.1616392184763533e-23)
Indeed, the p-value is lower than our threshold of 0.05, so we reject the null hypothesis in favor of the default “two-sided” alternative: the data are not distributed according to the standard normal.
When testing random variates from the standard normal distribution, we expect the data to be consistent with the null hypothesis most of the time.
>>> x = stats.norm.rvs(size=100, random_state=rng)
>>> stats.kstest(x, stats.norm.cdf)
KstestResult(statistic=0.05345882212970396, pvalue=0.9227159037744717)
As expected, the p-value of 0.92 is not below our threshold of 0.05, so we cannot reject the null hypothesis.
Suppose, however, that the random variates are distributed according to a normal distribution that is shifted toward greater values. In this case, the cumulative distribution function (CDF) of the underlying distribution tends to be less than the CDF of the standard normal. Therefore, we would expect the null hypothesis to be rejected with alternative='less':
>>> x = stats.norm.rvs(size=100, loc=0.5, random_state=rng)
>>> stats.kstest(x, stats.norm.cdf, alternative='less')
KstestResult(statistic=0.17482387821055168, pvalue=0.001913921057766743)
and indeed, with p-value smaller than our threshold, we reject the null hypothesis in favor of the alternative.
For convenience, the previous test can be performed using the name of the distribution as the second argument.
>>> stats.kstest(x, "norm", alternative='less') KstestResult(statistic=0.17482387821055168, pvalue=0.001913921057766743)
The examples above have all been one-sample tests identical to those performed by ks_1samp. Note that kstest can also perform two-sample tests identical to those performed by ks_2samp. For example, when two samples are drawn from the same distribution, we expect the data to be consistent with the null hypothesis most of the time.
>>> sample1 = stats.laplace.rvs(size=105, random_state=rng)
>>> sample2 = stats.laplace.rvs(size=95, random_state=rng)
>>> stats.kstest(sample1, sample2)
KstestResult(statistic=0.11779448621553884, pvalue=0.4494256912629795)
As expected, the p-value of 0.45 is not below our threshold of 0.05, so we cannot reject the null hypothesis.
scipy.special.kolmogorov#
Complementary cumulative distribution function (survival function) of the Kolmogorov distribution.
Returns the complementary cumulative distribution function of Kolmogorov's limiting distribution (D_n * sqrt(n) as n goes to infinity) of a two-sided test for equality between an empirical and a theoretical distribution. It is equal to the (limit as n goes to infinity of the) probability that sqrt(n) * max absolute deviation > y.
Parameters : y float array_like
Absolute deviation between the Empirical CDF (ECDF) and the target CDF, multiplied by sqrt(n).
out ndarray, optional
Optional output array for the function results
Returns : scalar or ndarray
The value(s) of kolmogorov(y)
kolmogi
The Inverse Survival Function for the distribution
scipy.stats.kstwobign
Provides the functionality as a continuous distribution
smirnov, smirnovi
Functions for the one-sided distribution
kolmogorov is used by stats.kstest in the application of the Kolmogorov-Smirnov Goodness of Fit test. For historical reasons this function is exposed in scipy.special, but the recommended way to achieve the most accurate CDF/SF/PDF/PPF/ISF computations is to use the stats.kstwobign distribution.
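For example, kolmogorov(y) agrees with the survival function of stats.kstwobign, and both agree with the limiting series 2 * sum_{k>=1} (-1)**(k-1) * exp(-2*k**2*y**2); the following sketch checks that agreement numerically:

>>> import numpy as np
>>> from scipy.special import kolmogorov
>>> from scipy.stats import kstwobign
>>> y = 1.0
>>> k = np.arange(1, 50)
>>> series = 2 * np.sum((-1.0)**(k - 1) * np.exp(-2 * k**2 * y**2))
>>> agree = np.allclose([kolmogorov(y), kstwobign.sf(y)], series)  # True: all three match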
Show the probability of a gap at least as big as 0, 0.5 and 1.0.
>>> import numpy as np
>>> from scipy.special import kolmogorov
>>> from scipy.stats import kstwobign
>>> kolmogorov([0, 0.5, 1.0])
array([ 1.        ,  0.96394524,  0.26999967])
Compare a sample of size 1000 drawn from a Laplace(0, 1) distribution against the target distribution, a Normal(0, 1) distribution.
>>> from scipy.stats import norm, laplace
>>> rng = np.random.default_rng()
>>> n = 1000
>>> lap01 = laplace(0, 1)
>>> x = np.sort(lap01.rvs(n, random_state=rng))
>>> np.mean(x), np.std(x)
(-0.05841730131499543, 1.3968109101997568)
Construct the Empirical CDF and the K-S statistic Dn.
>>> target = norm(0, 1)  # Normal mean 0, stddev 1
>>> cdfs = target.cdf(x)
>>> ecdfs = np.arange(n+1, dtype=float)/n
>>> gaps = np.column_stack([cdfs - ecdfs[:n], ecdfs[1:] - cdfs])
>>> Dn = np.max(gaps)
>>> Kn = np.sqrt(n) * Dn
>>> print('Dn=%f, sqrt(n)*Dn=%f' % (Dn, Kn))
Dn=0.043363, sqrt(n)*Dn=1.371265
>>> print(chr(10).join(['For a sample of size n drawn from a N(0, 1) distribution:',
...     ' the approximate Kolmogorov probability that sqrt(n)*Dn>=%f is %f' % (Kn, kolmogorov(Kn)),
...     ' the approximate Kolmogorov probability that sqrt(n)*Dn<=%f is %f' % (Kn, kstwobign.cdf(Kn))]))
For a sample of size n drawn from a N(0, 1) distribution:
 the approximate Kolmogorov probability that sqrt(n)*Dn>=1.371265 is 0.046533
 the approximate Kolmogorov probability that sqrt(n)*Dn<=1.371265 is 0.953467
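Continuing the same session, a quick cross-check against stats.kstest (a sketch; the exact numbers depend on the random draw above):

>>> from scipy import stats
>>> res = stats.kstest(x, 'norm')  # same sample, same N(0, 1) target
>>> statistic, pvalue = res.statistic, res.pvalue
>>> # statistic reproduces Dn computed above; pvalue is close to kolmogorov(Kn) for n this large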
Plot the Empirical CDF against the target N(0, 1) CDF.
>>> import matplotlib.pyplot as plt
>>> plt.step(np.concatenate([[-3], x]), ecdfs, where='post', label='Empirical CDF')
>>> x3 = np.linspace(-3, 3, 100)
>>> plt.plot(x3, target.cdf(x3), label='CDF for N(0, 1)')
>>> plt.ylim([0, 1]); plt.grid(True); plt.legend();
>>> # Add vertical lines marking Dn+ and Dn-
>>> iminus, iplus = np.argmax(gaps, axis=0)
>>> plt.vlines([x[iminus]], ecdfs[iminus], cdfs[iminus], color='r', linestyle='dashed', lw=4)
>>> plt.vlines([x[iplus]], cdfs[iplus], ecdfs[iplus+1], color='r', linestyle='dashed', lw=4)
>>> plt.show()