Python p-value from t-statistic
I have some t-values and degrees of freedom and want to find the p-values from them (it’s two-tailed). In the real world I would use a t-test table in the back of a Statistics textbook; how do I do the equivalent in Python? e.g. t-lookup(5, 7) = 0.00245 or something like that. I know in SciPy if I had arrays I could do scipy.stats.ttest_ind , but I don’t. I just have t-statistics and degrees of freedom.
Presumably stats tables exist for convenience instead of having to calculating those values from an equation. Given this is a computer program, why not use that equation directly instead?
It’s quite complicated. I would hope there was some method somewhere in some library that could do it for me.
2 Answers 2
As an exercise, we can calculate our ttest also directly without using the provided function, which should give us the same answer, and so it does:
tt = (sm-m)/np.sqrt(sv/float(n)) # t-statistic for mean pval = stats.t.sf(np.abs(tt), n-1)*2 # two-sided pvalue = Prob(abs(t)>tt) print 't-statistic = %6.3f pvalue = %6.4f' % (tt, pval) t-statistic = 0.391 pvalue = 0.6955
Why do they call it (stats.t.sf) a survival function? Is it actually the same as en.wikipedia.org/wiki/Survival_function
Follow-up to the accepted answer: The p-value of one-sided t-test is the p-value of the two-sided t-test divided by 2: «` pval_one_sided = stats.t.sf(np.abs(tt), n-1) «`
We can compute using the t.cdf() function too:
from scipy.stats import t t_stat = 2.25 dof = 15 # p-value for 2-sided test 2*(1 - t.cdf(abs(t_stat), dof)) # 0.03988800677091664 2*(t.cdf(-abs(t_stat), dof)) # 0.03988800677091648
The below figure shows how the critical region for 5% level of significance looks like for a 2-sided t-test. For the above example, we can see that the null hypothesis can be rejected.
scipy.stats.pearsonr#
Pearson correlation coefficient and p-value for testing non-correlation.
The Pearson correlation coefficient [1] measures the linear relationship between two datasets. Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation. Correlations of -1 or +1 imply an exact linear relationship. Positive correlations imply that as x increases, so does y. Negative correlations imply that as x increases, y decreases.
This function also performs a test of the null hypothesis that the distributions underlying the samples are uncorrelated and normally distributed. (See Kowalski [3] for a discussion of the effects of non-normality of the input on the distribution of the correlation coefficient.) The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a Pearson correlation at least as extreme as the one computed from these datasets.
Parameters : x (N,) array_like
y (N,) array_like
Defines the alternative hypothesis. Default is ‘two-sided’. The following options are available:
- ‘two-sided’: the correlation is nonzero
- ‘less’: the correlation is negative (less than zero)
- ‘greater’: the correlation is positive (greater than zero)
Defines the method used to compute the p-value. If method is an instance of PermutationMethod / MonteCarloMethod , the p-value is computed using scipy.stats.permutation_test / scipy.stats.monte_carlo_test with the provided configuration options and other appropriate settings. Otherwise, the p-value is computed as documented in the notes.
An object with the following attributes:
Pearson product-moment correlation coefficient.
The p-value associated with the chosen alternative.
The object has the following method:
This computes the confidence interval of the correlation coefficient statistic for the given confidence level. The confidence interval is returned in a namedtuple with fields low and high. If method is not provided, the confidence interval is computed using the Fisher transformation [1]. If method is an instance of BootstrapMethod , the confidence interval is computed using scipy.stats.bootstrap with the provided configuration options and other appropriate settings. In some cases, confidence limits may be NaN due to a degenerate resample, and this is typical for very small samples (~6 observations).
Raised if an input is a constant array. The correlation coefficient is not defined in this case, so np.nan is returned.
Spearman rank-order correlation coefficient.
Kendall’s tau, a correlation measure for ordinal data.
The correlation coefficient is calculated as follows:
where \(m_x\) is the mean of the vector x and \(m_y\) is the mean of the vector y.
Under the assumption that x and y are drawn from independent normal distributions (so the population correlation coefficient is 0), the probability density function of the sample correlation coefficient r is ([1], [2]):
where n is the number of samples, and B is the beta function. This is sometimes referred to as the exact distribution of r. This is the distribution that is used in pearsonr to compute the p-value when the method parameter is left at its default value (None). The distribution is a beta distribution on the interval [-1, 1], with equal shape parameters a = b = n/2 — 1. In terms of SciPy’s implementation of the beta distribution, the distribution of r is:
dist = scipy.stats.beta(n/2 - 1, n/2 - 1, loc=-1, scale=2)
The default p-value returned by pearsonr is a two-sided p-value. For a given sample with correlation coefficient r, the p-value is the probability that abs(r’) of a random sample x’ and y’ drawn from the population with zero correlation would be greater than or equal to abs(r). In terms of the object dist shown above, the p-value for a given r and length n can be computed as:
When n is 2, the above continuous distribution is not well-defined. One can interpret the limit of the beta distribution as the shape parameters a and b approach a = b = 0 as a discrete distribution with equal probability masses at r = 1 and r = -1. More directly, one can observe that, given the data x = [x1, x2] and y = [y1, y2], and assuming x1 != x2 and y1 != y2, the only possible values for r are 1 and -1. Because abs(r’) for any sample x’ and y’ with length 2 will be 1, the two-sided p-value for a sample of length 2 is always 1.
For backwards compatibility, the object that is returned also behaves like a tuple of length two that holds the statistic and the p-value.
Student, “Probable error of a correlation coefficient”, Biometrika, Volume 6, Issue 2-3, 1 September 1908, pp. 302-310.
C. J. Kowalski, “On the Effects of Non-Normality on the Distribution of the Sample Product-Moment Correlation Coefficient” Journal of the Royal Statistical Society. Series C (Applied Statistics), Vol. 21, No. 1 (1972), pp. 1-12.
>>> import numpy as np >>> from scipy import stats >>> x, y = [1, 2, 3, 4, 5, 6, 7], [10, 9, 2.5, 6, 4, 3, 2] >>> res = stats.pearsonr(x, y) >>> res PearsonRResult(statistic=-0.828503883588428, pvalue=0.021280260007523286)
To perform an exact permutation version of the test:
>>> rng = np.random.default_rng() >>> method = stats.PermutationMethod(n_resamples=np.inf, random_state=rng) >>> stats.pearsonr(x, y, method=method) PearsonRResult(statistic=-0.828503883588428, pvalue=0.028174603174603175)
To perform the test under the null hypothesis that the data were drawn from uniform distributions:
>>> method = stats.MonteCarloMethod(rvs=(rng.uniform, rng.uniform)) >>> stats.pearsonr(x, y, method=method) PearsonRResult(statistic=-0.828503883588428, pvalue=0.0188)
To produce an asymptotic 90% confidence interval:
>>> res.confidence_interval(confidence_level=0.9) ConfidenceInterval(low=-0.9644331982722841, high=-0.3460237473272273)
And for a bootstrap confidence interval:
>>> method = stats.BootstrapMethod(method='BCa', random_state=rng) >>> res.confidence_interval(confidence_level=0.9, method=method) ConfidenceInterval(low=-0.9983163756488651, high=-0.22771001702132443) # may vary
There is a linear dependence between x and y if y = a + b*x + e, where a,b are constants and e is a random error term, assumed to be independent of x. For simplicity, assume that x is standard normal, a=0, b=1 and let e follow a normal distribution with mean zero and standard deviation s>0.
>>> rng = np.random.default_rng() >>> s = 0.5 >>> x = stats.norm.rvs(size=500, random_state=rng) >>> e = stats.norm.rvs(scale=s, size=500, random_state=rng) >>> y = x + e >>> stats.pearsonr(x, y).statistic 0.9001942438244763
This should be close to the exact value given by
>>> 1/np.sqrt(1 + s**2) 0.8944271909999159
For s=0.5, we observe a high level of correlation. In general, a large variance of the noise reduces the correlation, while the correlation approaches one as the variance of the error goes to zero.
It is important to keep in mind that no correlation does not imply independence unless (x, y) is jointly normal. Correlation can even be zero when there is a very simple dependence structure: if X follows a standard normal distribution, let y = abs(x). Note that the correlation between x and y is zero. Indeed, since the expectation of x is zero, cov(x, y) = E[x*y]. By definition, this equals E[x*abs(x)] which is zero by symmetry. The following lines of code illustrate this observation:
>>> y = np.abs(x) >>> stats.pearsonr(x, y) PearsonRResult(statistic=-0.05444919272687482, pvalue=0.22422294836207743)
A non-zero correlation coefficient can be misleading. For example, if X has a standard normal distribution, define y = x if x < 0 and y = 0 otherwise. A simple calculation shows that corr(x, y) = sqrt(2/Pi) = 0.797…, implying a high level of correlation:
>>> y = np.where(x 0, x, 0) >>> stats.pearsonr(x, y) PearsonRResult(statistic=0.861985781588, pvalue=4.813432002751103e-149)
This is unintuitive since there is no dependence of x and y if x is larger than zero which happens in about half of the cases if we sample x and y.
How SciPy calculates the p-value in pearsonr() function?
I have searched a lot but there is no explanation on how SciPy calculates the p-value for the correlation coefficient and why it is unreliable (started by SciPy on the function page) for data sets smaller than 500.
Looks like the docstring needs some work. I created an issue for this: github.com/scipy/scipy/issues/8789
1 Answer 1
scipy.stats.pearsonr computes the p value using the t distribution. (You can check the source code in the file stats.py on github.) This should definitely be mentioned in the docstring.
Here’s an example. First, import pearsonr and scipy’s implementation of the t distribution:
In [334]: from scipy.stats import pearsonr, t as tdist
Define x and y for this example:
In [335]: x = np.array([0, 1, 2, 3, 5, 8, 13]) In [336]: y = np.array([1.2, 1.4, 1.6, 1.7, 2.0, 4.1, 6.6])
Compute r and p for this data:
In [337]: r, p = pearsonr(x, y) In [338]: r Out[338]: 0.9739566302403544 In [339]: p Out[339]: 0.0002073053505382502
Now compute p again, by first computing the t statistic, and then finding twice the survival function for that t value:
In [340]: df = len(x) - 2 In [341]: t = r * np.sqrt(df/(1 - r**2)) In [342]: 2*tdist.sf(t, df) # This is the p value. Out[342]: 0.0002073053505382502
We get the same p value, as expected.
I don’t know the source of the statement «The p-values are not entirely reliable but are probably reasonable for datasets larger than 500 or so». If anyone knows a citable reference, it should be added to the pearsonr docstring.