Python for Data Analysis¶
We will only need the scipy.stats module. The full documentation is available via the link; it describes how to work with continuous (Continuous), discrete (Discrete) and multivariate (Multivariate) distributions. The package also provides a number of statistical methods that are covered in statistics courses.
import scipy.stats as sps
import numpy as np
import ipywidgets as widgets
import matplotlib.pyplot as plt
%matplotlib inline
1. Working with the scipy.stats library¶
- X(params).rvs(size=N) — generate a sample of size $N$ (Random VariateS). Returns a numpy.array;
- X(params).cdf(x) — value of the cumulative distribution function at the point $x$ (Cumulative Distribution Function);
- X(params).logcdf(x) — value of the logarithm of the distribution function at the point $x$;
- X(params).ppf(q) — the $q$-quantile (Percent Point Function);
- X(params).mean() — expectation;
- X(params).median() — median (the $1/2$-quantile);
- X(params).var() — variance (Variance);
- X(params).std() — standard deviation = square root of the variance (Standard Deviation).
In addition, continuous distributions define the functions
- X(params).pdf(x) — value of the density at the point $x$ (Probability Density Function);
- X(params).logpdf(x) — value of the logarithm of the density at the point $x$;
and discrete distributions define
- X(params).pmf(k) — value of the probability mass function at the point $k$ (Probability Mass Function);
- X(params).logpmf(k) — value of the logarithm of the probability mass function at the point $k$ (see the short example after this list).
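For illustration, here is a minimal sketch of these methods on a frozen continuous and a frozen discrete distribution; the families sps.norm and sps.binom and all parameter values are arbitrary choices for the example.
import scipy.stats as sps

X = sps.norm(loc=0, scale=1)        # frozen continuous distribution N(0, 1)
print(X.cdf(0))                     # 0.5 — CDF at x = 0
print(X.ppf(0.975))                 # ~1.96 — the 0.975-quantile
print(X.pdf(0))                     # density at x = 0
print(X.mean(), X.var(), X.std())   # expectation, variance, standard deviation

Y = sps.binom(n=10, p=0.5)          # frozen discrete distribution Bin(10, 0.5)
print(Y.pmf(5))                     # probability mass at k = 5
print(Y.logpmf(5))                  # its logarithm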
All of the methods listed above can be applied either to a specific (frozen) distribution X(params) or to the class X itself. In the latter case the parameters are passed to the method: for example, the call X.rvs(size=N, params) is equivalent to X(params).rvs(size=N). When working with distributions and random variables we recommend the first form, since it agrees better with the mathematical notation of probability theory.
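A small sketch of the two equivalent call styles (the family sps.norm and the parameter values are arbitrary):
import scipy.stats as sps

# frozen distribution: fix the parameters first, then call the method
sample1 = sps.norm(loc=1, scale=3).rvs(size=5)

# method on the class itself: the parameters go into the call
sample2 = sps.norm.rvs(loc=1, scale=3, size=5)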
The parameters can be the following:
- loc — location (shift) parameter;
- scale — scale parameter;
- other distribution-specific parameters (for example, $n$ and $p$ for the binomial distribution); see the example after this list.
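For continuous families in scipy.stats, loc and scale act as an affine transformation: X(loc=a, scale=b) is the distribution of $a + bZ$, where $Z$ follows the standardized form X(loc=0, scale=1). A sketch of this behavior, using the exponential family sps.expon with arbitrary values:
import scipy.stats as sps

Z = sps.expon()                          # standard exponential: loc=0, scale=1
X = sps.expon(loc=1, scale=2)            # the distribution of 1 + 2*Z
print(X.mean(), 1 + 2 * Z.mean())        # both equal 3.0
print(X.ppf(0.5), 1 + 2 * Z.ppf(0.5))    # quantiles transform the same way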
As an example, let us generate a sample of size $N = 200$ from the $\mathcal{N}(1, 9)$ distribution and compute some statistics. In terms of the functions described above, here $X$ = sps.norm and params = ( loc=1, scale=3 ).
Note. A sample is a set of independent identically distributed random variables. In everyday speech a sample is often identified with its realization — the values of these random variables for the elementary outcome that actually "occurred".
sample = sps.norm(loc=1, scale=3).rvs(size=200)
print('First 10 values of the sample:\n', sample[:10])
print('Sample mean: %.3f' % sample.mean())
print('Sample variance: %.3f' % sample.var())
First 10 values of the sample:
 [ 0.65179639 -0.66437884  0.61450407 -0.1828078   0.42271419  0.14424901
  2.01547486  7.81094724 -1.35246891 -1.35574313]
Sample mean: 0.854
Sample variance: 9.118
numpy.random.multivariate_normal¶
Draw random samples from a multivariate normal distribution.
The multivariate normal, multinormal or Gaussian distribution is a generalization of the one-dimensional normal distribution to higher dimensions. Such a distribution is specified by its mean and covariance matrix. These parameters are analogous to the mean (average or “center”) and variance (standard deviation, or “width,” squared) of the one-dimensional normal distribution.
New code should use the multivariate_normal method of a Generator instance instead; please see the Quick Start.
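As a minimal sketch of the Generator-based call recommended above (the seed, mean, covariance and sample size are arbitrary choices):
import numpy as np

rng = np.random.default_rng(0)                                         # Generator instance
sample = rng.multivariate_normal([0, 0], [[1, 0], [0, 1]], size=1000)  # 1000 draws in 2-D
print(sample.shape)                                                    # (1000, 2)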
Parameters : mean 1-D array_like, of length N
Mean of the N-dimensional distribution.
cov 2-D array_like, of shape (N, N)
Covariance matrix of the distribution. It must be symmetric and positive-semidefinite for proper sampling.
size int or tuple of ints, optional
Given a shape of, for example, (m,n,k) , m*n*k samples are generated, and packed in an m-by-n-by-k arrangement. Because each sample is N-dimensional, the output shape is (m,n,k,N) . If no shape is specified, a single (N-D) sample is returned.
check_valid { 'warn', 'raise', 'ignore' }, optional
Behavior when the covariance matrix is not positive semidefinite.
tol float, optional
Tolerance when checking the singular values in covariance matrix. cov is cast to double before the check.
Returns : out ndarray
The drawn samples, of shape size, if that was provided. If not, the shape is (N,) .
In other words, each entry out[i,j,...,:] is an N-dimensional value drawn from the distribution.
See also: random.Generator.multivariate_normal, which should be used for new code.
The mean is a coordinate in N-dimensional space, which represents the location where samples are most likely to be generated. This is analogous to the peak of the bell curve for the one-dimensional or univariate normal distribution.
Covariance indicates the level to which two variables vary together. From the multivariate normal distribution, we draw N-dimensional samples, \(X = [x_1, x_2, \ldots, x_N]\). The covariance matrix element \(C_{ij}\) is the covariance of \(x_i\) and \(x_j\). The element \(C_{ii}\) is the variance of \(x_i\) (i.e. its “spread”).
Instead of specifying the full covariance matrix, popular approximations include:
- Spherical covariance ( cov is a multiple of the identity matrix)
- Diagonal covariance ( cov has non-negative elements, and only on the diagonal)
This geometrical property can be seen in two dimensions by plotting generated data-points:
>>> mean = [0, 0]
>>> cov = [[1, 0], [0, 100]]  # diagonal covariance
Diagonal covariance means that points are oriented along x or y-axis:
>>> import matplotlib.pyplot as plt
>>> x, y = np.random.multivariate_normal(mean, cov, 5000).T
>>> plt.plot(x, y, 'x')
>>> plt.axis('equal')
>>> plt.show()
Note that the covariance matrix must be positive semidefinite (a.k.a. nonnegative-definite). Otherwise, the behavior of this method is undefined and backwards compatibility is not guaranteed.
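One way to catch a violation of this requirement early is the check_valid parameter described above, or a direct eigenvalue check; a sketch with an arbitrary example matrix:
bad_cov = np.array([[1.0, 2.0], [2.0, 1.0]])   # eigenvalues 3 and -1, so not positive semidefinite
print(np.linalg.eigvalsh(bad_cov))             # [-1.  3.]
# np.random.multivariate_normal([0, 0], bad_cov, check_valid='raise')  # would raise ValueError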
>>> mean = (1, 2)
>>> cov = [[1, 0], [0, 1]]
>>> x = np.random.multivariate_normal(mean, cov, (3, 3))
>>> x.shape
(3, 3, 2)
Here we generate 800 samples from the bivariate normal distribution with mean [0, 0] and covariance matrix [[6, -3], [-3, 3.5]]. The expected variances of the first and second components of the sample are 6 and 3.5, respectively, and the expected correlation coefficient is -3/sqrt(6*3.5) ≈ -0.65465.
>>> cov = np.array([[6, -3], [-3, 3.5]])
>>> pts = np.random.multivariate_normal([0, 0], cov, size=800)
Check that the mean, covariance, and correlation coefficient of the sample are close to the expected values:
>>> pts.mean(axis=0)
array([ 0.0326911 , -0.01280782])  # may vary
>>> np.cov(pts.T)
array([[ 5.96202397, -2.85602287],
       [-2.85602287,  3.47613949]])  # may vary
>>> np.corrcoef(pts.T)[0, 1]
-0.6273591314603949  # may vary
We can visualize this data with a scatter plot. The orientation of the point cloud illustrates the negative correlation of the components of this sample.
>>> import matplotlib.pyplot as plt
>>> plt.plot(pts[:, 0], pts[:, 1], '.', alpha=0.5)
>>> plt.axis('equal')
>>> plt.grid()
>>> plt.show()
scipy.stats.multivariate_normal¶
The mean keyword specifies the mean. The cov keyword specifies the covariance matrix.
Parameters : mean array_like, default: [0]
Mean of the distribution.
cov array_like or Covariance , default: [1]
Symmetric positive (semi)definite covariance matrix of the distribution.
allow_singular bool, default: False
Whether to allow a singular covariance matrix. This is ignored if cov is a Covariance object.
seed {None, int, np.random.RandomState, np.random.Generator}, optional
Used for drawing random variates. If seed is None, the RandomState singleton is used. If seed is an int, a new RandomState instance is used, seeded with seed. If seed is already a RandomState or Generator instance, then that object is used. Default is None.
Setting the parameter mean to None is equivalent to having mean be the zero-vector. The parameter cov can be a scalar, in which case the covariance matrix is the identity times that value, a vector of diagonal entries for the covariance matrix, a two-dimensional array_like, or a Covariance object.
The covariance matrix cov may be an instance of a subclass of Covariance , e.g. scipy.stats.CovViaPrecision. If so, allow_singular is ignored.
Otherwise, cov must be a symmetric positive semidefinite matrix when allow_singular is True; it must be (strictly) positive definite when allow_singular is False. Symmetry is not checked; only the lower triangular portion is used. The determinant and inverse of cov are computed as the pseudo-determinant and pseudo-inverse, respectively, so that cov does not need to have full rank.
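A minimal sketch of the accepted forms of cov on a frozen distribution (the dimension and the numeric values are arbitrary):
import scipy.stats as sps

dist_scalar = sps.multivariate_normal(mean=[0, 0], cov=2.0)                       # 2 * identity matrix
dist_diag   = sps.multivariate_normal(mean=[0, 0], cov=[1.0, 4.0])                # diagonal entries
dist_full   = sps.multivariate_normal(mean=[0, 0], cov=[[2.0, 0.3], [0.3, 0.5]])  # full matrix
print(dist_full.pdf([0, 0]))         # density at the origin
print(dist_full.rvs(size=3).shape)   # (3, 2)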
The probability density function for multivariate_normal is

\[ f(x) = \frac{1}{\sqrt{(2 \pi)^k \det \Sigma}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right), \]

where \(\mu\) is the mean, \(\Sigma\) the covariance matrix, and \(k\) the rank of \(\Sigma\). In case of singular \(\Sigma\), SciPy extends this definition according to [1].
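For a non-singular \(\Sigma\) this density can be checked numerically against the formula; a sketch with an arbitrary point and covariance matrix:
import numpy as np
import scipy.stats as sps

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.3], [0.3, 0.5]])
x = np.array([1.0, -1.0])
k = len(mu)

# density computed directly from the formula above
manual = (np.exp(-0.5 * (x - mu) @ np.linalg.inv(Sigma) @ (x - mu))
          / np.sqrt((2 * np.pi) ** k * np.linalg.det(Sigma)))
print(np.isclose(manual, sps.multivariate_normal(mean=mu, cov=Sigma).pdf(x)))  # True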