- How to get odds-ratios and other related features with scikit-learn in Python?
- Method 1: Using the StatsModels Library
- Step 1: Import Required Libraries
- Step 2: Load Data
- Step 3: Prepare Data
- Step 4: Fit Logistic Regression Model
- Step 5: Get Odds-Ratios and Other Related Features
- Step 6: Print Results
- Method 2: Using the ResearchPy Library
- Step 1: Install ResearchPy Library
- Step 2: Import Required Libraries
- Step 3: Load Data
- Step 4: Create Crosstab and Perform Chi-Square Test
- Step 5: Obtain Odds Ratio
- Step 6: Obtain Other Related Features
- Method 3: Using the Pandas Crosstab Function
- scipy.stats.contingency.odds_ratio
- Interpretation of Odds Ratio and Fisher’s Exact Test
- Calculation of the Odds Ratio and Applying Fisher’s Test on Python
- Example Case
How to get odds-ratios and other related features with scikit-learn in Python?
An odds ratio is a statistical measure, widely used in epidemiology and medical statistics, that describes the relationship between the presence or absence of an attribute and the odds of a particular outcome. In Python, you can combine scikit-learn with libraries such as StatsModels, ResearchPy, and SciPy to calculate odds ratios and related quantities, such as contingency tables, confidence intervals, and p-values. Here are some methods to get odds ratios and related features:
Method 1: Using the StatsModels Library
To get odds-ratios and other related features with Scikit-learn using the StatsModels library, you can follow these steps:
Step 1: Import Required Libraries
```python
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression  # optional; not used below
```
Step 2: Load Data
```python
data = pd.read_csv('data.csv')  # replace 'data.csv' with your data file
```
Step 3: Prepare Data
```python
X = data[['feature1', 'feature2', 'feature3']]
y = data['target']
```
Step 4: Fit Logistic Regression Model
```python
X = sm.add_constant(X)  # sm.Logit does not add an intercept automatically
logit_model = sm.Logit(y, X)
result = logit_model.fit()
```
Step 5: Get Odds-Ratios and Other Related Features
```python
coefficients = result.params
odds_ratio = np.exp(coefficients)
p_values = result.pvalues
```
Step 6: Print Results
```python
print("Coefficients:\n", coefficients)
print("\nOdds Ratios:\n", odds_ratio)
print("\nP-Values:\n", p_values)
```
The above code will fit a logistic regression model using the StatsModels library and get the odds-ratios and other related features. The coefficients variable will contain the coefficients of the model, odds_ratio will contain the odds-ratios, and p_values will contain the p-values of the model.
Note: make sure to replace data.csv with the name of your data file.
Method 2: Using the ResearchPy Library
ResearchPy is a Python library that simplifies the process of doing statistics in Python. It provides a simple interface to perform statistical tests, such as t-tests, ANOVA, and correlation analysis. In this tutorial, we will learn how to use ResearchPy library to obtain odds ratios and other related features.
Step 1: Install ResearchPy Library
Before we can start using the ResearchPy library, we need to install it with pip:
```shell
pip install researchpy
```
Step 2: Import Required Libraries
Next, we need to import the required libraries. We will be using pandas, numpy, and researchpy libraries for this tutorial.
```python
import pandas as pd
import numpy as np
import researchpy as rp
```
Step 3: Load Data
For this tutorial, we will be using the famous Titanic dataset. We will load the data using pandas library.
```python
url = 'https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv'
df = pd.read_csv(url)
```
Step 4: Create Crosstab and Perform Chi-Square Test
We will create a crosstab between two categorical variables, ‘Sex’ and ‘Survived’, and perform a chi-square test to check if there is any significant association between the two variables.
```python
cross_tab, test_results = rp.crosstab(df['Sex'], df['Survived'], test='chi-square')
test_results
```
The output of the above code will show the chi-square test results.
Step 5: Obtain Odds Ratio
For a 2×2 table, ResearchPy reports the odds ratio when Fisher's exact test is requested through rp.crosstab (the exact row labels in the results table vary between ResearchPy versions, so inspect the output):
```python
cross_tab, fisher_results = rp.crosstab(df['Sex'], df['Survived'], test='fisher')
fisher_results
```
The output of the above code will show the odds ratio.
Step 6: Obtain Other Related Features
Other related values, such as the test statistic, p-value, and degrees of freedom, appear in the same results table returned by rp.crosstab. Rather than calling the function repeatedly, run it once and read the values you need (row labels vary slightly between ResearchPy versions):
```python
cross_tab, results = rp.crosstab(df['Sex'], df['Survived'], test='chi-square')
print(results)
```
The output will show the test statistic with its degrees of freedom, the p-value, and an effect-size measure.
In conclusion, we can use ResearchPy library to obtain odds ratios and other related features with ease. By following the above steps, we can perform statistical tests and obtain useful information from our data.
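As a cross-check that does not depend on ResearchPy's output format, the sample odds ratio can always be computed directly from a 2×2 crosstab with the formula (a*d)/(b*c). The counts below are made up for illustration, not taken from the Titanic data:

```python
import pandas as pd

# Illustrative 2x2 table of counts (rows: group, columns: outcome)
cross_tab = pd.DataFrame(
    [[10, 30],   # group A: 10 "yes", 30 "no"
     [25, 15]],  # group B: 25 "yes", 15 "no"
    index=['group_A', 'group_B'],
    columns=['yes', 'no'],
)

a, b = cross_tab.iloc[0]
c, d = cross_tab.iloc[1]
odds_ratio = (a * d) / (b * c)  # sample odds ratio
print(odds_ratio)               # (10*15)/(30*25) = 0.2
```

Comparing this hand-computed value against the library's reported odds ratio is a quick way to confirm you are reading the right row of the results table.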
Method 3: Using the Pandas Crosstab Function
To get an odds ratio with the Pandas crosstab function, build a 2×2 table of counts between two binary variables and apply the sample odds-ratio formula (a*d)/(b*c). Note that scikit-learn's chi2 feature-selection function operates on raw samples rather than contingency tables, so SciPy's chi2_contingency is the appropriate tool for testing the table for independence. The column names below are placeholders for your own data:
```python
import pandas as pd
from scipy.stats import chi2_contingency

data = pd.read_csv('your_data.csv')

# 2x2 table of counts; both variables must be binary
cross_tab = pd.crosstab(data['feature1'], data['feature2'])

# Sample odds ratio from the table cells
a, b = cross_tab.iloc[0]
c, d = cross_tab.iloc[1]
odds_ratio = (a * d) / (b * c)

# Chi-squared test of independence on the same table
chi2_stat, p_value, dof, expected = chi2_contingency(cross_tab)

print('Odds ratio:', odds_ratio)
print('Chi-squared statistic:', chi2_stat)
print('P-value:', p_value)
```
This outputs the sample odds ratio for the 2×2 table together with the chi-squared statistic and p-value for the association between the two variables.
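The same kind of 2×2 table can also be passed to scipy.stats.contingency.odds_ratio (documented below), which returns a conditional odds-ratio estimate along with a confidence interval. The counts here are made up for illustration:

```python
import numpy as np
from scipy.stats.contingency import odds_ratio

# Illustrative 2x2 table of counts
table = np.array([[10, 30],
                  [25, 15]])

res = odds_ratio(table)  # conditional odds ratio (the default kind)
ci = res.confidence_interval(confidence_level=0.95)
print(res.statistic)
print(ci.low, ci.high)
```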
scipy.stats.contingency.odds_ratio
Parameters : table array_like of two dimensions
A 2×2 contingency table. Elements must be non-negative integers.
kind str, optional
Which kind of odds ratio to compute, either the sample odds ratio (kind='sample') or the conditional odds ratio (kind='conditional'). Default is 'conditional'.
Returns : result OddsRatioResult instance
The returned object has a computed attribute, statistic:
- If kind is 'sample', this is the sample (or unconditional) estimate, given by table[0, 0]*table[1, 1]/(table[0, 1]*table[1, 0]).
- If kind is ‘conditional’ , this is the conditional maximum likelihood estimate for the odds ratio. It is the noncentrality parameter of Fisher’s noncentral hypergeometric distribution with the same hypergeometric parameters as table and whose mean is table[0, 0] .
The object has the method confidence_interval that computes the confidence interval of the odds ratio.
The conditional odds ratio was discussed by Fisher (see “Example 1” of [1]). Texts that cover the odds ratio include [2] and [3].
[1] R. A. Fisher (1935), "The logic of inductive inference", Journal of the Royal Statistical Society, Vol. 98, No. 1, pp. 39-82.
[2] Breslow NE, Day NE (1980). Statistical Methods in Cancer Research. Volume I: The Analysis of Case-Control Studies. IARC Sci Publ. (32):5-338. PMID: 7216345. (See section 4.2.)
[3] H. Sahai and A. Khurshid (1996), Statistics in Epidemiology: Methods, Techniques, and Applications, CRC Press LLC, Boca Raton, Florida.
[4] Berger, Jeffrey S. et al. "Aspirin for the Primary Prevention of Cardiovascular Events in Women and Men: A Sex-Specific Meta-analysis of Randomized Controlled Trials." JAMA, 295(3):306-313, DOI:10.1001/jama.295.3.306, 2006.
In epidemiology, individuals are classified as “exposed” or “unexposed” to some factor or treatment. If the occurrence of some illness is under study, those who have the illness are often classified as “cases”, and those without it are “noncases”. The counts of the occurrences of these classes give a contingency table:
|          | exposed | unexposed |
|----------|---------|-----------|
| cases    | a       | b         |
| noncases | c       | d         |
The sample odds ratio may be written (a/c) / (b/d) . a/c can be interpreted as the odds of a case occurring in the exposed group, and b/d as the odds of a case occurring in the unexposed group. The sample odds ratio is the ratio of these odds. If the odds ratio is greater than 1, it suggests that there is a positive association between being exposed and being a case.
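As a quick numeric check of this formula, the following snippet uses hypothetical counts and confirms that the ratio of odds (a/c)/(b/d) equals the cross-product form (a*d)/(b*c):

```python
# Hypothetical counts: exposed/unexposed vs cases/noncases
a, b = 20, 5    # cases:    exposed, unexposed
c, d = 80, 95   # noncases: exposed, unexposed

odds_exposed = a / c              # odds of being a case in the exposed group
odds_unexposed = b / d            # odds of being a case in the unexposed group
sample_or = odds_exposed / odds_unexposed
print(sample_or)                  # equals (a*d)/(b*c) = 4.75
```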
Interchanging the rows or columns of the contingency table inverts the odds ratio, so it is important to understand the meaning of the labels given to the rows and columns of the table when interpreting the odds ratio.
In [4], the use of aspirin to prevent cardiovascular events in women and men was investigated. The study notably concluded:
…aspirin therapy reduced the risk of a composite of cardiovascular events due to its effect on reducing the risk of ischemic stroke in women […]
The article lists studies of various cardiovascular events. Let’s focus on the ischemic stroke in women.
The following table summarizes the results of the experiment in which participants took aspirin or a placebo on a regular basis for several years. Cases of ischemic stroke were recorded:
|                 | Aspirin | Control/Placebo |
|-----------------|---------|-----------------|
| Ischemic stroke | 176     | 230             |
| No stroke       | 21035   | 21018           |
The question we ask is “Is there evidence that the aspirin reduces the risk of ischemic stroke?”
```python
>>> from scipy.stats.contingency import odds_ratio
>>> res = odds_ratio([[176, 230], [21035, 21018]])
>>> res.statistic
0.7646037659999126
```
For this sample, the odds of getting an ischemic stroke for those who have been taking aspirin are 0.76 times that of those who have received the placebo.
To make statistical inferences about the population under study, we can compute the 95% confidence interval for the odds ratio:
```python
>>> res.confidence_interval(confidence_level=0.95)
ConfidenceInterval(low=0.6241234078749812, high=0.9354102892100372)
```
The 95% confidence interval for the conditional odds ratio is approximately (0.62, 0.94).
The fact that the entire 95% confidence interval falls below 1 supports the authors’ conclusion that the aspirin was associated with a statistically significant reduction in ischemic stroke.
Interpretation of Odds Ratio and Fisher’s Exact Test
Calculation of the Odds Ratio and Applying Fisher’s Test on Python
When we work with nominal data, we mostly rely on frequency tables. Unlike numeric data, nominal data supports few statistical methods for drawing conclusions: correlations, confidence intervals, means, medians, and so on apply to numeric data types. Frequency tables are therefore used to interpret nominal data, by examining the frequency values in the table.
Once a frequency table is created, we have numeric values to which statistical methods can be applied. The Chi-Square Goodness of Fit test and the Chi-Square Test of Independence are the best-known methods for checking whether observed frequencies match expected ones; these tests describe the distribution of nominal data. There is also one more metric that provides an overall score for the association between two nominal variables: the odds ratio. The odds ratio applies to nominal variables that have exactly two levels. Fisher’s Exact Test for 2×2 tables tests whether the odds ratio is equal to 1 or not; it can also test whether the odds ratio is greater or less than 1.
In this article, I will explain what the odds ratio is, how to calculate it, and how to test whether it equals 1 in the population. We will see the following sections:
- Example Case
- Odds Ratio and Calculation
- Testing Odds Ratio with Fisher’s Exact Test in Python
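Before working through the example, here is a minimal sketch of Fisher's exact test with scipy.stats.fisher_exact. The alternative='two-sided' option tests whether the odds ratio differs from 1, while 'less' and 'greater' give the one-sided tests; the table values below are placeholders, not data from the example case:

```python
from scipy.stats import fisher_exact

# Hypothetical 2x2 table of counts
table = [[8, 2],
         [1, 5]]

# Two-sided: is the odds ratio different from 1?
odds_ratio, p_two_sided = fisher_exact(table, alternative='two-sided')

# One-sided: is the odds ratio greater than 1?
_, p_greater = fisher_exact(table, alternative='greater')

print(odds_ratio)                 # sample OR = (8*5)/(2*1) = 20.0
print(p_two_sided, p_greater)
```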
Example Case
Suppose that you’re working on clinical data with two basic variables: one shows whether patients are above the average weight, and the other shows whether patients have health problems. Your purpose is to find whether there is any association between being above the average weight and having health problems. So let’s assume that we ran our experiment on 20 different patients and obtained the results shown in the table below.