- Data Cleaning Steps with Python and Pandas
- What is Data Cleaning? What is dirty Data?
- Setup
- Step 1: Exploratory data analysis in Python and Pandas
- Step 2: First rows as header read_csv in Pandas
- Step 3: Data tidying in Pandas
- Step 4: Correcting and replacing data in Pandas
- 4.1. Replace values in column — Pandas
- 4.2. Fill NaN with string or 0 — Pandas
- Step 5: Detect NaN values in column Pandas
- 5.1 Columns which contain only NaN values
- 5.2 Detect columns with NaN values
- Step 6: Drop columns in Pandas
- 6.1 Drop one column by name
- 6.2 Drop multiple columns by name
- 6.3 Drop columns with NaN values
- Step 7: Detect and drop duplicate rows in Pandas
- Step 8: Detect outliers in Pandas
- Step 9: Detect errors, typos and misspelling in Pandas
- Conclusion
Data Cleaning Steps with Python and Pandas
Often we may need to clean the data using Python and Pandas.
This tutorial explains the basic steps for data cleaning by example:
- Basic exploratory data analysis
- Detect and remove missing data
- Drop unnecessary columns and rows
- Detect outliers
- Inconsistent data
- Irrelevant features
What is Data Cleaning? What is dirty Data?
First let's see what dirty data is:
dirty data is inaccurate, incomplete or inconsistent data
The common features of dirty data are:
- spelling or punctuation errors
- incorrect data associated with a field
- incomplete data
- outdated data
- duplicated records
The process of fixing all issues above is known as data cleaning or data cleansing.
Usually the data cleaning process has several steps:
- normalization (optional)
- detect bad records
- correct problematic values
- remove irrelevant or inaccurate data
- generate report (optional)
At the end of the process the data should be:
- complete
- up to date
- accurate
- correct
- consistent
- relevant
- normalized
What is the difference between tidy data and clean data, or data tidying vs data cleaning? Data cleaning is related to data quality, while data tidying is related to data structure.
Setup
In this post we will use data from Kaggle: A Short History of the Data-science (the 2019 Kaggle Machine Learning and Data Science Survey).
To download the data you can use the following code (Kaggle API credentials must be configured):
```python
import kaggle

link = 'eswarankrishnasamy/2019-kaggle-machine-learning-data-science-survey'
kaggle.api.authenticate()
kaggle.api.dataset_download_file(link, file_name='multiple_choice_responses.csv', path='data/')
```
The downloaded data can be read by:

```python
import pandas as pd

df = pd.read_csv('data/multiple_choice_responses.csv.zip', low_memory=False)
```
| Time from Start to Finish (seconds) | Q1 | Q2 | Q2_OTHER_TEXT | Q3 |
|---|---|---|---|---|
| Duration (in seconds) | What is your age (# years)? | What is your gender? — Selected Choice | What is your gender? — Prefer to self-describe — Text | In which country do you currently reside? |
| 510 | 22-24 | Male | -1 | France |
| 423 | 40-44 | Male | -1 | India |
| 83 | 55-59 | Female | -1 | Germany |
| 391 | 40-44 | Male | -1 | Australia |
Step 1: Exploratory data analysis in Python and Pandas
To start we can do basic exploratory data analysis in Pandas. This will show us more about data:
The first method is `head()`, which returns the first 5 rows of the dataset.
To see the first 5 rows and 5 columns we can do: `df.iloc[0:5, 0:5]`
The result is truncated for the first 5 columns:
|   | Time from Start to Finish (seconds) | Q1 | Q2 | Q2_OTHER_TEXT | Q3 |
|---|---|---|---|---|---|
| 0 | Duration (in seconds) | What is your age (# years)? | What is your gender? — Selected Choice | What is your gender? — Prefer to self-describe — Text | In which country do you currently reside? |
| 1 | 510 | 22-24 | Male | -1 | France |
| 2 | 423 | 40-44 | Male | -1 | India |
| 3 | 83 | 55-59 | Female | -1 | Germany |
| 4 | 391 | 40-44 | Male | -1 | Australia |
Next we can see the number of rows and columns with `df.shape`.
The result is the tuple `(19718, 246)`: 19718 rows and 246 columns.
We can get similar information with `df.info()`:
```
RangeIndex: 19718 entries, 0 to 19717
Columns: 246 entries, Time from Start to Finish (seconds) to Q34_OTHER_TEXT
dtypes: object(246)
memory usage: 37.0+ MB
```
Finally, we can get more detailed information about the data values with the method `describe()`. It generates descriptive statistics (summarizing the central tendency, dispersion and shape of a dataset's distribution, excluding NaN values):
|   | Time from Start to Finish (seconds) | Q1 | Q2 | Q2_OTHER_TEXT | Q3 |
|---|---|---|---|---|---|
| count | 19718 | 19718 | 19718 | 19718 | 19718 |
| unique | 4169 | 12 | 5 | 46 | 60 |
| top | 450 | 25-29 | Male | -1 | India |
| freq | 42 | 4458 | 16138 | 19668 | 4786 |
Step 2: First rows as header read_csv in Pandas
So far we saw that the first row contains data which belongs to the header. We need to change how we read the data with `header=[0,1]`:
```python
df = pd.read_csv('data/multiple_choice_responses.csv.zip', low_memory=False, header=[0,1])
```
In order to simplify reading the data, we can drop a single level from the multi-index.
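One way to do that, sketched on a toy frame (the column names here are only illustrative), is to drop the level that holds the long question text:

```python
import pandas as pd

# toy frame with a two-level header, mirroring the survey structure
df = pd.DataFrame(
    [[510, '22-24'], [423, '40-44']],
    columns=pd.MultiIndex.from_tuples([
        ('Time from Start to Finish (seconds)', 'Duration (in seconds)'),
        ('Q1', 'What is your age (# years)?'),
    ]),
)
df.columns = df.columns.droplevel(1)  # keep only the short names
print(list(df.columns))
```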
Step 3: Data tidying in Pandas
Next we can do data tidying, because tidy data helps Pandas's vectorized operations.
For example, to read column 'Q1' we need to use the multi-index:

```python
df[('Q1', 'What is your age (# years)?')]
```

```
0    22-24
1    40-44
2    55-59
3    40-44
4    22-24
```
Can we split that into two columns? It looks like all values are two numbers separated by a hyphen '-'. It is best to confirm that observation by:
```python
df[('Q1', 'What is your age (# years)?')].value_counts()
```
The last rows show us one record, 70+, which needs special attention:

```
45-49    949
50-54    692
55-59    422
60-69    338
70+      100
dtype: int64
```
If we perform the split operation on rows containing 70+, the result is:

```
0     70+
1    None
Name: 182, dtype: object
```
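The split itself can be sketched on a toy Series (the values are illustrative):

```python
import pandas as pd

# toy series mirroring the 'Q1' age buckets
ages = pd.Series(['22-24', '40-44', '70+'])
parts = ages.str.split('-', expand=True)  # one column per part
print(parts[1].tolist())
```

Note that the row without a hyphen produces None in the second column, which is exactly the case that needs attention.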
Step 4: Correcting and replacing data in Pandas
Next we can see how to correct the data above. We can do data correction of cases 70+ in two ways:
4.1. Replace values in column — Pandas
To replace the values in the column we can use the method `.str.replace('70+', '70-120', regex=False)` as follows:

```python
df['Q1'].str.replace('70+', '70-120', regex=False)
```
4.2. Fill NaN with string or 0 — Pandas
The other option is to fill the missing values produced by the split with `fillna`. Suppose that after the split we created a new column 'max_age'.
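A sketch of the split-and-fill, where the column name 'max_age' follows the hypothetical naming from the text:

```python
import pandas as pd

ages = pd.Series(['22-24', '70+'])
parts = ages.str.split('-', expand=True)
# the 70+ row has no upper bound, so fill the gap with 0
max_age = parts[1].fillna(0)
print(max_age.tolist())
```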
Step 5: Detect NaN values in column Pandas
Now let’s see how we can detect NaN values. This will help us drop columns with NaN values.
5.1 Columns which contain only NaN values
To find columns which have only NaN values we can use the method `df.isna().all()`.
This gives a new Series with True or False for each column name, depending on the NaN values. If a column has only NaN values we get True.
To list the names of those columns we can use `df.columns[df.isna().all()]`.
In this dataset there's no column which contains only NaN values.
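Both checks can be sketched on a toy frame (the column names are illustrative):

```python
import pandas as pd
import numpy as np

# one fully populated column, one partially populated
df = pd.DataFrame({'Q1': ['22-24', '40-44'], 'Q34_Part_9': [np.nan, 'Python']})
only_nan = df.isna().all()   # True only when every value in a column is NaN
cols = df.columns[only_nan]  # names of the all-NaN columns
print(list(cols))
```

Here the list is empty, matching the survey data: no column consists entirely of NaN values.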
5.2 Detect columns with NaN values
To detect columns which have NaN values we can use `df.isna().any()`:

```
Time from Start to Finish (seconds)    False
Q1                                     False
Q2                                     False
Q2_OTHER_TEXT                          False
Q3                                     False
                                       ...
Q34_Part_9                              True
Q34_Part_10                             True
Q34_Part_11                             True
Q34_Part_12                             True
Q34_OTHER_TEXT                         False
Length: 246, dtype: bool
```
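A minimal sketch of the same check on a toy frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Q1': ['22-24', '40-44'], 'Q34_Part_9': [np.nan, 'Python']})
has_nan = df.isna().any()  # True where a column holds at least one NaN
print(has_nan.tolist())
```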
So columns like ‘Q34_Part_9’ have NaN values. Columns like ‘Q1’ don’t have NaN values.
Step 6: Drop columns in Pandas
Let’s say that we would like to drop columns based on name or NaN values. We can do that in several ways:
6.1 Drop one column by name
The parameters needed to drop a column are axis=1 and inplace=True, meaning the operation modifies the DataFrame in place:

```python
df.drop('Q1', axis=1, inplace=True)
```
6.2 Drop multiple columns by name
We can list several columns to be removed:

```python
df.drop(['Q1', 'Q2'], axis=1, inplace=True)
```
6.3 Drop columns with NaN values
Finally we can drop columns which have NaN values with `df.dropna(axis=1)`.
We can use parameters like:
- how — 'all' or 'any'
- subset — list of labels to consider
- thresh — require that many non-NaN values in order to keep the column
- inplace
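For example, dropping only the fully empty columns (a sketch on a toy frame with an illustrative 'empty' column):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Q1': ['22-24', '40-44'], 'empty': [np.nan, np.nan]})
# drop every column consisting entirely of NaN values
cleaned = df.dropna(axis=1, how='all')
print(list(cleaned.columns))
```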
Step 7: Detect and drop duplicate rows in Pandas
To detect duplicate rows in a Pandas DataFrame we can use the method `df.duplicated()`.
In this dataset it detects 4 duplicated rows.
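On a toy frame the detection looks like this (the values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Q1': ['22-24', '40-44', '22-24'],
                   'Q2': ['Male', 'Male', 'Male']})
dup = df.duplicated()  # True for every repeat of an earlier row
print(int(dup.sum()))
```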
We can use the parameter `keep` to control which duplicates are marked. For example, to get the indexes of all detected duplicates we can pass keep=False, which marks every copy:
```python
df[df.duplicated(keep=False)].index
```

```
Int64Index([11228, 12344, 16413, 16547, 16653, 18705, 19258, 19705], dtype='int64')
```
Since we have 246 columns (answers) it’s pretty suspicious that there are full duplications.
We can use the method `df.drop_duplicates()` in order to drop duplicated rows in Pandas, optionally limited to a subset of the columns:

```python
df.drop_duplicates(subset=['Q1', 'Q2'])
```
Step 8: Detect outliers in Pandas
We can detect outliers in Pandas in many ways. Here we will cover basic detection of numeric data:
Suppose we work with column: ‘Time from Start to Finish (seconds)’
We can see the min, max and the percentiles by:
```python
df['Time from Start to Finish (seconds)'].describe()
```

```
count     19717.000000
mean      14341.281027
std       74166.106601
min          23.000000
25%         340.000000
50%         540.000000
75%         930.000000
max      843612.000000
Name: (Time from Start to Finish (seconds), Duration (in seconds)), dtype: float64
```
So we have survey times from 23 up to 843612 seconds. Probably we can exclude some of them.
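One simple numeric rule, not from the original text but a common rule of thumb, is the 1.5 × IQR fence (sketched here on a few illustrative values):

```python
import pandas as pd

times = pd.Series([23, 340, 540, 930, 843612])
q1, q3 = times.quantile(0.25), times.quantile(0.75)
iqr = q3 - q1
# flag values beyond 1.5 * IQR from the quartiles as outliers
outliers = times[(times < q1 - 1.5 * iqr) | (times > q3 + 1.5 * iqr)]
print(outliers.tolist())
```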
Another way to detect outliers is visually, by plotting the data:
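For example a histogram (a sketch; matplotlib is assumed to be installed and the values are illustrative):

```python
import matplotlib
matplotlib.use('Agg')  # render without a display
import matplotlib.pyplot as plt
import pandas as pd

times = pd.Series([23, 340, 540, 930, 843612])
ax = times.plot(kind='hist', bins=20, title='Survey duration (seconds)')
plt.savefig('durations.png')
```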
From the image above we can decide what is the threshold which makes sense for us.
Step 9: Detect errors, typos and misspelling in Pandas
Finally let’s check how we can detect typos and misspelled words in Pandas DataFrame. This will show how we can work with inconsistent or incomplete data.
For this purpose we are going to read the file 'other_text_responses.csv' into a DataFrame df_other. The reason is that it contains free text input.
Let's read one of the free-text columns of this DataFrame (index 3) by:

```python
df_other[df_other.columns[3]].value_counts().head(10)
```
```
Excel              865
Microsoft Excel    392
excel              263
MS Excel            67
Google Sheets       61
Google sheets       44
Microsoft excel     38
Excel               33
microsoft excel     27
EXCEL               25
```
We can see different variations of the same tool — Excel.
In order to detect similar values we will use Python library difflib :
```python
import difflib

difflib.get_close_matches('excl', ['Excel', 'Microsoft Excel ', 'MS Excel', 'excel'], n=1, cutoff=0.7)
```
The result of this will be `['excel']`, the closest match above the cutoff.
So we can use Python in order to detect and fix misspelled words.
Code like the one below can help us create a new column with corrected values:

```python
import difflib

correct_values = {}
words = df_other["Q14_Part_3_TEXT"].value_counts(ascending=True).index

# iterate from the rarest spelling to the most frequent one,
# so the most frequent similar spelling wins
for keyword in words:
    similar = difflib.get_close_matches(keyword, words, n=20, cutoff=0.6)
    for x in similar:
        correct_values[x] = keyword

df_other["corr"] = df_other["Q14_Part_3_TEXT"].map(correct_values)
```
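On a toy column the same idea behaves like this (the spellings are illustrative; because the counts ascend, the most frequent spelling overwrites the rarer ones in the mapping):

```python
import difflib
import pandas as pd

col = pd.Series(['Excel', 'excel', 'MS Excel', 'Excel'])
words = col.value_counts(ascending=True).index  # rarest spellings first
correct_values = {}
for keyword in words:
    for x in difflib.get_close_matches(keyword, words, n=20, cutoff=0.6):
        correct_values[x] = keyword  # later (more frequent) keywords win
corrected = col.map(correct_values)
print(corrected.tolist())
```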
Conclusion
In this article, we learned what clean data is and how to do data cleaning in Pandas and Python.
Some of the topics we discussed are NaN values, duplicates, dropping columns and rows, and outlier detection.
We saw all the steps of the data cleaning process with examples. We covered important topics like tidy data and data quality.