Outlier Detection in Python

How to Perform Outlier Detection in Python for Machine Learning: Part 1

We live on an outlier. Earth is the only lump of rock in the Milky Way galaxy known to harbor life. The other planets in our galaxy are inliers, or normal data points, in a so-called database of stars and planets.

There are many definitions of outliers. In simple terms, we define outliers as data points that are significantly different from the majority in a dataset. Outliers are the rare, extreme samples that don’t conform or align with the inliers in a dataset.

Statistically speaking, outliers come from a different distribution than the rest of the samples in a feature. They present statistically significant abnormalities.

These definitions depend on what we consider "normal". For example, it is perfectly normal for CEOs to make millions of dollars, but if we add their salary information to a dataset of household incomes, they become abnormal.

Outlier detection is the field of statistics and machine learning that uses various techniques and algorithms to detect such extreme samples.


Check out the second part of the series here:

How to Perform Univariate Outlier Detection in Python for Machine Learning

Univariate outlier detection, clearly explained

Why bother with outlier detection?

But why, though? Why do we need to find them? What’s the harm in them? Well, consider this distribution of 13 numbers, twelve of which range from 50 to 100. The remaining data point is 2534, which is clearly an outlier.

import numpy as np

array = [97, 87, 95, 62, 53, 66, 2534, 60, 68, 90, 52, 63, 65]
array
[97, 87, 95, 62, 53, 66, 2534, 60, 68, 90, 52, 63, 65]

Mean and standard deviation are two of the most heavily-used and critical attributes of a distribution, so we must feed realistic values of these two metrics when fitting machine learning models.
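To see the damage concretely, here is a small sketch (my own illustration, not code from the original article) comparing the mean and standard deviation with and without the extreme value:

import numpy as np

array = [97, 87, 95, 62, 53, 66, 2534, 60, 68, 90, 52, 63, 65]

print(np.mean(array))   # about 261, far above every "typical" value
print(np.std(array))    # about 656, inflated by the single extreme point

trimmed = [x for x in array if x != 2534]
print(np.mean(trimmed))  # about 71.5, a much more realistic centre
print(np.std(trimmed))   # about 15.5, a much more realistic spread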


Detect and Remove the Outliers using Python

An outlier is a data item/object that deviates significantly from the rest of the (so-called normal) objects. Outliers can be caused by measurement or execution errors. The analysis used for outlier detection is referred to as outlier mining. There are many ways to detect outliers, and the removal process is the same as removing any other data item from a pandas DataFrame.

Here a pandas DataFrame is used for a more realistic approach, since in real-world projects the need to detect outliers typically arises during the data analysis step; the same approach can be used on lists and Series-type objects.

Dataset Used For Outlier Detection

The dataset used in this article is the Diabetes dataset and it is preloaded in the sklearn library.
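A minimal sketch of how the data can be loaded into a pandas DataFrame (the variable name df_diabetics is my own choice, not necessarily the one used in the original code):

import pandas as pd
from sklearn.datasets import load_diabetes

# load the diabetes dataset bundled with scikit-learn
diabetes = load_diabetes()
df_diabetics = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

print(df_diabetics.head())  # first five rows of the dataset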


first five rows of the dataset

Outliers can be detected using visualization, implementing mathematical formulas on the dataset, or using the statistical approach. All of these are discussed below.

Outliers Visualization

Visualizing Outliers Using Box Plot

A box plot captures the summary of the data effectively and efficiently with only a simple box and whiskers. It summarizes sample data using the 25th, 50th, and 75th percentiles. One can get insights (quartiles, median, and outliers) into a dataset just by looking at its box plot.
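A quick sketch of such a box plot for the 'bmi' column, assuming the df_diabetics DataFrame from above and seaborn for plotting:

import seaborn as sns
import matplotlib.pyplot as plt

# box plot of the 'bmi' column; points beyond the whiskers are potential outliers
sns.boxplot(x=df_diabetics['bmi'])
plt.show()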


Outliers present in the 'bmi' column

In the above graph, we can clearly see that the values above 10 act as outliers.
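A sketch of how the positions of those points can be fetched with NumPy (the cut-off of 10 simply mirrors the observation above; adjust it to whatever your own box plot shows):

import numpy as np

# row indices whose 'bmi' value lies above the chosen cut-off
print(np.where(df_diabetics['bmi'] > 10))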


(array([ 32, 145, 256, 262, 366, 367, 405]),)

Visualizing Outliers Using a Scatter Plot

A scatter plot is used when you have paired numerical data, when the dependent variable has multiple values for each reading of the independent variable, or when you are trying to determine the relationship between two variables. In the process, a scatter plot can also be used for outlier detection.

To plot a scatter plot we need two variables that are somehow related to each other. Here, the 'bmi' and 'bp' columns of the dataset are used, and their relationship is examined for outlying points.
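A sketch of the scatter plot, again assuming the df_diabetics DataFrame from above:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(df_diabetics['bmi'], df_diabetics['bp'])
ax.set_xlabel('bmi')
ax.set_ylabel('bp')
plt.show()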


Scatter plot of bp and bmi

Looking at the graph, we can see that most of the data points lie in the bottom-left corner, but there are a few points in exactly the opposite, top-right corner of the graph. Those points in the top-right corner can be regarded as outliers.

As an approximation, we can say that all the data points with x > 20 and y > 600 are outliers. The following code can fetch the exact position of all the points that satisfy these conditions.

Outliers in BMI and BP Column Combined
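A sketch of how those positions can be retrieved; the two cut-off values are placeholders taken from the text above and should be replaced by whatever thresholds match your own plot:

import numpy as np

x_cut, y_cut = 20, 600  # placeholder thresholds from the text above

# row indices that exceed both cut-offs at once
print(np.where((df_diabetics['bmi'] > x_cut) & (df_diabetics['bp'] > y_cut)))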


(array([ 32, 145, 256, 262, 366, 367, 405]),)

Z-score

Z-score is also called a standard score. This value helps us understand how far a data point is from the mean. After setting a threshold value, one can use the z-scores of data points to define outliers.

Z-score = (data point - mean) / standard deviation
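A sketch of how these scores can be computed with SciPy; the absolute value is taken so that only the distance from the mean matters when comparing against a threshold. The column used here is just an example, since the text does not say which column produced the 506-value output below:

import numpy as np
from scipy import stats

# absolute z-score of every value in one numeric column
z = np.abs(stats.zscore(df_diabetics['bmi']))
print(z)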


[0.80050009 0.03956713 1.79330681 1.87244107 0.11317236 1.94881082 0.9560041 1.33508832 0.87686984 1.49059233 2.02518057 0.57139085 0.34228161 0.11317236 0.95323959 1.1087436 0.11593688 1.48782782 0.80326461 0.57415536 1.03237385 1.79607132 1.79607132 0.95323959 1.33785284 1.41422259 2.25428981 0.49778562 1.10597908 1.41145807 1.26148309 0.49778562 0.72413034 0.6477606 0.34228161 1.02960933 0.26591186 0.19230663 0.03956713 0.03956713 0.11317236 2.10155031 1.26148309 0.41865135 0.95323959 0.57139085 1.18511334 1.64333183 1.41145807 0.87963435 0.72413034 1.25871858 1.1087436 0.19230663 1.03237385 0.87963435 0.87963435 0.57415536 0.87686984 1.33508832 1.49059233 0.87963435 0.57415536 0.72689486 1.41145807 0.9560041 0.19230663 0.87686984 0.80050009 0.34228161 0.03956713 0.03956713 1.33508832 0.26591186 0.26591186 0.19230663 0.65052511 2.02518057 0.11317236 2.17792006 1.48782782 0.26591186 0.34504612 0.80326461 0.03680262 0.95323959 1.49059233 0.95323959 1.1087436 0.9560041 0.26591186 0.95323959 0.42141587 1.03237385 1.64333183 1.49059233 1.18234883 0.57415536 0.03680262 0.03956713 0.34228161 0.34228161]

The above output is just a snapshot of part of the data; the actual length of the list z is 506, which is the number of rows. It prints the z-score value of each data item in the column.

Now, to define outliers, a threshold value is chosen, generally 3.0, since 99.7% of the data points lie within +/- 3 standard deviations of the mean (under the Gaussian-distribution assumption).
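A sketch of applying such a threshold to the z array computed above (the cut-off of 2 matches the caption below, even though 3.0 is the more common choice):

import numpy as np

threshold = 2
# positions of the rows whose absolute z-score exceeds the threshold
print(np.where(z > threshold))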

Rows where Z value is greater than 2


Detecting And Treating Outliers In Python — Part 1

An Exploratory Data Analysis (EDA) is crucial when working on data science projects. Knowing your data inside and out can simplify decision making concerning the selection of features, algorithms, and hyperparameters. One essential part of EDA is the detection of outliers. Simply put, outliers are observations that are far away from the other data points in a random sample of a population.

But why can outliers cause problems?

Because in data science, we often want to make assumptions about a specific population. Extreme values, however, can have a significant impact on conclusions drawn from data or machine learning models. With outlier detection and treatment, anomalous observations are viewed as part of different populations to ensure stable findings for the population of interest.

When identified, outliers may reveal unexpected knowledge about a population, which also justifies their special handling during EDA.

Moreover, inaccuracies in data collection and processing can create so-called error-outliers. These measurements often do not belong to the population we are interested in and therefore need treatment.

Different types of outliers

One must distinguish between univariate and multivariate outliers. Univariate outliers are extreme values in the distribution of a specific variable, whereas multivariate outliers are a combination of values in an observation that is unlikely. For example, a univariate outlier could be a human age measurement of 120 years or a temperature measurement in Antarctica of 50 degrees Celsius.

A multivariate outlier could be an observation of a human with a height measurement of 2 meters (in the 95th percentile) and a weight measurement of 50kg (in the 5th percentile). Both types of outliers can affect the outcome of an analysis but are detected and treated differently.

Tutorial on univariate outliers using Python

This first post will deal with the detection of univariate outliers, followed by a second article on multivariate outliers. In a third article, I will write about how outliers of both types can be treated.

Outliers can be discovered in various ways, including statistical methods, proximity-based methods, or supervised outlier detection. In this article series, I will solely focus on commonly used statistical methods.

I will use the Boston housing data set for illustration and provide example code in Python (3), so you can easily follow along. The Boston housing data set is part of the sklearn library.
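A minimal sketch of loading the data into a pandas DataFrame. Note that load_boston has been removed from recent scikit-learn releases, so this version fetches the same data from its original source; the column names follow the usual Boston housing convention and the variable name boston_df is my own choice:

import numpy as np
import pandas as pd

# fetch the Boston housing data from the original CMU source
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
data = np.hstack([raw.values[::2, :], raw.values[1::2, :2]])

columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
           'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
boston_df = pd.DataFrame(data, columns=columns)
print(boston_df.shape)  # (506, 13)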

It appears there are three variables, namely AGE, INDUS, and RAD, with no univariate outlier observations. The remaining variables all have data points beyond their whiskers.

Let’s look closer into the variable ‘CRIM’, which encodes the crime rate per capita by town. The individual box plot below shows that the crime rate in most towns is below 5%.
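A quick sketch of that individual box plot, assuming the boston_df DataFrame from above and seaborn for plotting:

import seaborn as sns
import matplotlib.pyplot as plt

# box plot of the per-capita crime rate
sns.boxplot(x=boston_df['CRIM'])
plt.show()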

Box plots are great for summarizing and visualizing the distribution of variables easily and quickly. However, they do not identify the actual indexes of the outlying observations. In the following, I will discuss three quantitative methods commonly used in statistics for the detection of univariate outliers:

  • Tukey’s box plot method
  • Internally studentized residuals (AKA z-score method)
  • Median Absolute Deviation method

Tukey’s box plot method

Next to its visual benefits, the box plot provides useful statistics to identify individual observations as outliers. Tukey distinguishes between possible and probable outliers. A possible outlier is located between the inner and the outer fence, whereas a probable outlier is located outside the outer fence.

While the inner fence (often confused with the whiskers) and the outer fence are usually not shown on the actual box plot, they can be calculated using the interquartile range (IQR) like this:

IQR = Q3 - Q1, where Q3 is the 75th percentile and Q1 is the 25th percentile

Inner fence = [Q1 - 1.5*IQR, Q3 + 1.5*IQR]

Outer fence = [Q1 - 3*IQR, Q3 + 3*IQR]

The distribution’s inner fence is defined as 1.5 x IQR below Q1, and 1.5 x IQR above Q3. The outer fence is defined as 3 x IQR below Q1, and 3 x IQR above Q3. Following Tukey, only the probable outliers are treated, which lie outside the outer fence. For the underlying example, this means:
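A rough sketch of how the fences and the probable outliers of 'CRIM' can be computed from the formulas above (my own illustration, not the article's original code):

import numpy as np

crim = boston_df['CRIM']
q1, q3 = np.percentile(crim, [25, 75])
iqr = q3 - q1

inner_fence = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
outer_fence = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)

# probable outliers lie outside the outer fence
probable = crim[(crim < outer_fence[0]) | (crim > outer_fence[1])]
print(inner_fence, outer_fence, len(probable))

Internally studentized residuals (z-score method)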

Following a common rule of thumb, if z > C, where C is usually set to 3, the observation is marked as an outlier. This rule stems from the fact that if a variable is normally distributed, 99.7% of all data points lie within 3 standard deviations of the mean. Let’s see, for our example, which observations of ‘CRIM’ are detected as outliers using the z-score:
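A sketch of that check (again my own illustration, with the cut-off C set to 3):

import numpy as np

crim = boston_df['CRIM']
z_scores = (crim - crim.mean()) / crim.std()

# observations whose absolute z-score exceeds C = 3
print(crim[np.abs(z_scores) > 3])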

Nevertheless, the externally studentized residuals have limitations, as the mean and standard deviation are still sensitive to other outliers and the variable of interest X is still expected to be normally distributed.

Median Absolute Deviation method

The median absolute deviation method (MAD) replaces the mean and standard deviation with more robust statistics, namely the median and the median absolute deviation. The median absolute deviation is defined as MAD = median(|Xi - median(X)|), i.e. the median of the absolute deviations of the data points from the median.

The test statistic is calculated like the z-score, but using these robust statistics. To identify outlying observations, the same cut-off point of 3 is used: if the test statistic lies above 3, the observation is marked as an outlier. Compared to the internally (z-score) and externally studentized residuals, this method is more robust to outliers and does not assume X to be parametrically distributed.

Let’s see how many outliers are detected for variable ‘CRIM’ using the MAD method.
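A sketch of such a check using the common modified z-score formulation, where the constant 0.6745 makes the MAD comparable to the standard deviation for normally distributed data (this is an illustration; it may not reproduce the article's exact count below):

import numpy as np

crim = boston_df['CRIM']
med = np.median(crim)
mad = np.median(np.abs(crim - med))

# robust test statistic, compared against the same cut-off of 3
robust_z = 0.6745 * (crim - med) / mad
print((np.abs(robust_z) > 3).sum())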

We can see that the MAD method detects 165 outliers for the crime rate per capita by town, the most of all the methods considered.

Wrapping up

There are different ways to detect univariate outliers, each one coming with advantages and disadvantages. The z-score needs to be applied critically due to its sensitivity to mean and standard deviation and its assumption of a normally distributed variable. The MAD method is often used instead and serves as a more robust alternative. Tukey’s box plot method offers robust results and can be easily extended when the data is highly skewed.

To decide on the right approach for your own data set, closely examine your variables’ distribution, and use your domain knowledge.

In the next posting, I will address the detection of multivariate outliers.

