Python pandas label encoding

Contents

Categorical encoding using Label-Encoding and One-Hot-Encoder
Label Encoding
Label Encoding in Python
One-Hot Encoder
One-Hot Encoding in Python
Conclusion
sklearn.preprocessing.LabelEncoder

Categorical encoding using Label-Encoding and One-Hot-Encoder

In many machine-learning or data-science tasks, the data set contains text or categorical (that is, non-numerical) values. Examples are a color feature with values such as red, orange, blue, and white, or a meal plan with values such as breakfast, lunch, snacks, dinner, and tea. A few algorithms, such as CatBoost and decision trees, can handle categorical values very well, but most algorithms expect numerical values to achieve state-of-the-art results.

Over your learning curve in AI and machine learning, one thing you will notice is that most algorithms work better with numerical inputs. The main challenge faced by an analyst is therefore to convert text/categorical data into numerical data in a way that still lets the algorithm/model make sense of it. Neural networks, which are the basis of deep learning, expect numerical input values.

There are many ways to convert categorical values into numerical values. Each approach has its own trade-offs and its own impact on the feature set. Here, I will focus on two main methods: One-Hot Encoding and Label Encoding. Both encoders are part of the scikit-learn library (one of the most widely used Python libraries) and convert text or categorical data into the numerical data that models expect and perform better with.

Code snippets in this article are in Python, since I am more comfortable with Python. If you need them in R (another language widely used for machine learning), say so in the comments.

Label Encoding

This approach is very simple: it converts each value in a column to a number. Consider a dataset of bridges with a column named bridge-type that has the values below. Though the dataset will contain many more columns, we will focus on a single categorical column to understand label encoding.

BRIDGE-TYPE
Arch
Beam
Truss
Cantilever
Tied Arch
Suspension
Cable

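We choose to encode the text values by assigning a running sequence number to each value, as below:

BRIDGE-TYPE    BRIDGE-TYPE (ENCODED)
Arch           0
Beam           1
Truss          2
Cantilever     3
Tied Arch      4
Suspension     5
Cable          6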

With this, we have completed the label encoding of the variable bridge-type. That’s all label encoding is about. But depending on the data values and the type of data, label encoding introduces a new problem, because it uses number sequencing. The problem with numbers is that they introduce a relation/comparison between the values. There is no inherent order among the bridge types, but looking at the numbers, one might think that the ‘Cable’ bridge type has higher precedence than the ‘Arch’ bridge type. The algorithm might wrongly infer that the data has some kind of hierarchy/order 0 < 1 < 2 … < 6 and might give ‘Cable’ (6) more weight in its calculations than ‘Arch’ (0).
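To make the ordering problem concrete, here is a minimal sketch in plain Python. It assumes the running-sequence codes follow the listing order above (Arch = 0 through Cable = 6); the variable names are illustrative only.

```python
# Running-sequence codes in listing order (assumed: Arch=0 ... Cable=6).
bridge_types = ["Arch", "Beam", "Truss", "Cantilever", "Tied Arch", "Suspension", "Cable"]
code_of = {name: code for code, name in enumerate(bridge_types)}

# Comparisons and arithmetic on the codes are legal, but they have no
# real-world meaning for unordered categories.
cable_outranks_arch = code_of["Cable"] > code_of["Arch"]
midpoint = (code_of["Arch"] + code_of["Cable"]) / 2

print(cable_outranks_arch)           # the codes claim Cable > Arch
print(bridge_types[int(midpoint)])   # the "average" of Arch and Cable is Cantilever
```

A model that treats the column as a plain number sees exactly these comparisons and averages, which is why the encoding can mislead it.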

Let’s consider another column named ‘Safety Level’. Label encoding this column also induces an order/precedence in the numbers, but in the right way. Here the numerical order is not arbitrary, and it makes sense for the algorithm to interpret the safety order 0 < 1 < 2 < 3 < 4 as none < low < medium < high < very high.
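For a column with a genuine order, one possible approach is an ordered pandas categorical, so that the codes follow the intended ranking instead of alphabetical sorting. The sample rows below are made up for illustration; only the level names and their order come from the text.

```python
import pandas as pd

# Level names in the order given in the text; the sample rows are hypothetical.
safety_order = ["none", "low", "medium", "high", "very high"]
safety_df = pd.DataFrame(
    {"Safety_Level": ["low", "very high", "none", "medium", "high"]}
)

# An ordered categorical makes the ranking explicit.
safety_df["Safety_Level"] = pd.Categorical(
    safety_df["Safety_Level"], categories=safety_order, ordered=True
)

# Each code is the position of the value in safety_order.
safety_df["Safety_Level_Code"] = safety_df["Safety_Level"].cat.codes
print(safety_df)
```

Without the explicit `categories` list, the codes would follow alphabetical order (high, low, medium, ...), which would not match the intended ranking.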

Label Encoding in Python

Using category codes approach:

This approach requires the category column to be of ‘category’ datatype. By default, a non-numerical column is of ‘object’ type. So you might have to change type to ‘category’ before using this approach.

# import required libraries
import pandas as pd
import numpy as np

# creating initial dataframe
bridge_types = ('Arch','Beam','Truss','Cantilever','Tied Arch','Suspension','Cable')
bridge_df = pd.DataFrame(bridge_types, columns=['Bridge_Types'])

# converting type of column to 'category'
bridge_df['Bridge_Types'] = bridge_df['Bridge_Types'].astype('category')

# assigning numerical values and storing them in another column
bridge_df['Bridge_Types_Cat'] = bridge_df['Bridge_Types'].cat.codes
bridge_df
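One detail worth checking when using category codes: pandas infers the categories in sorted order, so the codes follow alphabetical order rather than the order in which the values appear. The sketch below, with illustrative variable names, prints the code-to-label mapping so the assignment can be inspected.

```python
import pandas as pd

bridge_types = ["Arch", "Beam", "Truss", "Cantilever", "Tied Arch", "Suspension", "Cable"]
bridge_series = pd.Series(bridge_types, dtype="category")

# Inferred categories are stored sorted; each code is an index into that list.
code_to_label = dict(enumerate(bridge_series.cat.categories))
print(code_to_label)

# Codes for the rows in their original order.
print(bridge_series.cat.codes.tolist())
```

Here ‘Cable’ receives code 2 and ‘Truss’ receives code 6, which differs from a running sequence in listing order.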

Using the scikit-learn library approach:

Another common way in which many data analysts perform label encoding is with the scikit-learn library.

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

# creating initial dataframe
bridge_types = ('Arch','Beam','Truss','Cantilever','Tied Arch','Suspension','Cable')
bridge_df = pd.DataFrame(bridge_types, columns=['Bridge_Types'])

# creating instance of labelencoder
labelencoder = LabelEncoder()

# assigning numerical values and storing them in another column
bridge_df['Bridge_Types_Cat'] = labelencoder.fit_transform(bridge_df['Bridge_Types'])
bridge_df
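After fitting, it can be useful to look at which label received which code and to map codes back to labels. The following sketch uses the encoder's `classes_` attribute and `inverse_transform`; the variable names are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

bridge_types = ["Arch", "Beam", "Truss", "Cantilever", "Tied Arch", "Suspension", "Cable"]

encoder = LabelEncoder()
codes = encoder.fit_transform(bridge_types)

# classes_ holds the sorted labels; a label's position in it is its code.
label_to_code = {str(label): code for code, label in enumerate(encoder.classes_)}
print(label_to_code)

# inverse_transform maps the codes back to the original labels.
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```

As with category codes, the labels are sorted before codes are assigned, so the result is alphabetical rather than in order of appearance.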

One-Hot Encoder

Though label encoding is straightforward, it has the disadvantage that algorithms can misinterpret the numeric values as having some sort of hierarchy/order. This ordering issue is addressed by another common approach called ‘One-Hot Encoding’. In this strategy, each category value is converted into a new column, and each row is assigned a 1 or 0 (notation for true/false) in that column. Let’s consider the previous example of bridge type and safety levels with one-hot encoding.

For the categorical column ‘Bridge-Type’, one-hot encoding yields one column per bridge type. The ‘Safety-Level’ column is encoded in the same way, with one column per safety level.

Rows that have the first column’s value (Arch/None) will have ‘1’ (indicating true) in that column and ‘0’ (indicating false) in the columns of the other values. The same holds for the other rows: each has a ‘1’ only in the column that matches its value.

Though this approach eliminates the hierarchy/order issue, it has the downside of adding more columns to the data set. The number of columns can expand greatly if a category column has many unique values. In the above example it was manageable, but it becomes really challenging when the encoding produces many columns.
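The column growth is easy to observe on a made-up high-cardinality column. The sketch below assumes a hypothetical `Customer_ID` column with 1,000 distinct values and uses `pd.get_dummies` to encode it.

```python
import pandas as pd

# Hypothetical high-cardinality column: 1,000 distinct identifiers.
ids = pd.DataFrame({"Customer_ID": [f"id_{i}" for i in range(1000)]})

# One-hot encoding creates one new column per distinct value.
wide = pd.get_dummies(ids, columns=["Customer_ID"])

print(ids.shape, "->", wide.shape)
```

One input column turns into as many columns as there are distinct values, so the width of the data set grows with the cardinality of the category.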

One-Hot Encoding in Python

Using the scikit-learn library approach:

In older versions of scikit-learn (before 0.20), OneHotEncoder accepted only numerical categorical values, so any string values had to be label encoded before being one-hot encoded. Current versions accept string columns directly. The example here nevertheless reuses the dataframe from the previous example and applies OneHotEncoder to the label-encoded column Bridge_Types_Cat.

import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# creating instance of one-hot-encoder
enc = OneHotEncoder(handle_unknown='ignore')

# passing bridge-types-cat column (label encoded values of bridge_types)
enc_df = pd.DataFrame(enc.fit_transform(bridge_df[['Bridge_Types_Cat']]).toarray())

# merge with main df bridge_df on key values
bridge_df = bridge_df.join(enc_df)
bridge_df

The column ‘Bridge_Types_Cat’ can then be dropped from the dataframe.

Using dummies values approach:

This approach is more flexible because it lets you encode as many category columns as you like and choose how to label the new columns using a prefix. Proper naming makes the rest of the analysis a little easier.

import pandas as pd
import numpy as np

# creating initial dataframe
bridge_types = ('Arch','Beam','Truss','Cantilever','Tied Arch','Suspension','Cable')
bridge_df = pd.DataFrame(bridge_types, columns=['Bridge_Types'])

# generate binary values using get_dummies
dum_df = pd.get_dummies(bridge_df, columns=["Bridge_Types"], prefix=["Type_is"])

# merge with main df bridge_df on key values
bridge_df = bridge_df.join(dum_df)
bridge_df
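To illustrate the flexibility mentioned above, the sketch below encodes two columns in one call and gives each its own prefix through a dictionary. The sample rows and the `Safety_is` prefix are made up for illustration; `dtype=int` is used so the output shows 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical sample with two categorical columns.
df = pd.DataFrame(
    {
        "Bridge_Types": ["Arch", "Beam", "Cable"],
        "Safety_Level": ["low", "high", "low"],
    }
)

# Encode both columns at once; the dict maps each column to its prefix.
dummies = pd.get_dummies(
    df,
    columns=["Bridge_Types", "Safety_Level"],
    prefix={"Bridge_Types": "Type_is", "Safety_Level": "Safety_is"},
    dtype=int,
)
print(dummies)
```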

Conclusion

It is important to understand the various options for encoding categorical variables, because each approach has its own pros and cons. Encoding is an important step in data science, so I really encourage you to keep these ideas in mind when dealing with categorical variables. For any suggestions, or for more details on the code used in this article, feel free to comment.


sklearn.preprocessing.LabelEncoder

Encode target labels with value between 0 and n_classes-1.

This transformer should be used to encode target values, i.e. y, and not the input X.
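The distinction between encoding y and encoding X can be sketched as follows: OrdinalEncoder works on the 2D feature matrix, while LabelEncoder works on the 1D target. The sample data and variable names are made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical data: one categorical feature column and a target vector.
X = [["Arch"], ["Beam"], ["Arch"], ["Cable"]]
y = ["safe", "unsafe", "safe", "unsafe"]

# Features (2D input): OrdinalEncoder encodes each column separately.
X_encoded = OrdinalEncoder().fit_transform(X)

# Target (1D input): LabelEncoder is intended for y.
y_encoded = LabelEncoder().fit_transform(y)

print(X_encoded.ravel().tolist())
print(y_encoded.tolist())
```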

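Attributes : classes_ ndarray of shape (n_classes,)

Holds the label for each class.

See also

OrdinalEncoder : Encode categorical features using an ordinal encoding scheme.

OneHotEncoder : Encode categorical features as a one-hot numeric array.

Examples

LabelEncoder can be used to normalize labels.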

>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit([1, 2, 2, 6])
LabelEncoder()
>>> le.classes_
array([1, 2, 6])
>>> le.transform([1, 1, 2, 6])
array([0, 0, 1, 2]...)
>>> le.inverse_transform([0, 0, 1, 2])
array([1, 1, 2, 6])

It can also be used to transform non-numerical labels (as long as they are hashable and comparable) to numerical labels.

>>> le = preprocessing.LabelEncoder()
>>> le.fit(["paris", "paris", "tokyo", "amsterdam"])
LabelEncoder()
>>> list(le.classes_)
['amsterdam', 'paris', 'tokyo']
>>> le.transform(["tokyo", "tokyo", "paris"])
array([2, 2, 1]...)
>>> list(le.inverse_transform([2, 2, 1]))
['tokyo', 'tokyo', 'paris']
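One behavior the examples above do not show is what happens with a label that was not present during fitting: `transform` raises a `ValueError`. The sketch below demonstrates this; the label "berlin" is just an illustrative unseen value.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["paris", "tokyo", "amsterdam"])

# "berlin" was not seen during fit, so transform rejects it.
try:
    le.transform(["berlin"])
    unseen_rejected = False
except ValueError as err:
    unseen_rejected = True
    print(f"transform failed: {err}")
```

In practice this means the encoder should be fitted on every label that can occur, or unseen labels need to be handled before calling `transform`.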

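Methods

fit(y) : Fit label encoder.

fit_transform(y) : Fit label encoder and return encoded labels.

get_metadata_routing() : Get metadata routing of this object.

get_params([deep]) : Get parameters for this estimator.

inverse_transform(y) : Transform labels back to original encoding.

set_output(*[, transform]) : Set output container.

set_params(**params) : Set the parameters of this estimator.

transform(y) : Transform labels to normalized encoding.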

fit(y)

Fit label encoder.

Parameters : y array-like of shape (n_samples,)

Returns : self returns an instance of self.

fit_transform(y)

Fit label encoder and return encoded labels.

Parameters : y array-like of shape (n_samples,)

Returns : y array-like of shape (n_samples,)

get_metadata_routing()

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns : routing MetadataRequest

A MetadataRequest encapsulating routing information.

get_params(deep=True)

Get parameters for this estimator.

Parameters : deep bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns : params dict

Parameter names mapped to their values.

inverse_transform(y)

Transform labels back to original encoding.

Parameters : y ndarray of shape (n_samples,)

Returns : y ndarray of shape (n_samples,)

set_output(*, transform=None)

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters : transform {"default", "pandas"}, default=None

Configure output of transform and fit_transform.

"default" : Default output format of a transformer
"pandas" : DataFrame output
None : Transform configuration is unchanged

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters : **params dict

Returns : self estimator instance
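The `<component>__<parameter>` form can be sketched with a small Pipeline. The step names "encoder" and "clf" and the chosen estimators are arbitrary and used only for illustration.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Step names ("encoder", "clf") are arbitrary labels chosen for this sketch.
pipe = Pipeline([("encoder", OneHotEncoder()), ("clf", LogisticRegression())])

# Nested parameters are addressed as <step name>__<parameter name>.
pipe.set_params(encoder__handle_unknown="ignore", clf__C=0.5)

print(pipe.get_params()["encoder__handle_unknown"])
print(pipe.named_steps["clf"].C)
```

The same naming scheme is what `get_params(deep=True)` returns as keys, so the two methods can be used together to inspect and update a nested estimator.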

transform(y)

Transform labels to normalized encoding.

Parameters : y array-like of shape (n_samples,)

Returns : y array-like of shape (n_samples,)

Labels as normalized encodings.
