Python pandas DataFrame: removing duplicates

Drop Duplicates from a Pandas DataFrame

While working with data there can be situations where your dataframe has duplicate rows. Knowing how to remove such rows quickly can be quite handy. In this tutorial, we’ll look at how to drop duplicates from a pandas dataframe through some examples.

The drop_duplicates() function

The pandas dataframe drop_duplicates() function removes duplicate rows from a dataframe. It also lets you identify duplicates based on only certain columns through the subset parameter. The following is its syntax:

    df.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)


It returns a dataframe with the duplicate rows removed. By default, it drops all duplicates except the first occurrence. You can change this behavior through the keep parameter, which takes 'first', 'last', or False. To modify the dataframe in place instead, pass inplace=True; in that case the function returns None.


Examples

Let’s look at some use cases of the drop_duplicates() function through examples.

1. Drop duplicate rows based on all columns

By default, the drop_duplicates() function identifies duplicates by taking all the columns into consideration. It then drops the duplicate rows and keeps only their first occurrence.

    import pandas as pd

    # create a sample dataframe with duplicate rows
    data = {
        'Pet': ['Cat', 'Dog', 'Dog', 'Dog', 'Cat'],
        'Color': ['Brown', 'Golden', 'Golden', 'Golden', 'Black'],
        'Eyes': ['Black', 'Black', 'Black', 'Brown', 'Green']
    }
    df = pd.DataFrame(data)

    # print the dataframe
    print("The original dataframe:\n")
    print(df)

    # drop duplicates
    df_unique = df.drop_duplicates()
    print("\nAfter dropping duplicates:\n")
    print(df_unique)

Output:

    The original dataframe:

       Pet   Color   Eyes
    0  Cat   Brown  Black
    1  Dog  Golden  Black
    2  Dog  Golden  Black
    3  Dog  Golden  Brown
    4  Cat   Black  Green

    After dropping duplicates:

       Pet   Color   Eyes
    0  Cat   Brown  Black
    1  Dog  Golden  Black
    3  Dog  Golden  Brown
    4  Cat   Black  Green

In the above example, you can see that the rows with index 1 and 2 have the same values for all three columns. On applying the drop_duplicates() function, the first of these rows is retained and the other is dropped. As a result, the returned dataframe does not have a continuous index. If you want the returned dataframe to have a continuous index, pass ignore_index=True to the drop_duplicates() function or reset the index of the returned dataframe.
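The two ways of getting a continuous index mentioned above can be sketched as follows. This is a minimal illustration on the same sample data; the variable names df_relabelled and df_reset are chosen here for the example, and ignore_index is assumed to be available (it was added in pandas 1.0).

```python
import pandas as pd

# Same sample data as in the example above.
df = pd.DataFrame({
    'Pet': ['Cat', 'Dog', 'Dog', 'Dog', 'Cat'],
    'Color': ['Brown', 'Golden', 'Golden', 'Golden', 'Black'],
    'Eyes': ['Black', 'Black', 'Black', 'Brown', 'Green'],
})

# Option 1: let drop_duplicates() relabel the remaining rows 0..n-1.
df_relabelled = df.drop_duplicates(ignore_index=True)

# Option 2: drop duplicates first, then discard the old index.
df_reset = df.drop_duplicates().reset_index(drop=True)

print(df_relabelled.index.tolist())  # [0, 1, 2, 3]
print(df_relabelled.equals(df_reset))  # True
```

Both options should give the same result here; reset_index(drop=True) is the one to reach for on pandas versions older than 1.0.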

2. Drop duplicate rows based on certain columns

You can also instruct the drop_duplicates() function to identify the duplicates based on only certain columns by passing them as a list to the subset argument.

    import pandas as pd

    # create a sample dataframe with duplicate rows
    data = {
        'Pet': ['Cat', 'Dog', 'Dog', 'Dog', 'Cat'],
        'Color': ['Brown', 'Golden', 'Golden', 'Golden', 'Black'],
        'Eyes': ['Black', 'Black', 'Black', 'Brown', 'Green']
    }
    df = pd.DataFrame(data)

    # print the dataframe
    print("The original dataframe:\n")
    print(df)

    # drop duplicates based on the Pet and Color columns only
    df_unique = df.drop_duplicates(subset=['Pet', 'Color'])
    print("\nAfter dropping duplicates:\n")
    print(df_unique)

Output:

    The original dataframe:

       Pet   Color   Eyes
    0  Cat   Brown  Black
    1  Dog  Golden  Black
    2  Dog  Golden  Black
    3  Dog  Golden  Brown
    4  Cat   Black  Green

    After dropping duplicates:

       Pet   Color   Eyes
    0  Cat   Brown  Black
    1  Dog  Golden  Black
    4  Cat   Black  Green

In the above example, we identify the duplicates based on just the columns Pet and Color by passing them as a list to the subset argument of the drop_duplicates() function. With these criteria, the rows with index 1, 2, and 3 are duplicates of each other, and the returned dataframe retains only the first of them.

3. Remove duplicates and retain the last occurrence

If you want to retain the last duplicate row instead of the first one, pass keep='last' to the drop_duplicates() function.

    import pandas as pd

    # create a sample dataframe with duplicate rows
    data = {
        'Pet': ['Cat', 'Dog', 'Dog', 'Dog', 'Cat'],
        'Color': ['Brown', 'Golden', 'Golden', 'Golden', 'Black'],
        'Eyes': ['Black', 'Black', 'Black', 'Brown', 'Green']
    }
    df = pd.DataFrame(data)

    # print the dataframe
    print("The original dataframe:\n")
    print(df)

    # drop duplicates, keeping the last occurrence
    df_unique = df.drop_duplicates(keep='last')
    print("\nAfter dropping duplicates:\n")
    print(df_unique)

Output:

    The original dataframe:

       Pet   Color   Eyes
    0  Cat   Brown  Black
    1  Dog  Golden  Black
    2  Dog  Golden  Black
    3  Dog  Golden  Brown
    4  Cat   Black  Green

    After dropping duplicates:

       Pet   Color   Eyes
    0  Cat   Brown  Black
    2  Dog  Golden  Black
    3  Dog  Golden  Brown
    4  Cat   Black  Green

In the above example, we retain the last duplicate instead of the first one.

4. Remove duplicates and do not retain any occurrences

If you do not want to retain any of the duplicate rows, pass keep=False to the drop_duplicates() function.

    import pandas as pd

    # create a sample dataframe with duplicate rows
    data = {
        'Pet': ['Cat', 'Dog', 'Dog', 'Dog', 'Cat'],
        'Color': ['Brown', 'Golden', 'Golden', 'Golden', 'Black'],
        'Eyes': ['Black', 'Black', 'Black', 'Brown', 'Green']
    }
    df = pd.DataFrame(data)

    # print the dataframe
    print("The original dataframe:\n")
    print(df)

    # drop every row that has a duplicate
    df_unique = df.drop_duplicates(keep=False)
    print("\nAfter dropping duplicates:\n")
    print(df_unique)

Output:

    The original dataframe:

       Pet   Color   Eyes
    0  Cat   Brown  Black
    1  Dog  Golden  Black
    2  Dog  Golden  Black
    3  Dog  Golden  Brown
    4  Cat   Black  Green

    After dropping duplicates:

       Pet   Color   Eyes
    0  Cat   Brown  Black
    3  Dog  Golden  Brown
    4  Cat   Black  Green

In the above example, none of the duplicates are retained.
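To see which rows keep=False treats as duplicates before removing them, one option is the related DataFrame.duplicated() method, which returns a boolean mask rather than a filtered dataframe. The sketch below uses the same sample data; the names mask and df_filtered are chosen here for illustration.

```python
import pandas as pd

# Same sample data as in the examples above.
df = pd.DataFrame({
    'Pet': ['Cat', 'Dog', 'Dog', 'Dog', 'Cat'],
    'Color': ['Brown', 'Golden', 'Golden', 'Golden', 'Black'],
    'Eyes': ['Black', 'Black', 'Black', 'Brown', 'Green'],
})

# duplicated(keep=False) marks every row that has at least one identical twin.
mask = df.duplicated(keep=False)
print(mask.tolist())  # [False, True, True, False, False]

# Filtering those rows out should match drop_duplicates(keep=False).
df_filtered = df[~mask]
print(df_filtered.equals(df.drop_duplicates(keep=False)))  # True
```

Inspecting df[mask] first can be a useful sanity check when it is not obvious how many rows a keep=False call would discard.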

For more on the pandas dataframe drop_duplicates() function refer to its official documentation.

With this, we come to the end of this tutorial. The code examples and results presented in this tutorial were produced in a Jupyter Notebook with a Python (version 3.8.3) kernel and pandas version 1.0.5.


Author

Piyush is a data professional passionate about using data to understand things better and make informed decisions. He has experience working as a Data Scientist in the consulting domain and holds an engineering degree from IIT Roorkee. His hobbies include watching cricket, reading, and working on side projects.

Data Science Parichay is an educational website offering easy-to-understand tutorials on topics in Data Science with the help of clear and fun examples.


pandas.DataFrame.drop_duplicates

Return DataFrame with duplicate rows removed.

Considering certain columns is optional. Indexes, including time indexes, are ignored.
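The statement that indexes are ignored can be illustrated with a small sketch. The dataframe df_ts, its column names, and its timestamps are made up for this example; they are not part of the reference text.

```python
import pandas as pd

# Hypothetical data: two rows with identical values but different timestamps
# in the index.
df_ts = pd.DataFrame(
    {'sensor': ['a', 'a', 'b'], 'reading': [1.0, 1.0, 2.0]},
    index=pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-03']),
)

# The index is not part of the comparison, so the second row is dropped.
deduped = df_ts.drop_duplicates()
print(len(deduped))  # 2

# To take the index into account, one option is to turn it into a column first.
deduped_with_index = df_ts.reset_index().drop_duplicates()
print(len(deduped_with_index))  # 3
```

Moving the index into a column with reset_index() is only one possible workaround; it is shown here because it reuses methods already mentioned in this document.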

Parameters

subset : column label or sequence of labels, optional
    Only consider certain columns for identifying duplicates, by default use all of the columns.

keep : {'first', 'last', False}, default 'first'
    Determines which duplicates (if any) to keep.

      • 'first' : Drop duplicates except for the first occurrence.
      • 'last' : Drop duplicates except for the last occurrence.
      • False : Drop all duplicates.

inplace : bool, default False
    Whether to modify the DataFrame rather than creating a new one.

ignore_index : bool, default False
    If True, the resulting axis will be labeled 0, 1, …, n - 1.

Returns

DataFrame or None
    DataFrame with duplicates removed or None if inplace=True.
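The difference between the two return values can be sketched as follows. The small dataframe df_small and the variable names are chosen here for illustration only.

```python
import pandas as pd

# Hypothetical three-row frame in which the first two rows are identical.
df_small = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie'],
    'style': ['cup', 'cup', 'cup'],
})

# Default: a new DataFrame is returned and df_small is left untouched.
result = df_small.drop_duplicates()
print(len(df_small), len(result))  # 3 2

# inplace=True: df_small itself is modified and the call returns None.
returned = df_small.drop_duplicates(inplace=True)
print(returned is None, len(df_small))  # True 2
```

A common pitfall this highlights: writing df_small = df_small.drop_duplicates(inplace=True) would rebind the name to None.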

See also: DataFrame.value_counts, which counts unique combinations of columns.

Examples

Consider a dataset containing ramen ratings.

>>> df = pd.DataFrame({
...     'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
...     'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
...     'rating': [4, 4, 3.5, 15, 5]
... })
>>> df
     brand style  rating
0  Yum Yum   cup     4.0
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0

By default, it removes duplicate rows based on all columns.

>>> df.drop_duplicates()
     brand style  rating
0  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0

To remove duplicates on specific column(s), use subset.

>>> df.drop_duplicates(subset=['brand'])
     brand style  rating
0  Yum Yum   cup     4.0
2  Indomie   cup     3.5

To remove duplicates and keep the last occurrences, use keep.

>>> df.drop_duplicates(subset=['brand', 'style'], keep='last')
     brand style  rating
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
4  Indomie  pack     5.0


How to Drop Duplicate Rows in a Pandas DataFrame

The easiest way to drop duplicate rows in a pandas DataFrame is to use the drop_duplicates() function, which uses the following syntax:

    df.drop_duplicates(subset=None, keep='first', inplace=False)

  • subset: Which columns to consider when identifying duplicates. Default is all columns.
  • keep: Indicates which duplicates (if any) to keep.
      • 'first': Drop all duplicate rows except the first.
      • 'last': Drop all duplicate rows except the last.
      • False: Drop all duplicates.
  • inplace: Indicates whether to drop duplicates in place or return a copy of the DataFrame.

This tutorial provides several examples of how to use this function in practice on the following DataFrame:

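    import pandas as pd

    # create DataFrame
    df = pd.DataFrame({'team': ['a', 'b', 'b', 'c', 'c', 'd'],
                       'points': [3, 7, 7, 8, 8, 9],
                       'assists': [8, 6, 7, 9, 9, 3]})

    # display DataFrame
    print(df)

      team  points  assists
    0    a       3        8
    1    b       7        6
    2    b       7        7
    3    c       8        9
    4    c       8        9
    5    d       9        3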

Example 1: Drop duplicates across all columns

The following code shows how to drop rows that have duplicate values across all columns:

    df.drop_duplicates()

      team  points  assists
    0    a       3        8
    1    b       7        6
    2    b       7        7
    3    c       8        9
    5    d       9        3

By default, the drop_duplicates() function drops all duplicates except the first.

However, we could use the keep=False argument to drop all duplicates entirely:

    df.drop_duplicates(keep=False)

      team  points  assists
    0    a       3        8
    1    b       7        6
    2    b       7        7
    5    d       9        3

Example 2: Drop duplicates across specific columns

The following code shows how to drop rows that have duplicate values only in the columns named team and points:

    df.drop_duplicates(subset=['team', 'points'])

      team  points  assists
    0    a       3        8
    1    b       7        6
    3    c       8        9
    5    d       9        3
