Python groupby with a condition

Contents

Pandas: How to Use Groupby and Count with a Condition
Example: Groupby and Count with a Condition in Pandas
pandas groupby filter by column values and conditional aggregation
Create a test dataframe
Groupby continent and sum the GDP of countries who are G20 Member
Using pandas assign to filter the groupby columns and apply conditional sum
Pandas — Groupby with conditional formula
3 Answers

Pandas: How to Use Groupby and Count with a Condition

You can use the following basic syntax to group rows and count them with a condition in a pandas DataFrame:

df.groupby('var1')['var2'].apply(lambda x: (x == 'val').sum()).reset_index(name='count')

This syntax groups the rows of the DataFrame by var1 and then counts the rows in each group where var2 equals 'val'.

The following example shows how to use this syntax in practice.

Example: Groupby and Count with a Condition in Pandas

Suppose we have the following pandas DataFrame that contains information about various basketball players:

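import pandas as pd

# create DataFrame
df = pd.DataFrame({'team': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'pos': ['Gu', 'Fo', 'Fo', 'Fo', 'Gu', 'Gu', 'Fo', 'Fo'],
                   'points': [18, 22, 19, 14, 14, 11, 20, 28]})

# view DataFrame
print(df)

  team pos  points
0    A  Gu      18
1    A  Fo      22
2    A  Fo      19
3    A  Fo      14
4    B  Gu      14
5    B  Gu      11
6    B  Fo      20
7    B  Fo      28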

The following code shows how to group the DataFrame by the team variable and count the rows where the pos variable equals 'Gu':

# group by team and count the rows where 'pos' equals 'Gu'
df_count = df.groupby('team')['pos'].apply(lambda x: (x == 'Gu').sum()).reset_index(name='count')

# view results
print(df_count)

  team  count
0    A      1
1    B      2

Team A has 1 row in which the pos column equals 'Gu'.
Team B has 2 rows in which the pos column equals 'Gu'.

We can use similar syntax to group and count with a numeric condition.

For example, the following code shows how to group by the team variable and count the rows where the value of the points variable is greater than 15:

# group by team and count the rows where 'points' is greater than 15
df_count = df.groupby('team')['points'].apply(lambda x: (x > 15).sum()).reset_index(name='count')

# view results
print(df_count)

  team  count
0    A      3
1    B      2

Team A has 3 rows in which the points column is greater than 15.
Team B has 2 rows in which the points column is greater than 15.

You can use similar syntax to group and count with any condition you like.
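As a sketch of one alternative that avoids apply: build the boolean mask first, then group that mask by the team column and sum it. The DataFrame is rebuilt here so the block is self-contained; the names count_gu, mask and count_fo_high are illustrative and not part of the original tutorial.

```python
import pandas as pd

# Same data as the basketball example above
df = pd.DataFrame({
    'team': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'pos': ['Gu', 'Fo', 'Fo', 'Fo', 'Gu', 'Gu', 'Fo', 'Fo'],
    'points': [18, 22, 19, 14, 14, 11, 20, 28],
})

# Compare first, then sum the boolean Series per team (True counts as 1)
count_gu = (df['pos'] == 'Gu').groupby(df['team']).sum().reset_index(name='count')

# Several conditions can be combined with & and | before grouping
mask = (df['pos'] == 'Fo') & (df['points'] > 15)
count_fo_high = mask.groupby(df['team']).sum().reset_index(name='count')

print(count_gu)
print(count_fo_high)
```

Working on a single boolean Series keeps the condition in one place, which can be easier to read once the condition involves more than one column.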


pandas groupby filter by column values and conditional aggregation

In this post, we will learn how to filter column values within a pandas groupby and apply conditional aggregations such as sum, count, and average.

We will first create a dataframe with four columns: the first is the continent, the second is the country, and the third and fourth hold the GDP value in trillions and the G20 membership flag, respectively. These are fake numbers and don't represent real GDP figures.

Once this dataframe is created, we will group the countries by continent and apply conditions to compute the GDP sum of the countries that are G20 members and of those that aren't.

Create a test dataframe

Let's create a dataframe with all four columns: continent, country, GDP(trillion) and Member_G20.

For the third column, GDP(trillion), I'm using the numpy randint function to create random numbers for all these countries. Similarly, for the fourth column, Member_G20, np.random.choice randomly selects from the list ['Y', 'N'].

import pandas as pd
import numpy as np

# the dict braces were missing from the original snippet
df = pd.DataFrame({
    'continent': ['Asia', 'NorthAmerica', 'NorthAmerica', 'Europe', 'Europe',
                  'Europe', 'Asia', 'Europe', 'Asia'],
    'country': ['China', 'USA', 'Canada', 'Poland', 'Romania',
                'Italy', 'India', 'Germany', 'Russia'],
    'GDP(trillion)': np.random.randint(1, 9, 9),      # random integers from 1 to 8
    'Member_G20': np.random.choice(['Y', 'N'], 9),    # random 'Y'/'N' flags
})
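Because the values are random, every run produces different sums. A minimal sketch of a reproducible variant, assuming NumPy's Generator API and an arbitrary seed of 42 (the seed and the name rng are choices made here, not part of the original post):

```python
import numpy as np
import pandas as pd

# Fixed seed so the same "random" table is produced on every run
rng = np.random.default_rng(42)

df = pd.DataFrame({
    'continent': ['Asia', 'NorthAmerica', 'NorthAmerica', 'Europe', 'Europe',
                  'Europe', 'Asia', 'Europe', 'Asia'],
    'country': ['China', 'USA', 'Canada', 'Poland', 'Romania',
                'Italy', 'India', 'Germany', 'Russia'],
    'GDP(trillion)': rng.integers(1, 9, size=9),    # integers from 1 to 8
    'Member_G20': rng.choice(['Y', 'N'], size=9),   # random 'Y'/'N' flags
})

print(df)
```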

Groupby continent and sum the GDP of countries who are G20 Member

So we will first group by continent and then filter the rows in each group where a country is a G20 member

df.groupby(['continent']).apply(lambda x: x[x['Member_G20'] == 'Y' ]['GDP(trillion)'].sum())

continent
Asia            19
Europe           5
NorthAmerica     5
dtype: int64
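The same totals can also be sketched without apply by filtering the rows first and grouping afterwards. The GDP and membership values below are hypothetical, picked only so that the sums match the output shown above; the post's own values are random.

```python
import pandas as pd

# Hypothetical fixed values (the original post uses random numbers)
df = pd.DataFrame({
    'continent': ['Asia', 'NorthAmerica', 'NorthAmerica', 'Europe', 'Europe',
                  'Europe', 'Asia', 'Europe', 'Asia'],
    'country': ['China', 'USA', 'Canada', 'Poland', 'Romania',
                'Italy', 'India', 'Germany', 'Russia'],
    'GDP(trillion)': [8, 5, 3, 2, 5, 4, 6, 7, 5],
    'Member_G20': ['Y', 'Y', 'N', 'N', 'Y', 'N', 'Y', 'N', 'Y'],
})

# Keep only G20 members, then sum GDP per continent
members = df[df['Member_G20'] == 'Y']
g20_sum = members.groupby('continent')['GDP(trillion)'].sum()

print(g20_sum)
```

One difference to keep in mind: a continent with no G20 members disappears from this result, whereas the apply version above reports it with a sum of 0.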

Let’s understand this by doing one step at a time:

First, we group by continent using the pandas groupby function:

grp = df.groupby('continent')

Next, we select one group from this groupby result; here we choose Europe. We can see all the rows within the Europe group, and three of its countries are not G20 members:

selected_group = grp.get_group('Europe')
selected_group

Now filter the rows by column Member_G20 and drop all countries that are not a G20 Member

selected_group[selected_group['Member_G20']=='Y']

  continent  country  GDP(trillion) Member_G20
4    Europe  Romania              5          Y

Finally, we take the GDP(trillion) column of this filtered group and compute its sum

selected_group[selected_group['Member_G20']=='Y']['GDP(trillion)'].sum()

Using pandas assign to filter the groupby columns and apply conditional sum

We can use pandas assign, which adds a new column to the dataframe, to filter by the column values first, then apply pandas groupby and finally aggregate the values. Let's see how it works.

Here we use pandas assign to create a new column and fill it with numpy where(): rows where the country is a G20 member get the GDP(trillion) value, and all other rows get 0.

df.assign(result=np.where(df['Member_G20'] == 'Y', df['GDP(trillion)'], 0)) \
  .groupby('continent').agg({'result': 'sum'})
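As a sketch of the same idea without a helper column, Series.where can zero out the non-member rows before grouping. The data values are hypothetical (the post uses random numbers), and the name result is only illustrative.

```python
import pandas as pd

# Hypothetical fixed values (the original post uses random numbers)
df = pd.DataFrame({
    'continent': ['Asia', 'NorthAmerica', 'NorthAmerica', 'Europe', 'Europe',
                  'Europe', 'Asia', 'Europe', 'Asia'],
    'GDP(trillion)': [8, 5, 3, 2, 5, 4, 6, 7, 5],
    'Member_G20': ['Y', 'Y', 'N', 'N', 'Y', 'N', 'Y', 'N', 'Y'],
})

# Keep GDP where the country is a member, replace it with 0 elsewhere,
# then group the resulting Series by the continent column and sum
result = (
    df['GDP(trillion)']
    .where(df['Member_G20'] == 'Y', 0)
    .groupby(df['continent'])
    .sum()
)

print(result)
```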

Let's take another example: suppose we want the GDP(trillion) sum of the countries that are not G20 members as well as the sum of those that are. In this case we create two columns, then apply groupby and aggregate (sum) the values.

df.assign(
    gdp_sum_non_member_g20=np.where(df['Member_G20'] == 'N', df['GDP(trillion)'], 0),
    gdp_sum_member_g20=np.where(df['Member_G20'] == 'Y', df['GDP(trillion)'], 0)
).groupby('continent').agg({'gdp_sum_non_member_g20': 'sum',
                            'gdp_sum_member_g20': 'sum'})
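When every value of the flag column should become its own output column, a possible alternative is to group by both keys and unstack the flag. This is a sketch with hypothetical data values (the post uses random numbers); totals is an illustrative name.

```python
import pandas as pd

# Hypothetical fixed values (the original post uses random numbers)
df = pd.DataFrame({
    'continent': ['Asia', 'NorthAmerica', 'NorthAmerica', 'Europe', 'Europe',
                  'Europe', 'Asia', 'Europe', 'Asia'],
    'GDP(trillion)': [8, 5, 3, 2, 5, 4, 6, 7, 5],
    'Member_G20': ['Y', 'Y', 'N', 'N', 'Y', 'N', 'Y', 'N', 'Y'],
})

# Sum GDP per (continent, membership) pair, then pivot the membership
# flag into columns; missing combinations are filled with 0
totals = (
    df.groupby(['continent', 'Member_G20'])['GDP(trillion)']
      .sum()
      .unstack(fill_value=0)
)

print(totals)
```

The result has one row per continent and one column per flag value ('N' and 'Y'), so no np.where call is needed for each category.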

Updated: January 7, 2022


Pandas — Groupby with conditional formula

Given the above dataframe, is there an elegant way to groupby with a condition? I want to split the data into two groups based on the following conditions:

(df['SibSp'] > 0) | (df['Parch'] > 0) = New Group - "Has Family"
(df['SibSp'] == 0) & (df['Parch'] == 0) = New Group - "No Family"

 SurvivedMean Has Family Mean No Family Mean

Can it be done using groupby or would I have to append a new column using the above conditional statement?

Is your df coded in binary? If so, you may be able to use the pandas method get_dummies . Otherwise, yes, I would recommend/think you should create a new column (you would only need one I think) to perform the groupby on. I can help write some code if I have a better idea of what you’re doing! Also, given your desired output, it seems like you will need to pivot the db as well!

3 Answers

An easy way to group that is to use the sum of those two columns. If either of them is positive, the sum will be at least 1 (assuming the counts are non-negative integers). And groupby accepts an arbitrary array as long as its length is the same as the DataFrame's length, so you don't need to add a new column.

family = np.where((df['SibSp'] + df['Parch']) >= 1, 'Has Family', 'No Family')
df.groupby(family)['Survived'].mean()

Out:
Has Family    0.5
No Family     1.0
Name: Survived, dtype: float64

Use only one condition if the values in columns SibSp and Parch are never less than 0:

m1 = (df['SibSp'] > 0) | (df['Parch'] > 0)

# store the result under a new name so the original df is not overwritten
result = df.groupby(np.where(m1, 'Has Family', 'No Family'))['Survived'].mean()
print(result)

Has Family    0.5
No Family     1.0
Name: Survived, dtype: float64

If that cannot be guaranteed, use both conditions:

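m1 = (df['SibSp'] > 0) | (df['Parch'] > 0)
m2 = (df['SibSp'] == 0) & (df['Parch'] == 0)

# rows matching neither condition get the label 'Not'
a = np.where(m1, 'Has Family', np.where(m2, 'No Family', 'Not'))

# store the result under a new name so the original df is not overwritten
result = df.groupby(a)['Survived'].mean()
print(result)

Has Family    0.5
No Family     1.0
Name: Survived, dtype: float64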
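With more than two conditions, nested np.where calls get hard to read. A sketch of the same labelling with np.select follows; the three-row DataFrame is hypothetical (the question's data is not shown here) and was chosen so the means match the output above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; the question's actual DataFrame is not shown
df = pd.DataFrame({
    'Survived': [0, 1, 1],
    'SibSp': [1, 1, 0],
    'Parch': [0, 0, 0],
})

conditions = [
    (df['SibSp'] > 0) | (df['Parch'] > 0),     # has family
    (df['SibSp'] == 0) & (df['Parch'] == 0),   # no family
]
labels = ['Has Family', 'No Family']

# The first matching condition wins; rows matching none get the default
family = np.select(conditions, labels, default='Not')

result = df.groupby(family)['Survived'].mean()
print(result)
```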

You could define your conditions in a list and use the function group_by_condition below to create a filtered list for each condition. Note that df here is a plain Python list of row dicts, not a pandas DataFrame. Afterwards you can unpack the resulting lists into separate names:

df = [ , , ]  # row dicts omitted in the source

conditions = [
    lambda x: (x['SibSp'] > 0) or (x['Parch'] > 0),     # has family
    lambda x: (x['SibSp'] == 0) and (x['Parch'] == 0)   # no family
]

def group_by_condition(l, conditions):
    return [[item for item in l if condition(item)] for condition in conditions]

[has_family, no_family] = group_by_condition(df, conditions)
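A runnable sketch of this approach, using hypothetical row dicts in place of the omitted data. The names records and mean_survived are made up for illustration.

```python
# Hypothetical rows standing in for the data omitted above
records = [
    {'Survived': 0, 'SibSp': 1, 'Parch': 0},
    {'Survived': 1, 'SibSp': 1, 'Parch': 0},
    {'Survived': 1, 'SibSp': 0, 'Parch': 0},
]

conditions = [
    lambda x: (x['SibSp'] > 0) or (x['Parch'] > 0),     # has family
    lambda x: (x['SibSp'] == 0) and (x['Parch'] == 0),  # no family
]

def group_by_condition(items, conditions):
    """Return one filtered list per condition, in the same order."""
    return [[item for item in items if condition(item)] for condition in conditions]

def mean_survived(rows):
    """Average of the 'Survived' field over a non-empty list of rows."""
    return sum(row['Survived'] for row in rows) / len(rows)

has_family, no_family = group_by_condition(records, conditions)

print(mean_survived(has_family))
print(mean_survived(no_family))
```

This works on plain Python data without pandas, but each condition scans the whole list, so it is better suited to small inputs.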
