Pandas Python: Finding a Substring in a String

Check For a Substring in a Pandas DataFrame Column

Looking for strings to cut down your dataset for analysis and machine learning

The Pandas library is a comprehensive tool not only for crunching numbers but also for working with text data.

For many data analysis applications and machine learning exploration/pre-processing, you’ll want to either filter out or extract information from text data. To do so, Pandas offers a wide range of in-built methods that you can use to add, remove, and edit text columns in your DataFrames.

In this piece, let’s take a look specifically at searching for substrings in a DataFrame column. This may come in handy when you need to create a new category based on existing data (for example during feature engineering before training a machine learning model).

If you want to follow along, download the dataset here.

import pandas as pd
df = pd.read_csv('vgsales.csv')

NOTE: we’ll be using a lot of loc in this piece, so if you’re unfamiliar with that method, check out the first article linked at the very bottom of this piece.

Using “contains” to Find a Substring in a Pandas DataFrame

The contains method in Pandas lets you search a column for a specific substring. It returns a Series of boolean values: True where the original value contains the substring and False where it does not. A basic call looks like Series.str.contains("substring"). However, we can immediately take this to the next level with two additions:

  1. Using the case argument to specify whether to match on string case;
  2. Using the returned Series of boolean values as a mask to get a subset of the DataFrame.

Applying these two should look like this:

pokemon_games = df.loc[df['Name'].str.contains("pokemon", case=False)]
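In case the vgsales.csv file is not at hand, the same masking pattern can be tried on a small made-up frame. This is only a sketch: the rows below are illustrative stand-ins, not data from the actual dataset, and na=False is added on the assumption that a missing name should count as "no match".

```python
import pandas as pd

# Hypothetical stand-in for vgsales.csv; the rows are made up for illustration.
df = pd.DataFrame({
    "Name": ["Pokemon Red/Pokemon Blue", "Wii Sports", "POKEMON Gold", None],
    "Platform": ["GB", "Wii", "GB", "PS2"],
})

# Boolean mask: True where Name contains "pokemon" in any letter case.
# na=False turns the missing name into False so the mask is purely boolean.
mask = df["Name"].str.contains("pokemon", case=False, na=False)

# Use the mask with loc to keep only the matching rows.
pokemon_games = df.loc[mask]
print(pokemon_games)
```

Keeping the mask in its own variable makes it easy to inspect or to combine with other conditions before filtering.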


pandas.Series.str.contains

Test if pattern or regex is contained within a string of a Series or Index.

Return boolean Series or Index based on whether a given pattern or regex is contained within a string of a Series or Index.

Parameters

pat : str
    Character sequence or regular expression.

case : bool, default True
    If True, case sensitive.

flags : int, default 0 (no flags)
    Flags to pass through to the re module, e.g. re.IGNORECASE.

na : scalar, optional
    Fill value for missing values. The default depends on dtype of the array. For object-dtype, numpy.nan is used. For StringDtype, pandas.NA is used.

regex : bool, default True
    If True, assumes the pat is a regular expression.
    If False, treats the pat as a literal string.

Returns

Series or Index of boolean values
    A Series or Index of boolean values indicating whether the given pattern is contained within the string of each element of the Series or Index.

See also

Series.str.match : Analogous, but stricter, relying on re.match instead of re.search.
Series.str.startswith : Test if the start of each string element matches a pattern.
Series.str.endswith : Same as startswith, but tests the end of string.
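The difference between these related methods is easiest to see side by side. The sketch below uses a made-up three-element Series (not taken from the reference page) and compares where each method reports a hit.

```python
import pandas as pd

# Illustrative data: "dog" appears at the end, the start, or as the whole string.
s = pd.Series(["dog", "hotdog", "dogma"])

# contains: the pattern may appear anywhere in the string (re.search semantics).
anywhere = s.str.contains("dog")

# match: the pattern must match starting at the first character (re.match semantics).
from_start = s.str.match("dog")

# startswith / endswith: literal prefix and suffix tests (no regex).
prefix = s.str.startswith("dog")
suffix = s.str.endswith("dog")

print(pd.DataFrame({
    "value": s,
    "contains": anywhere,
    "match": from_start,
    "startswith": prefix,
    "endswith": suffix,
}))
```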

Returning a Series of booleans using only a literal pattern.

>>> s1 = pd.Series(['Mouse', 'dog', 'house and parrot', '23', np.nan])
>>> s1.str.contains('og', regex=False)
0    False
1     True
2    False
3    False
4      NaN
dtype: object

Returning an Index of booleans using only a literal pattern.

>>> ind = pd.Index(['Mouse', 'dog', 'house and parrot', '23.0', np.nan])
>>> ind.str.contains('23', regex=False)
Index([False, False, False, True, nan], dtype='object')

Specifying case sensitivity using case.

>>> s1.str.contains('oG', case=True, regex=True)
0    False
1    False
2    False
3    False
4      NaN
dtype: object

Specifying na to be False instead of NaN replaces NaN values with False. If Series or Index does not contain NaN values the resultant dtype will be bool, otherwise, an object dtype.

>>> s1.str.contains('og', na=False, regex=True)
0    False
1     True
2    False
3    False
4    False
dtype: bool
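A common reason to set na=False is that the result is meant to be used as a filter. Depending on the pandas version and the column dtype, missing values may otherwise show up as NaN in the result, which is not a clean boolean mask. The following is a minimal sketch using the same s1 values as above.

```python
import numpy as np
import pandas as pd

s1 = pd.Series(['Mouse', 'dog', 'house and parrot', '23', np.nan])

# na=False maps the missing entry to False, so every element is a real boolean.
mask = s1.str.contains('og', na=False)

# The mask can now be used directly for boolean indexing.
matches = s1[mask]
print(matches)
```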

Returning ‘house’ or ‘dog’ when either expression occurs in a string.

>>> s1.str.contains('house|dog', regex=True)
0    False
1     True
2     True
3    False
4      NaN
dtype: object

Ignoring case sensitivity using flags with regex.

>>> import re
>>> s1.str.contains('PARROT', flags=re.IGNORECASE, regex=True)
0    False
1    False
2     True
3    False
4      NaN
dtype: object

Returning any digit using regular expression.

>>> s1.str.contains('\\d', regex=True)
0    False
1    False
2    False
3     True
4      NaN
dtype: object

Ensure pat is not a literal pattern when regex is set to True. Note that in the following example one might expect only s2[1] and s2[3] to return True. However, '.0' as a regex matches any character followed by a 0.

>>> s2 = pd.Series(['40', '40.0', '41', '41.0', '35'])
>>> s2.str.contains('.0', regex=True)
0     True
1     True
2    False
3     True
4    False
dtype: bool
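There are two straightforward ways to get the literal behaviour one might have expected here. The sketch below shows both: turning regex matching off, or escaping the pattern with re.escape from the standard library. It reuses the s2 values from the example above.

```python
import re
import pandas as pd

s2 = pd.Series(['40', '40.0', '41', '41.0', '35'])

# Option 1: treat the pattern as plain text.
literal = s2.str.contains('.0', regex=False)

# Option 2: keep regex matching on, but escape special characters first.
# re.escape('.0') produces a pattern in which the dot only matches a literal dot.
escaped = s2.str.contains(re.escape('.0'), regex=True)

print(literal)
print(escaped)
```

Escaping is mainly useful when the literal text has to be combined with other regex syntax; for a plain substring search, regex=False is the simpler choice.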


Pandas – Search for String in DataFrame Column

In this tutorial, we will look at how to search for a string (or a substring) in a pandas dataframe column with the help of some examples.

How to check if a pandas series contains a string?

Search for string in a pandas column

You can use the pandas.Series.str.contains() function to search for the presence of a string in a pandas series (or column of a dataframe). You can also pass a regex to check for more custom patterns in the series values. The following is the syntax:

# using the pd.Series.str.contains() function with default parameters
df['Col'].str.contains("string_or_pattern", case=True, flags=0, na=None, regex=True)

It returns a boolean Series or Index based on whether a given pattern or regex is contained within a string of a Series or Index.

The case parameter tells whether to match the string in a case-sensitive manner or not.

The regex parameter tells the function whether to treat the passed value as a regex pattern (True, the default) or as a literal string (False).

The flags parameter can be used to pass additional flags for the regex match through to the re module (for example re.IGNORECASE).

Let’s look at some examples to see the above syntax in action.

Search for string in pandas column or series

Pass the string you want to check for as an argument.

import pandas as pd

# create a pandas series
players = pd.Series(['Rahul Dravid', 'Yuvraj Singh', 'Sachin Tendulkar', 'Mahendra Singh Dhoni', 'Virat Kohli'])

# names with 'Singh'
print(players.str.contains('Singh', regex=False))

0    False
1     True
2    False
3     True
4    False
dtype: bool

Here, we created a pandas series containing the names of some of India’s top cricketers. We then find the names containing the word “Singh” using the str.contains() function. We also pass regex=False so that the passed value is not treated as a regex pattern. In this case, you can also go with the default regex=True as it would not make any difference.

Also note that we get the result as a pandas series of boolean values representing which of the values contained the given string. You can use this series to filter values in the original series.

For example, let’s only print out the names containing the word “Singh”:

# display the type
type(players.str.contains('Singh'))

# filter for names containing 'Singh'
print(players[players.str.contains('Singh')])

1            Yuvraj Singh
3    Mahendra Singh Dhoni
dtype: object

Here we applied the .str.contains() function on a pandas series. Note that you can also apply it on individual columns of a pandas dataframe.

# create a dataframe
df = pd.DataFrame({
    'Name': ['Rahul Dravid', 'Yuvraj Singh', 'Sachin Tendulkar', 'Mahendra Singh Dhoni', 'Virat Kohli'],
    'IPL Team': ['RR', 'KXIP', 'MI', 'CSK', 'RCB']
})

# filter for names that have "Singh"
print(df[df['Name'].str.contains('Singh', regex=False)])

                   Name IPL Team
1          Yuvraj Singh     KXIP
3  Mahendra Singh Dhoni      CSK
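Instead of filtering rows, the boolean result can also be stored as a new column, which is one way to derive a simple category from existing text. The sketch below rebuilds the same dataframe; the column name 'Has Singh' is a hypothetical choice made for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Rahul Dravid', 'Yuvraj Singh', 'Sachin Tendulkar',
             'Mahendra Singh Dhoni', 'Virat Kohli'],
    'IPL Team': ['RR', 'KXIP', 'MI', 'CSK', 'RCB'],
})

# Store the match result as a boolean flag column.
# 'Has Singh' is a made-up column name used only for illustration.
df['Has Singh'] = df['Name'].str.contains('Singh', regex=False)

print(df)
```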

Search for string irrespective of case

By default, the pd.Series.str.contains() function’s string searches are case sensitive.

# create a pandas series
players = pd.Series(['Rahul Dravid', 'yuvraj singh', 'Sachin Tendulkar', 'Mahendra Singh Dhoni', 'Virat Kohli'])

# names with 'Singh' (case sensitive by default)
print(players.str.contains('Singh', regex=False))

0    False
1    False
2    False
3     True
4    False
dtype: bool

We get False for “yuvraj singh” because it does not contain the word “Singh” in the same case.

You can, however, make the function search for strings irrespective of the case by passing False to the case parameter.

# create a pandas series
players = pd.Series(['Rahul Dravid', 'yuvraj singh', 'Sachin Tendulkar', 'Mahendra Singh Dhoni', 'Virat Kohli'])

# names with 'Singh' irrespective of case
print(players.str.contains('Singh', regex=False, case=False))

0    False
1     True
2    False
3     True
4    False
dtype: bool
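When the pattern is treated as a regex (the default), passing flags=re.IGNORECASE is another way to ignore case. The short sketch below, using the same players series, checks that both approaches give the same result for this data.

```python
import re
import pandas as pd

players = pd.Series(['Rahul Dravid', 'yuvraj singh', 'Sachin Tendulkar',
                     'Mahendra Singh Dhoni', 'Virat Kohli'])

# Case-insensitive search via the case parameter.
by_case = players.str.contains('Singh', case=False)

# Case-insensitive search via a regex flag (regex=True is the default).
by_flag = players.str.contains('Singh', flags=re.IGNORECASE)

# Compare the two boolean results element by element.
print(by_case.equals(by_flag))
```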

Search for a matching regex pattern in column

You can also pass regex patterns to the above function for searching more complex values/patterns in the series.

# create a pandas series
balls = pd.Series(['wide', 'no ball', 'wicket', 'dot ball', 'runs'])

# check for wickets or dot balls
good_balls = balls.str.contains('wicket|dot ball', regex=True)

# display good balls
print(good_balls)

0    False
1    False
2     True
3     True
4    False
dtype: bool

Here we created a pandas series with values representing different outcomes when a bowler bowls a ball in cricket. Let’s say we want to find all the good balls, which can be defined as either a wicket or a dot ball. We used the regex pattern ‘wicket|dot ball’ to match either “wicket” or “dot ball”.

You can similarly write more complex regex patterns depending on your use-case to match values in a pandas series.
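As one illustration of a slightly richer pattern, the sketch below looks for entries that state a run count, such as "4 runs" or "1 run". The series values and the pattern are made up for this example and are not part of the original tutorial.

```python
import pandas as pd

# Hypothetical ball-by-ball outcomes.
outcomes = pd.Series(['4 runs', 'no ball', '1 run', 'wicket', 'runs'])

# \b    word boundary
# \d+   one or more digits
# runs? the word "run" with an optional trailing "s"
pattern = r'\b\d+ runs?\b'

scored = outcomes.str.contains(pattern, regex=True)
print(outcomes[scored])
```

Writing the pattern as a raw string (the r prefix) avoids having to double the backslashes.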

For more on the pd.Series.str.contains() function, refer to its documentation.

With this, we come to the end of this tutorial. The code examples and results presented here were run in a Jupyter Notebook with a Python (version 3.8.3) kernel and pandas version 1.0.5.


Author

Piyush is a data professional passionate about using data to understand things better and make informed decisions. He has experience working as a Data Scientist in the consulting domain and holds an engineering degree from IIT Roorkee. His hobbies include watching cricket, reading, and working on side projects.

