Python split and regex

Contents

Python Regex Split String Using re.split()
How to use re.split() function
Syntax
Return value
Regex example to split a string into words
Limit the number of splits
Regex to Split string with multiple delimiters
Regex to split string on five delimiters
Regex to split String into words with multiple word boundary delimiters
Split strings by delimiters and specific word
Regex split a string and keep the separators
Regex split string by ignoring case
String’s split() method vs. regex split()
Split string by upper case words

Python Regex Split String Using re.split()

In this article, you will learn how to split a string based on a regular expression pattern in Python. The re.split() function in Python’s re module splits a string at each occurrence of the regex pattern and returns a list of the resulting substrings.

After reading this article you will be able to perform the following split operations using regex in Python.

Operation	Description
re.split(pattern, str)	Split the string at each occurrence of the pattern.
re.split(pattern, str, maxsplit=2)	Split the string at the occurrences of the pattern, limiting the number of splits to 2.
re.split(p1|p2, str)	Split the string by multiple delimiter patterns (p1 and p2).

Python regex split operations

How to use re.split() function

Before moving further, let’s see the syntax of Python’s re.split() method.

Syntax

re.split(pattern, string, maxsplit=0, flags=0)

The regular expression pattern and the target string are mandatory arguments. maxsplit and flags are optional.

pattern: the regular expression pattern used to split the target string.
string: the target string (i.e., the string we want to split).
maxsplit: the maximum number of splits to perform. The default, 0, means no limit. If maxsplit is 2, at most two splits occur, and the remainder of the string is returned as the final element of the list.
flags: optional regex flags that modify matching. By default, no flags are applied.
There are many regex flags we can use. For example, re.I performs case-insensitive matching.

Note: If capturing parentheses are used in the pattern, then the text of all groups in the pattern is also returned as part of the resulting list.

Return value

It splits the target string at each match of the regular expression pattern and returns the pieces between the matches as a list of strings.

If the pattern is not found in the target string, the string is not split, but re.split() still returns a list. That list contains a single element: the target string itself.
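The no-match case described above can be checked with a short sketch. The sample text and the \d+ pattern are illustrative assumptions, not taken from the article.

```python
import re

# Hypothetical sample text that contains no digits at all
text = "no digits here"

# The pattern \d+ never matches, so nothing is split
parts = re.split(r"\d+", text)

print(parts)       # ['no digits here']
print(len(parts))  # 1
```

Because the result is always a list, code that iterates over it works the same way whether or not a split happened.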

Regex example to split a string into words

Now, let’s see how to use re.split() with the help of a simple example. In this example, we will split the target string at each white-space character using the \s special sequence.

Let’s add the + metacharacter at the end of \s. Now the \s+ regex pattern splits the target string at each run of one or more whitespace characters. Let’s see the demo.

import re

target_string = "My name is maximums and my luck numbers are 12 45 78"

# Split on one or more whitespace characters
word_list = re.split(r"\s+", target_string)
print(word_list)

# Output: ['My', 'name', 'is', 'maximums', 'and', 'my', 'luck', 'numbers', 'are', '12', '45', '78']

As you can see in the output, we got the list of words separated by whitespace.

Limit the number of splits

The maxsplit parameter of re.split() is used to define how many splits you want to perform.

In simple words, if maxsplit is 2, at most two splits are performed, and the remainder of the string is returned as the final element of the list.

So let’s take a simple example to split a string on the occurrence of any non-digit. Here we will use the \D special sequence that matches any non-digit character.

import re

target_string = "12-45-78"

# Split only on the first occurrence (maxsplit is 1)
result = re.split(r"\D", target_string, maxsplit=1)
print(result)
# Output: ['12', '45-78']

# Allow up to three splits (maxsplit is 3); the string has only two delimiters
result = re.split(r"\D", target_string, maxsplit=3)
print(result)
# Output: ['12', '45', '78']

Regex to Split string with multiple delimiters

In this section, we’ll learn how to use regex to split a string on multiple delimiters in Python.

For example, using the regular expression re.split() method, we can split the string either by the comma or by space.

The regex split() method gives you more flexibility: the delimiter is a pattern, and that pattern can describe several different delimiters. The string’s split() method, by contrast, accepts only one fixed string as the delimiter.

Let’s take a simple example to split the string either by the hyphen or by the comma.

Example to split string by two delimiters

import re

target_string = "12,45,78,85-17-89"

# Two delimiters: hyphen and comma
# Use the OR (|) operator to combine the two patterns
result = re.split(r"-|,", target_string)
print(result)

# Output: ['12', '45', '78', '85', '17', '89']

Regex to split string on five delimiters

Here we will use regex to split a string on five delimiters: the dot, comma, semicolon, hyphen, and whitespace, each optionally followed by any amount of extra whitespace.

import re

target_string = "PYnative dot.com; is for, Python-developer"

# Pattern to split: [-;,.\s]\s*
result = re.split(r"[-;,.\s]\s*", target_string)
print(result)

# Output: ['PYnative', 'dot', 'com', 'is', 'for', 'Python', 'developer']

Note: we used the [] metacharacters to define a character class, a set of delimiter characters. A character class matches any single character listed in the brackets. For example, [-;,.\s] matches a hyphen, semicolon, comma, dot, or whitespace character.
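As a rough illustration of what the character class stands for, the sketch below rewrites it as an explicit alternation and checks that both patterns split the same way. The alternation form is an assumption added for comparison only; note that the dot has to be escaped once it is outside the brackets.

```python
import re

target_string = "PYnative dot.com; is for, Python-developer"

# Character class, as used in the article
class_result = re.split(r"[-;,.\s]\s*", target_string)

# Equivalent alternation (illustrative): the dot must be escaped here
alt_result = re.split(r"(?:-|;|,|\.|\s)\s*", target_string)

print(class_result)
print(class_result == alt_result)  # True
```

The character class is usually the easier form to read and maintain when every delimiter is a single character.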

Regex to split String into words with multiple word boundary delimiters

In this example, we will use the [\b\W\b]+ regex pattern to split on any run of non-alphanumeric delimiters, which yields a list of alphanumeric/word tokens. Note that inside square brackets \b stands for the backspace character rather than a word boundary, so this pattern is equivalent to \W+.

Note: \W is a regex special sequence that matches any non-word character, that is, any character other than a letter, a digit, or the underscore.

import re

target_string = "PYnative! dot.com; is for, Python-developer?"

result = re.split(r"[\b\W\b]+", target_string)
print(result)

# Output: ['PYnative', 'dot', 'com', 'is', 'for', 'Python', 'developer', '']
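The trailing '' in the output appears because the string ends with a delimiter. A minimal sketch of two ways to avoid it follows; the \W+ and \w+ patterns and the variable names are illustrative choices, not part of the original example.

```python
import re

target_string = "PYnative! dot.com; is for, Python-developer?"

# Option 1: split, then drop empty strings
tokens = [t for t in re.split(r"\W+", target_string) if t]

# Option 2: match the words directly instead of splitting on delimiters
words = re.findall(r"\w+", target_string)

print(tokens)  # ['PYnative', 'dot', 'com', 'is', 'for', 'Python', 'developer']
print(words)   # ['PYnative', 'dot', 'com', 'is', 'for', 'Python', 'developer']
```

Which option reads better depends on whether it is easier to describe the delimiters or the tokens.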

Split strings by delimiters and specific word

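import re

text = "12, and45,78and85-17and89-97"

# Split on the word 'and', or on runs of whitespace, commas, and hyphens
result = re.split(r"and|[\s,-]+", text)
print(result)

# Output: ['12', '', '45', '78', '85', '17', '89', '97']
# The empty string appears because ", " and "and" are adjacent delimiters with nothing between them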

Regex split a string and keep the separators

As noted at the start of the article, if capturing parentheses are used in the pattern, then the text of all groups in the pattern is also returned as part of the resulting list.

Note: You capture a group by writing a pattern inside parentheses ( ).

In simple terms, be careful with re.split() when the regular expression pattern contains capturing groups: the text matched by each group is included in the resulting list.

This is helpful when you want to keep the separators/delimiters in the resulting list.

import re

target_string = "12-45-78."

# Split on non-digits
result = re.split(r"\D+", target_string)
print(result)
# Output: ['12', '45', '78', '']

# Split on non-digits and keep the separators
# (the pattern is written in parentheses)
result = re.split(r"(\D+)", target_string)
print(result)
# Output: ['12', '-', '45', '-', '78', '.', '']
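One consequence of keeping the separators is that no text is lost, so the pieces can be joined back into the original string. The sketch below illustrates this; the slicing into fields and separators is an added illustration and relies on the pattern having exactly one capturing group.

```python
import re

target_string = "12-45-78."

pieces = re.split(r"(\D+)", target_string)

# Joining the pieces restores the original string
rebuilt = "".join(pieces)
print(rebuilt == target_string)  # True

# With a single capturing group, fields and separators alternate
fields = pieces[0::2]      # ['12', '45', '78', '']
separators = pieces[1::2]  # ['-', '-', '.']
print(fields, separators)
```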

Regex split string by ignoring case

There is a possibility that the string contains lowercase and upper case letters.

For example, you want to split a string on the specific characters or range of characters, but you don’t know whether that character/word is an uppercase or lowercase letter or a combination of both. Here you can use the re.IGNORECASE or re.I flag inside the re.split() method to perform case-insensitive splits.

import re

# Without ignoring case
print(re.split('[a-z]+', "7J8e7Ss3a"))
# Output: ['7J8', '7S', '3', '']

# With ignoring case
print(re.split('[a-z]+', "7J8e7Ss3a", flags=re.IGNORECASE))
# Output: ['7', '8', '7', '3', '']

# Without ignoring case
print(re.split(r"emma", "Emma knows Python.EMMA loves Data Science"))
# Output: ['Emma knows Python.EMMA loves Data Science']

# With ignoring case
print(re.split(r"emma", "Emma knows Python.EMMA loves Data Science", flags=re.IGNORECASE))
# Output: ['', ' knows Python.', ' loves Data Science']
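The same case-insensitive split can also be written in two other ways. This is a sketch of alternatives rather than something the article itself uses: a pattern precompiled with re.compile(), and the inline (?i) flag placed at the start of the pattern. The variable name splitter is hypothetical.

```python
import re

text = "Emma knows Python.EMMA loves Data Science"

# Precompile the pattern with the flag, then call its split() method
splitter = re.compile(r"emma", flags=re.IGNORECASE)
compiled_result = splitter.split(text)

# Use the inline flag (?i) at the start of the pattern instead of flags=
inline_result = re.split(r"(?i)emma", text)

print(compiled_result)  # ['', ' knows Python.', ' loves Data Science']
print(inline_result)    # ['', ' knows Python.', ' loves Data Science']
```

A compiled pattern is convenient when the same split is applied to many strings.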

String’s split() method vs. regex split()

Now consider the built-in split() method of Python strings. It splits a string by a specific delimiter, and that delimiter is a fixed string that you pass to the method.

The difference between the string split() method and re.split() is significant. re.split() is far more flexible, which can prove very useful for some tasks.

With re.split(), you can specify a pattern for the delimiter, while the string split() method accepts only a fixed string (or, when called with no argument, splits on runs of whitespace).
Also, with re.split() we can split a string by multiple delimiters in a single call.
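A small side-by-side sketch of the difference follows. The sample string and the delimiter set are assumptions chosen for illustration.

```python
import re

# Hypothetical input that mixes three different delimiters
text = "12,45;78 85"

# str.split() takes one fixed delimiter, so only the comma is handled
str_result = text.split(",")

# re.split() takes a pattern, so comma, semicolon, and whitespace are handled together
re_result = re.split(r"[,;\s]", text)

print(str_result)  # ['12', '45;78 85']
print(re_result)   # ['12', '45', '78', '85']
```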

Split string by upper case words

For example, suppose you have a string like “EMMA loves PYTHON and ML” and you want to split it before each uppercase word to get a result like ['EMMA loves', 'PYTHON and', 'ML'].

import re

print(re.split(r"\s(?=[A-Z])", "EMMA loves PYTHON and ML"))
# Output: ['EMMA loves', 'PYTHON and', 'ML']

Explanation

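We used the lookahead regex \s(?=[A-Z]).
This regex matches every whitespace character ( \s ) that is immediately followed by an uppercase letter ( [A-Z] ). The lookahead (?=...) only checks the next character without consuming it, so the uppercase letter stays in the resulting substring.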
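To see why the lookahead matters, it helps to compare it with a pattern that matches the uppercase letter directly. The second pattern below is an illustrative counterexample, not something the article recommends.

```python
import re

text = "EMMA loves PYTHON and ML"

# Lookahead: the uppercase letter is checked but not consumed
with_lookahead = re.split(r"\s(?=[A-Z])", text)

# No lookahead: the uppercase letter is part of the delimiter and is lost
without_lookahead = re.split(r"\s[A-Z]", text)

print(with_lookahead)     # ['EMMA loves', 'PYTHON and', 'ML']
print(without_lookahead)  # ['EMMA loves', 'YTHON and', 'L']
```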
