Python re multiple flags

Flags

Just like options change the default behavior of command line tools, flags are used to change aspects of RE behavior. You have already seen flags for ignoring case and changing the behavior of line anchors. Flags can be applied to the entire RE using the optional flags argument or to a particular portion of the RE using special groups. These two forms can also be mixed. In regular expression parlance, flags are also known as modifiers.

Flags already seen will be discussed again in this chapter for completeness’ sake. You’ll also learn how to combine multiple flags.

re.IGNORECASE

First up, the flag to ignore case while matching letters. When the flags argument is used, this can be specified as the re.I or re.IGNORECASE constant.

>>> bool(re.search(r'cat', 'Cat', flags=re.IGNORECASE))
True
>>> re.findall(r'c.t', 'Cat cot CATER ScUtTLe', flags=re.I)
['Cat', 'cot', 'CAT', 'cUt']

# without flag, you need to use: r'[a-zA-Z]+'
# with flag, can also use: r'[A-Z]+'
>>> re.findall(r'[a-z]+', 'Sample123string42with777numbers', flags=re.I)
['Sample', 'string', 'with', 'numbers']

re.DOTALL

Use re.S or re.DOTALL to allow the . metacharacter to match newline characters as well.

>>> re.sub(r'the.*ice', 'X', 'Hi there\nHave a Nice Day')
'Hi there\nHave a Nice Day'

# re.S flag will allow newline character to be matched as well
>>> re.sub(r'the.*ice', 'X', 'Hi there\nHave a Nice Day', flags=re.S)
'Hi X Day'

Multiple flags can be combined using the bitwise OR operator.
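As a minimal sketch of combining flags, the example below reuses the sample string from the re.DOTALL section. The pattern r'the.*day' and the variable names are illustrative assumptions, not taken from the original text.

```python
import re

text = 'Hi there\nHave a Nice Day'

# re.S lets '.' cross the newline, re.I lets 'day' match 'Day'
combined = re.sub(r'the.*day', 'Bye', text, flags=re.S | re.I)
print(combined)  # Hi Bye

# with only one of the two flags, the pattern cannot match
only_s = re.sub(r'the.*day', 'Bye', text, flags=re.S)
only_i = re.sub(r'the.*day', 'Bye', text, flags=re.I)
print(only_s == text, only_i == text)  # True True
```

Both flags are needed here: the match has to span two lines and also ignore the capital D.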

re.MULTILINE

As seen earlier, the re.M or re.MULTILINE flag allows the ^ and $ anchors to work line wise.

>>> bool(re.search(r'^top', 'hi hello\ntop spot', flags=re.M))
True

# check if any line in the string ends with 'ar'
>>> bool(re.search(r'ar$', 'spare\npar\ndare', flags=re.M))
True

re.VERBOSE

The re.X or re.VERBOSE flag is another provision, like named capture groups, to help add clarity to RE definitions. This flag allows you to use literal whitespace for alignment and to add comments after the # character, so that a complex RE can be broken down across multiple lines.

>>> pat = re.compile(r'''
...         \A(                 # group-1, captures first 3 columns
...             (?:[^,]+,){3}   # non-capturing group to get the 3 columns
...           )
...         ([^,]+)             # group-2, captures 4th column
...         ''', flags=re.X)
>>> pat.sub(r'\1(\2)', '1,2,3,4,5,6,7')
'1,2,3,(4),5,6,7'

There are a few workarounds if you need to match whitespace and # characters literally. Here’s the relevant quote from documentation:

Whitespace within the pattern is ignored, except when in a character class, or when preceded by an unescaped backslash, or within tokens like *?, (?: or (?P<...>. When a line contains a # that is not in a character class and is not preceded by an unescaped backslash, all characters from the leftmost such # through the end of the line are ignored.

>>> bool(re.search(r't\ a', 'cat and dog', flags=re.X))
True
>>> bool(re.search(r't[ ]a', 'cat and dog', flags=re.X))
True
>>> bool(re.search(r't\x20a', 'cat and dog', flags=re.X))
True

>>> re.search(r'a#b', 'apple a#b 123', flags=re.X)[0]
'a'
>>> re.search(r'a\#b', 'apple a#b 123', flags=re.X)[0]
'a#b'

Inline comments

Comments can also be added using the (?#comment) special group. This is independent of the re.X flag.

>>> pat = re.compile(r'\A((?:[^,]+,){3})(?#3-cols)([^,]+)(?#4th-col)')
>>> pat.sub(r'\1(\2)', '1,2,3,4,5,6,7')
'1,2,3,(4),5,6,7'

Inline flags

  • (?flags:pat) will apply flags only for this portion
  • (?-flags:pat) will negate flags only for this portion
  • (?flags-flags:pat) will apply and negate particular flags only for this portion
  • (?flags) will apply flags for the whole RE definition
    • can be specified only at the start of RE definition
    • if anchors are needed, they should be specified after this group

    In these ways, flags can be applied precisely where they are needed. The flags are given as the single-letter lowercase version of the short form constants: for example, i for re.I, s for re.S and so on, except L for re.L or re.LOCALE (discussed in the re.ASCII section). As the examples below show, these groups do not act as capture groups.

    >>> re.findall(r'Cat[a-z]*\b', 'Cat SCatTeR CATER cAts')
    ['Cat']

    # case-insensitive only for the '[a-z]*' portion
    >>> re.findall(r'Cat(?i:[a-z]*)\b', 'Cat SCatTeR CATER cAts')
    ['Cat', 'CatTeR']

    # case-insensitive for the whole RE definition using flags argument
    >>> re.findall(r'Cat[a-z]*\b', 'Cat SCatTeR CATER cAts', flags=re.I)
    ['Cat', 'CatTeR', 'CATER', 'cAts']

    # case-insensitive for the whole RE definition using inline flags
    >>> re.findall(r'(?i)Cat[a-z]*\b', 'Cat SCatTeR CATER cAts')
    ['Cat', 'CatTeR', 'CATER', 'cAts']

    # case-sensitive only for the 'Cat' portion
    >>> re.findall(r'(?-i:Cat)[a-z]*\b', 'Cat SCatTeR CATER cAts', flags=re.I)
    ['Cat', 'CatTeR']
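The (?flags-flags:pat) form from the list above applies and negates flags in a single group. Here is a small sketch of that form. The sample string and variable names are made up for illustration.

```python
import re

sample = 'A\nb a-B axb'

# globally: '.' matches newline (re.S) and case is ignored (re.I)
everywhere = re.findall(r'a.b', sample, flags=re.S | re.I)
print(everywhere)  # ['A\nb', 'a-B', 'axb']

# (?i-s:...) turns i on and s off for this group only,
# overriding the re.S passed through the flags argument
local = re.findall(r'(?i-s:a.b)', sample, flags=re.S)
print(local)  # ['a-B', 'axb']
```

In the second call the dot can no longer match the newline inside the group, so the 'A\nb' match disappears while the case-insensitive matches remain.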

    Cheatsheet and Summary

    This chapter showed some of the flags that can be used to change the default behavior of RE definitions. More special groupings were also covered.

    Exercises

    a) Remove from the first occurrence of hat to the last occurrence of it for the given input strings. Match these markers case insensitively.

    >>> s2 = 'it this hat is sliced HIT.'
    >>> pat = re.compile()      ##### add your solution here
    >>> pat.sub('', s1)
    'But Cool Te'
    >>> pat.sub('', s2)
    'it this .'

    b) Delete from start if it is at the beginning of a line up to the next occurrence of the end at the end of a line. Match these markers case insensitively.

    >>> pat = re.compile()      ##### add your solution here
    >>> print(pat.sub('', para))
    good start hi there 42 bye

    >>> s2 = 'Nice and cool this is'
    >>> s3 = 'What is so nice and cool about This?'
    >>> s4 = 'nice,cool,This'
    >>> s5 = 'not nice This?'
    >>> s6 = 'This is not cool'
    >>> pat = re.compile()      ##### add your solution here
    >>> bool(pat.search(s1))
    True
    >>> bool(pat.search(s2))
    False
    >>> bool(pat.search(s3))
    True
    >>> bool(pat.search(s4))
    True
    >>> bool(pat.search(s5))
    False
    >>> bool(pat.search(s6))
    False

    d) For the given input strings, match if the string begins with Th and also contains a line that starts with There .

    >>> s2 = 'This is a mess\nYeah?\nThereeeee'
    >>> s3 = 'Oh\nThere goes the fun'
    >>> s4 = 'This is not\ngood\nno There'
    >>> pat = re.compile()      ##### add your solution here
    >>> bool(pat.search(s1))
    True
    >>> bool(pat.search(s2))
    True
    >>> bool(pat.search(s3))
    False
    >>> bool(pat.search(s4))
    False

    • re.compile(r'\Aden|ly\Z', flags=re.DEBUG)
    • re.compile(r'\b(0x)?[\da-f]+\b', flags=re.DEBUG)
    • re.compile(r'\b(?:0x)?[\da-f]+\b', flags=re.I|re.DEBUG)

    Using regular expression flags in Python

    This week’s post is about regular expression (regex) flags. You will learn how to use regex flags to:

    • Add comments to your regular expressions.
    • Do case-insensitive matching.
    • Allow patterns to match specific lines instead of the whole text.
    • Match patterns spanning over multiple lines.

    You need a basic understanding of regexes to read this post. If you want to learn the basics of regexes, read this tutorial:

    Python’s built-in regex module is re.

    What are regex flags?

    Regex flags allow useful regex features to be turned on. E.g.

    Allow case-insensitive matching so that "dave" is treated the same as "Dave".

    Four useful regex flags in Python are:

    1. VERBOSE. Allow inline comments and extra whitespace.
    2. IGNORECASE. Do case-insensitive matches.
    3. MULTILINE. Allow anchors ( ^ and $ ) to match the beginnings and ends of lines instead of matching the beginning and end of the whole text.
    4. DOTALL. Allow dot ( . ) to match any character, including a newline. (The default behavior of dot is to match anything, except for a newline.)

    How can I use regex flags?

    Each regex flag can be activated in three different ways:

    • Activated with the long argument name (e.g. re.IGNORECASE ).
    • Activated with the short argument name (e.g. re.I ).
    • Activated with the inline name (e.g. "(?i)" ).

    To use short and long argument names, you pass them as arguments to re.compile, re.search, re.match, re.fullmatch, re.split, re.findall, re.finditer, re.sub, and re.subn. E.g.

    import re

    re.match("dave", "Dave")                       # no match: returns None
    re.match("dave", "Dave", flags=re.IGNORECASE)  # matches "Dave"
    re.match("dave", "Dave", flags=re.I)           # matches "Dave"
    re.findall("dave", "my friend Dave is named dave.", flags=re.I)  # ['Dave', 'dave']
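The same flags argument is accepted by re.compile, which is convenient when one pattern is reused several times. The following is a small sketch; the variable name dave and the sample strings are illustrative assumptions.

```python
import re

# compile once with the flag, then reuse the pattern object
dave = re.compile("dave", flags=re.IGNORECASE)

found = dave.findall("my friend Dave is named dave.")
print(found)  # ['Dave', 'dave']

replaced = dave.sub("Sam", "DAVE and dave")
print(replaced)  # Sam and Sam

parts = dave.split("aDaveb")
print(parts)  # ['a', 'b']
```

The flag is stored with the compiled pattern, so the method calls do not repeat it.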

    Using short and long arguments, flags can be combined using the | operator. E.g.

    text = """
    Dave is my friend.
    dave is named dave.
    Dave is dave?
    """
    re.findall("^dave", text, flags=re.I)         # ^ only matches at the start of the text
    re.findall("^dave", text, flags=re.M)         # ^ matches at each line, case-sensitive
    re.findall("^dave", text, flags=re.I | re.M)  # both behaviors together

    To use inline flag names, include them in the regex itself.

    There are two ways to use inline flag names:

    • Globally, which turns the flag on for the entire regex.
    • Locally, which turns the flag on or off for part of a regex.

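    To use an inline flag name globally, write it like "(?i)" and include it at the beginning of the regex. Placing it anywhere else, as in "dave(?i)", is deprecated and triggers a warning (since Python 3.11 it is an error instead):

    DeprecationWarning: Flags not at the start of the expression 'dave(?i)'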
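A minimal sketch of the global inline form, together with a check of what happens when the flag is misplaced. The helper misplaced_flag_outcome is a hypothetical name introduced only for this illustration; the outcome depends on the Python version (a warning before 3.11, an error from 3.11 on), so no single result is assumed.

```python
import re
import warnings

# global inline flag at the very start of the regex
first = re.match("(?i)dave", "Dave").group(0)
print(first)  # Dave

def misplaced_flag_outcome(pattern):
    """Report whether compiling pattern raises an error, warns, or succeeds."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        try:
            re.compile(pattern)
        except re.error:
            return "error"
    return "warning" if caught else "ok"

outcome = misplaced_flag_outcome("dave(?i)")
print(outcome)
```

Either way, keeping global inline flags at the start of the regex avoids the problem.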

    To use an inline flag name locally, write it like "(?i:...)" instead of "(?i)...". E.g.

    re.match("hello (?i:dave)", "HELLO Dave")  # no match: "hello" is still case-sensitive
    re.match("hello (?i:dave)", "hello Dave")  # matches "hello Dave"

    To turn a local flag off, write it like "(?-i:...)" instead of "(?i:...)". E.g.

    re.match("(?i)hello (?-i:there) dave", "HELLO THERE DAVE")  # no match: "there" is case-sensitive
    re.match("(?i)hello (?-i:there) dave", "HELLO there DAVE")  # matches

    You can write multiple inline flags like "(?i)(?m)...", and you can also combine them like "(?im)...". E.g.

    text = """
    Dave is my friend.
    dave is named dave.
    Dave is dave?
    """
    re.findall("(?i)(?m)^dave", text)
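As a sketch of the equivalence described above, the three spellings below are expected to give the same result. The text assumes one sentence per line, and the variable names are illustrative.

```python
import re

text = "\nDave is my friend.\ndave is named dave.\nDave is dave?\n"

separate = re.findall("(?i)(?m)^dave", text)                 # two inline groups
combined = re.findall("(?im)^dave", text)                    # one combined group
via_argument = re.findall("^dave", text, flags=re.I | re.M)  # flags argument

print(combined)  # ['Dave', 'dave', 'Dave']
print(separate == combined == via_argument)  # True
```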

    The regex flag VERBOSE

    The VERBOSE flag allows inline comments and extra whitespace. E.g.

    pattern = """(?x)
    from [ ]+
    [0-9:]+   # start time
    [ ]+ to [ ]+
    [0-9:]+   # end time
    """
    re.search(pattern, "Event: Lunch from 10 to 11")  # matches "from 10 to 11"

    If you want to match whitespace, you must explicitly denote it using "[ ]" or "\\t". (See "[ ]+" in the above pattern.)
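A short sketch of how whitespace behaves under the VERBOSE flag. The patterns and sample strings here are made up for illustration. Backslash-space is another way to keep a literal space, per the re documentation.

```python
import re

target = "to be"

# the spaces in the pattern are ignored, so it is really 'tobe'
ignored = re.search("(?x) to be", target)
print(ignored)  # None

# a space inside a character class is kept
char_class = re.search("(?x) to [ ] be", target)
print(char_class.group(0))  # to be

# a space preceded by a backslash is kept as well
escaped = re.search(r"(?x) to \  be", target)
print(escaped.group(0))  # to be

# "\\t" in a regular string is the regex escape \t, which matches a tab
tab = re.search("(?x) a \\t b", "a\tb")
print(tab is not None)  # True
```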

    The benefit of using the VERBOSE flag is that you can create regexes that are more readable and easier to maintain for you and your coworkers. E.g.

    """(?x)
    M ( CM | CD | D?C )
      ( XC | XL | L?X )
      ( IX | IV | V?I )
    """

    The regex flag IGNORECASE

    The IGNORECASE flag makes all matching case-insensitive. E.g.

    sql_code = """
    SELECT Students.name
    FROM Packages P1
    Inner Join Friends on Friends.id = P1.id
    INNER JOIN Packages P2 On P2.id = Friends.friend_id
    join join Students ON Students.id = P1.id
    WHERE P2.salary > P1.salary
    order BY P2.salary ;
    """
    sql_keywords = "(?i)select|from|inner join|on|where|order by"
    re.findall(sql_keywords, sql_code)
    ['SELECT', 'FROM', 'Inner Join', 'on', 'INNER JOIN', 'On', 'ON', 'WHERE', 'order BY'] 

    The IGNORECASE flag is useful when the pattern that you are searching for may or may not be capitalized. E.g.

      When you search text for mentions of your friend Dave, you want to match "dave" and "Dave".

    The regex flag MULTILINE

    The MULTILINE flag allows anchors ( ^ and $ ) to match the beginnings and ends of lines instead of matching the beginning and end of the whole text.

    python_code = """\
    def f(x):
        return x + 4

    class Dog:
        def bark(self):
            print("bark")
    """
    python_function = r"^[ ]*def \w+"  # raw string, so \w reaches the regex engine intact
    re.findall(python_function, python_code, flags=re.MULTILINE)

    Without using the MULTILINE flag, only "def f" would match:

    re.findall(python_function, python_code) 

    The MULTILINE flag is useful when the pattern that you are searching for looks at the beginning of a line (or at the end of a line). E.g.

      You want to find all lines in a code file that begin with "def".

    The regex flag DOTALL

    The DOTALL flag allows dot ( . ) to match any character, including a newline. (The default behavior of dot is to match anything, except for a newline.) E.g.

    secret_message = """
    Message Date: DATA[2004-09-03].
    This is a top secret message from the U.S. Government
    about Metal Gear RAY (DATA[610d19f8-9d33-4927-9d30-
    22ea4c546071]). The rendezvous coordinates are DATA[
     35.89421911
     139.94637467
    ].
    """
    data_blob = r"DATA\[.*?\]"  # raw string keeps the backslashes for the regex engine
    re.findall(data_blob, secret_message, flags=re.DOTALL)
    ['DATA[2004-09-03]', 'DATA[610d19f8-9d33-4927-9d30-\n22ea4c546071]', 'DATA[\n 35.89421911\n 139.94637467\n]'] 

    Without using the DOTALL flag, only data blobs that fit on one line would match:

    re.findall(data_blob, secret_message) 

    The DOTALL flag is useful when the pattern that you are searching for may span across multiple lines. E.g.

      You are searching through XML data and want to find CDATA sections. CDATA sections start with "<![CDATA[" and end with "]]>", and their content may span several lines.
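A minimal sketch of the CDATA use case, using the standard XML delimiters <![CDATA[ and ]]>. The sample document and the names xml and cdata are hypothetical, and a regex like this is only a rough way to pull out sections, not a substitute for an XML parser.

```python
import re

xml = """<doc>
<a><![CDATA[first
section]]></a>
<b><![CDATA[second]]></b>
</doc>"""

# non-greedy group so that each section stops at its own ']]>'
cdata = r"<!\[CDATA\[(.*?)\]\]>"

sections = re.findall(cdata, xml, flags=re.DOTALL)
print(sections)  # ['first\nsection', 'second']

# without DOTALL, the section that spans two lines is missed
one_line_only = re.findall(cdata, xml)
print(one_line_only)  # ['second']
```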

    In conclusion.

    In this article, you learned how to use regex flags to improve your regexes. Regex flags are features that can be turned on to allow for things like case-insensitive matching and the ability to add comments to your regex. They allow you to write regexes that are more readable and easier to maintain.

    Create your own regular expressions that use the regex flags that you learned about today: VERBOSE, IGNORECASE, MULTILINE, and DOTALL.

    If you enjoyed this post, let me know. Share this with your friends and stay tuned for next week’s post. See you then!

    Copyright © 2021 John Lekberg
