How to extract the substring between two markers?
Let's say I have a string 'gfgfdAAA1234ZZZuijjk' and I want to extract just the '1234' part. I only know the few characters directly before (AAA) and directly after (ZZZ) the part I am interested in (1234). With sed it is possible to do something like this with a string:
echo "$STRING" | sed -e "s|.*AAA\(.*\)ZZZ.*|\1|"
Using regular expressions (see the re module documentation for further reference):
    import re

    text = 'gfgfdAAA1234ZZZuijjk'
    m = re.search('AAA(.+?)ZZZ', text)
    if m:
        found = m.group(1)
    # found: 1234
    import re

    text = 'gfgfdAAA1234ZZZuijjk'
    try:
        found = re.search('AAA(.+?)ZZZ', text).group(1)
    except AttributeError:
        # AAA, ZZZ not found in the original string
        found = ''  # apply your error handling
    # found: 1234
The second solution is better if the pattern matches most of the time, because it is easier to ask for forgiveness than permission (EAFP).
@Alexander, no, group(0) will return the full matched string: AAA1234ZZZ, and group(1) will return only the characters matched by the first group: 1234
In this expression the ? modifies the + to be non-greedy, i.e. it will match one or more times but as few times as possible, only expanding as necessary. Without the ?, the first group would match gfgfAAA2ZZZkeAAA43ZZZonife as 2ZZZkeAAA43, but with the ? it would only match the 2; searching for multiple matches (or stripping the first match out and searching again) would then also find the 43.
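The greedy versus non-greedy behaviour described in the comment above can be checked directly. This is a small sketch using the comment's own example string; the variable names are only illustrative.

```python
import re

text = 'gfgfAAA2ZZZkeAAA43ZZZonife'

# Greedy: .+ runs to the last ZZZ, so both marker pairs collapse into one match.
greedy = re.findall(r'AAA(.+)ZZZ', text)

# Non-greedy: .+? stops at the first ZZZ that follows each AAA.
lazy = re.findall(r'AAA(.+?)ZZZ', text)

print(greedy)  # ['2ZZZkeAAA43']
print(lazy)    # ['2', '43']
```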
    >>> s = 'gfgfdAAA1234ZZZuijjk'
    >>> start = s.find('AAA') + 3
    >>> end = s.find('ZZZ', start)
    >>> s[start:end]
    '1234'
Then you can use regexps with the re module as well, if you want, but that’s not necessary in your case.
The question seems to imply that the input text will always contain both "AAA" and "ZZZ". If this is not the case, your answer fails horribly (by that I mean it returns something completely wrong instead of an empty string or throwing an exception; think "hello there" as the input string).
Upvoted, but I would use x = 'AAA'; s.find(x) + len(x) instead of s.find('AAA') + 3 for maintainability.
If either of the tokens can't be found in s, s.find will return -1. The slicing operator s[begin:end] will accept that as a valid index and return an undesired substring.
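One way to address the two comments above is to check the result of find before slicing and to derive the offset from the marker's length. The helper below is a minimal sketch; the name extract_between and the choice to return None on failure are assumptions made for illustration, not part of the original answer.

```python
def extract_between(s, left, right):
    """Return the text between left and right, or None if either is missing."""
    start = s.find(left)
    if start == -1:
        return None
    start += len(left)          # skip past the left marker itself
    end = s.find(right, start)  # only look for right after left
    if end == -1:
        return None
    return s[start:end]


print(extract_between('gfgfdAAA1234ZZZuijjk', 'AAA', 'ZZZ'))  # 1234
print(extract_between('hello there', 'AAA', 'ZZZ'))           # None
```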
regular expression
The above as-is will fail with an AttributeError if there are no "AAA" and "ZZZ" in your_text.
string methods
your_text.partition("AAA")[2].partition("ZZZ")[0]
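The above will return an empty string if "AAA" doesn't exist in your_text. If only "ZZZ" is missing, it returns everything after "AAA", because partition puts the whole string in element 0 when the separator is not found.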
This answer probably deserves more up votes. The string method is the most robust way. It does not need a try/except.
Nice, though limited. partition is not regex based, so it only works in this instance because the search string is bounded by fixed literals.
Upvoting for the string method. There is no need for regex in something this simple; most languages have a library function for this.
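str.partition returns a 3-tuple (before, separator, after), and the separator element is empty when the marker was not found. The sketch below uses that to tell a missing marker apart from a genuinely empty result; the helper name between_partition and the None return value are assumptions for illustration.

```python
def between_partition(text, left, right):
    """Return the text between left and right, or None if either is missing."""
    _, found_left, rest = text.partition(left)
    if not found_left:       # left marker absent
        return None
    middle, found_right, _ = rest.partition(right)
    if not found_right:      # right marker absent after left
        return None
    return middle


print(between_partition('gfgfdAAA1234ZZZuijjk', 'AAA', 'ZZZ'))  # 1234
print(between_partition('gfgfdAAA1234uijjk', 'AAA', 'ZZZ'))     # None

# For comparison, the unchecked chain when only the right marker is missing:
unchecked = 'gfgfdAAA1234uijjk'.partition('AAA')[2].partition('ZZZ')[0]
print(unchecked)  # 1234uijjk
```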
Surprised that nobody has mentioned this which is my quick version for one-off scripts:
    >>> x = 'gfgfdAAA1234ZZZuijjk'
    >>> x.split('AAA')[1].split('ZZZ')[0]
    '1234'
Adding an if s.find("ZZZ") > s.find("AAA"): check to it avoids issues if 'ZZZ' isn't in the string, in which case it would return '1234uijjk'.
@tzot's answer (stackoverflow.com/a/4917004/358532) with partition instead of split seems more robust (depending on your needs), as it returns an empty string instead of raising an IndexError when the first substring isn't found.
You can do this with just one line of code (note that this matches any run of digits rather than the text between the markers):
    >>> import re
    >>> re.findall(r'\d+', 'gfgfdAAA1234ZZZuijjk')
    ['1234']
    import re

    print(re.search('AAA(.*?)ZZZ', 'gfgfdAAA1234ZZZuijjk').group(1))
You can use the re module for that:
    >>> import re
    >>> re.compile(".*AAA(.*)ZZZ.*").match("gfgfdAAA1234ZZZuijjk").groups()
    ('1234',)
In Python, extracting a substring from a string can be done using the findall method in the regular expression (re) module.
    >>> import re
    >>> s = 'gfgfdAAA1234ZZZuijjk'
    >>> ss = re.findall('AAA(.+)ZZZ', s)
    >>> print(ss)
    ['1234']
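The regex answers above hard-code AAA and ZZZ into the pattern. If the markers are arbitrary strings, they may contain characters that are special in a regex, so a sketch like the following escapes them with re.escape first. The function name find_all_between is hypothetical, and the non-greedy group is a choice made here so that every marker pair yields its own match.

```python
import re


def find_all_between(text, left, right):
    """Return every non-empty substring found between left and right."""
    # re.escape makes characters such as [ ] . * match literally.
    pattern = re.escape(left) + r'(.+?)' + re.escape(right)
    return re.findall(pattern, text)


print(find_all_between('gfgfdAAA1234ZZZuijjk', 'AAA', 'ZZZ'))  # ['1234']
print(find_all_between('a[1]b[22]c', '[', ']'))                # ['1', '22']
```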
    text = 'I want to find a string between two substrings'
    left = 'find a '
    right = 'between two'
    print(text[text.index(left)+len(left):text.index(right)])
If the text does not include the markers, this throws a "ValueError: substring not found" exception. That is good.
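Because str.index raises ValueError rather than returning -1, a caller can decide explicitly what a missing marker should mean. The following is a sketch only; between_or_default and its default parameter are names invented for this example, and unlike the snippet above it searches for the right marker only after the left one.

```python
def between_or_default(text, left, right, default=None):
    """Return the text between left and right, or default if a marker is missing."""
    try:
        start = text.index(left) + len(left)
        end = text.index(right, start)  # search only after the left marker
    except ValueError:
        return default
    return text[start:end]


text = 'I want to find a string between two substrings'
print(repr(between_or_default(text, 'find a ', 'between two')))  # 'string '
print(between_or_default(text, 'missing', 'between two'))        # None
```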
    >>> s = '/tmp/10508.constantstring'
    >>> s.split('/tmp/')[1].split('constantstring')[0].strip('.')
    '10508'
With sed it is possible to do something like this with a string:
echo "$STRING" | sed -e "s|.*AAA\(.*\)ZZZ.*|\1|"
And this will give me 1234 as a result.
You could do the same with the re.sub function using the same regex.
    >>> re.sub(r'.*AAA(.*)ZZZ.*', r'\1', 'gfgfdAAA1234ZZZuijjk')
    '1234'
In basic sed, capturing groups are written as \(..\), but in Python they are written as (..).
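One thing to keep in mind with the substitution approach is that re.sub returns the input unchanged when the pattern does not match, so a missing marker is not reported. A short sketch of that behaviour, with a re.search check as one possible guard (the variable names are illustrative):

```python
import re

pattern = r'.*AAA(.*)ZZZ.*'

matched = re.sub(pattern, r'\1', 'gfgfdAAA1234ZZZuijjk')
unmatched = re.sub(pattern, r'\1', 'hello there')

print(matched)    # 1234
print(unmatched)  # hello there (no match, so nothing was replaced)

# Guard: test for a match first and fall back to an empty string.
m = re.search(pattern, 'hello there')
guarded = m.group(1) if m else ''
print(repr(guarded))  # ''
```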
You can use this function in your code to find the first occurrence of a substring (by character index). You can also use it to find what comes after a substring.
    def FindSubString(strText, strSubString, Offset=None):
        """Locate strSubString in strText.

        Offset=None returns everything after the substring, Offset=0 returns
        its start index, and any other Offset returns that many characters
        after the substring. Returns -1 if not found or on error.
        """
        try:
            Start = strText.find(strSubString)
            if Start == -1:
                return -1  # Not found
            if Offset is None:
                return strText[Start + len(strSubString):]
            if Offset == 0:
                return Start
            AfterSubString = Start + len(strSubString)
            return strText[AfterSubString:AfterSubString + int(Offset)]
        except Exception:
            return -1

    # Example:
    Text = "Thanks for contributing an answer to Stack Overflow!"
    subText = "to"

    print("Start of first substring in a text:")
    start = FindSubString(Text, subText, 0)
    print(start)
    print("")

    print("Exact substring in a text:")
    print(Text[start:start + len(subText)])
    print("")

    print("What is after substring \"%s\"?" % (subText))
    print(FindSubString(Text, subText))

    # Your answer:
    Text = "gfgfdAAA1234ZZZuijjk"
    subText1 = "AAA"
    subText2 = "ZZZ"

    AfterText1 = FindSubString(Text, subText1, 0) + len(subText1)
    BeforText2 = FindSubString(Text, subText2, 0)

    print("\nYour answer:\n%s" % (Text[AfterText1:BeforText2]))
using regex to get a substring from a string
I have a string in the form: integer, integer, a comma-separated list of strings, integer. For example:
"0, 0, ['REFERENCED', 'UPTODATE', 'LRU'], 1"
I want to return this substring: ['REFERENCED', 'UPTODATE', 'LRU']. I thought of using split(", ") and then joining things together, but that would be too complicated. How can I do that with a regex?
What syntax is that input? Is it perhaps compatible with Python source code? Can you use ast.literal_eval?
Just write a regular expression to capture a group that consists of a [, any characters, and then a ].
    >>> import re
    >>> s = "0, 0, ['REFERENCED', 'UPTODATE', 'LRU'], 1"
    >>> re.search(r'(\[.*\])', s).group(1)
    "['REFERENCED', 'UPTODATE', 'LRU']"
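The pattern above is greedy, which is fine for a single bracketed list. As a hedged illustration of what changes if the input held more than one list (the string s2 is made up for this example and is not from the question), a non-greedy variant keeps the lists separate:

```python
import re

# Hypothetical input with two bracketed lists.
s2 = "0, ['A'], 1, ['B', 'C'], 2"

# Greedy: spans from the first [ to the last ].
greedy = re.search(r'\[.*\]', s2).group(0)

# Non-greedy: each list is matched on its own.
each = re.findall(r'\[.*?\]', s2)

print(greedy)  # ['A'], 1, ['B', 'C']
print(each)    # ["['A']", "['B', 'C']"]
```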
If the input really is this well structured, you could use ast.literal_eval:
    >>> import ast
    >>> ast.literal_eval(s)[2]
    ['REFERENCED', 'UPTODATE', 'LRU']
This safely evaluates a string that contains Python literals and pulls the third element out of the resulting tuple.
    s = "0, 0, ['REFERENCED', 'UPTODATE', 'LRU'], 1"
    start = s.find("[")
    end = s.rfind("]")
    print(s[start:end+1])
    # prints: ['REFERENCED', 'UPTODATE', 'LRU']
There is no need for a regex. Wrap your string in brackets to make a string representation of a list, then use ast.literal_eval to turn it into an actual list.
    import ast

    s = "0, 0, ['REFERENCED', 'UPTODATE', 'LRU'], 1"
    outer_list = ast.literal_eval('[' + s + ']')
    inner_list = outer_list[2]
    print(inner_list)
You may be tempted to use eval instead of ast.literal_eval. Resist the temptation. Using eval is unsafe because it will evaluate any Python expression, even if it contains nasty stuff such as instructions to delete files from your hard drive. ast.literal_eval does not execute code: it only parses literals, that is strings, bytes, numbers, tuples, lists, dicts, sets, booleans, and None.
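To make the difference concrete, the sketch below shows ast.literal_eval accepting the literal from the question and rejecting an expression that contains a function call. The rejected string is a harmless stand-in chosen for illustration, and the variable names are assumptions.

```python
import ast

# A plain literal is parsed into real Python objects.
parsed = ast.literal_eval("[0, 0, ['REFERENCED', 'UPTODATE', 'LRU'], 1]")
print(parsed[2])  # ['REFERENCED', 'UPTODATE', 'LRU']

# Anything that is not a literal (here, a function call) is refused
# with ValueError instead of being executed.
rejected = False
try:
    ast.literal_eval("__import__('os').getcwd()")
except ValueError:
    rejected = True

print(rejected)  # True
```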