- python3: regex, find all substrings that start with and end with a certain string
- 3 Answers
- Read in every line that starts with a certain character from a file
- 3 Answers
- RegEx for matching lines starting with [ and ending with ] [duplicate]
- 1 Answer
- Test
- DEMO
- RegEx
- RegEx Circuit
- Efficiently get all lines starting with given string for a large text file
- 1 Answer
python3: regex, find all substrings that start with and end with a certain string
How can I do this? I’m pretty new to regex, and would really appreciate it if anyone could show different ways to do this with regex and explain what’s going on.
Your expected result is a little short. There are 96 possible substrings that start with a digit and end with a letter.
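The figure of 96 in the comment above can be checked by brute force. This is a minimal sketch that assumes the input is the string used in the answers below and that "letter" and "digit" mean what str.isalpha() and str.isdigit() report:

```python
s = '1253abcd4567efgh8910ijkl'

# Count every (start, end) pair where the start character is a digit
# and some later end character is a letter.
count = sum(
    1
    for i, start in enumerate(s) if start.isdigit()
    for end in s[i + 1:] if end.isalpha()
)

print(count)
```

The answers below return only three of these substrings because findall reports non-overlapping matches, scanning left to right.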
3 Answers
\w matches any word character, that is, digits, letters, and the underscore. You need to use [a-zA-Z] to capture letters only. See this example.
    import re

    a = '1253abcd4567efgh8910ijkl'
    b = re.findall(r'(\d+[A-Za-z]+)', a)
    print(b)
['1253abcd', '4567efgh', '8910ijkl']
\d matches a digit, and \d+ matches one or more consecutive digits. For example:
    >>> re.findall(r'(\d+)', a)
    ['1253', '4567', '8910']
Similarly, [a-zA-Z]+ matches one or more letters.
    >>> re.findall(r'([a-zA-Z]+)', a)
    ['abcd', 'efgh', 'ijkl']
Now put them together to match exactly what you want.
\w matches any alphanumeric character and the underscore; for ASCII text this is equivalent to the set [a-zA-Z0-9_]
So you are actually capturing more than you need. Refine your regular expression a bit:
    >>> re.findall(r'(\d+[a-z]+)', a, re.I)
    ['1253abcd', '4567efgh', '8910ijkl']
The re.I flag makes your expression case-insensitive, so [a-z] matches both uppercase and lowercase letters:
    >>> re.findall(r'(\d+[a-z]+)', '12124adbad13434AGDFDF434348888AAA')
    ['12124adbad']
    >>> re.findall(r'(\d+[a-z]+)', '12124adbad13434AGDFDF434348888AAA', re.I)
    ['12124adbad', '13434AGDFDF', '434348888AAA']
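To make the over-capturing concrete, here is a small sketch. The pattern \d+\w+ is only a guess at what a first attempt might look like (the question's own pattern is not shown); it is compared with the letters-only version from the answers:

```python
import re

a = '1253abcd4567efgh8910ijkl'

# Hypothetical first attempt: \w also matches digits, so \w+ runs to the
# end of the string and everything comes back as one match.
greedy = re.findall(r'\d+\w+', a)

# Restricting the second part to letters stops each match at the next digit.
refined = re.findall(r'\d+[A-Za-z]+', a)

print(greedy)
print(refined)
```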
Read in every line that starts with a certain character from a file
I am trying to read in every line in a file that starts with an ‘X:’. I don’t want to read the ‘X:’ itself just the rest of the line that follows.
    with open("hnr1.abc","r") as file:
        f = file.read()
        for line in f:
            if line.startswith("X:"):
                id.append(f.line[2:])
    print(id)
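3 Answers
    id = []  # assumed to be created before the loop; the original snippet does not show it
    with open("hnr1.abc", "r") as fi:
        for ln in fi:
            if ln.startswith("X:"):
                id.append(ln[2:])
    print(id)
Don't use names like file or line.
Note that append uses the loop variable directly (ln), not an attribute of the file contents (f.line).
Because file.read() loads the whole file into a single string, the for loop was iterating over the data character by character, not line by line.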
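The difference between the two loops can be seen without a real file. This sketch uses io.StringIO as a stand-in for the opened file, and the sample contents are made up for illustration:

```python
import io

sample = "X:1\nT:First tune\nX:2\n"  # made-up file contents

# Iterating over the string returned by read() yields single characters.
chars = [piece for piece in io.StringIO(sample).read()]

# Iterating over the file object itself yields whole lines.
rows = [piece for piece in io.StringIO(sample)]

print(chars[:4])
print(rows)
```

A single character can never start with the two-character string "X:", which is why the original loop appended nothing.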
Also, if I wanted to read the lines that start with "T:", but only the first and second lines that begin with T:, how might I go about that?
Just keep a counter as they are found. That is starting to become a state machine, which is a much bigger discussion.
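A minimal sketch of the counter idea, assuming the comment means "stop after the first two T: lines in the whole file". The helper name first_n_with_prefix and the sample data are hypothetical:

```python
import io

def first_n_with_prefix(lines, prefix, n):
    """Return the text after `prefix` for the first n lines that start with it."""
    found = []
    for ln in lines:
        if len(found) >= n:
            break  # the counter has reached its limit, stop reading
        if ln.startswith(prefix):
            found.append(ln[len(prefix):].rstrip("\n"))
    return found

# io.StringIO stands in for an opened file; the contents are made up.
sample = io.StringIO("X:1\nT:Title\nT:Subtitle\nT:Ignored\nK:D\n")
titles = first_n_with_prefix(sample, "T:", 2)
print(titles)
```

If the count should restart for every X: record instead, the counter would have to be reset whenever a new X: line is seen, which is the state-machine direction the comment hints at.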
Read every line in the file (for loop)
Select lines that start with X:
Slice the line from index 0, so it still begins with X: (ln[0:])
Print the lines that begin with X:
    for ln in input_file:
        if ln.startswith('X:'):
            X_ln = ln[0:]
            print(X_ln)
While this might answer the question, you should edit your answer to include some explanation for why this solves the issue in the question. This makes it more valuable to those who come across the same issue later on.
Your code doesn’t work because it doesn’t answer the original question: "I am trying to read in every line in a file that starts with an ‘X:’. I don’t want to read the ‘X:’ itself, just the rest of the line that follows."
Karel, the code prints every line that begins with ‘X:’. It prints the complete line, not just ‘X:’.
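The point of disagreement in these comments comes down to where the slice starts. A short sketch on a made-up line shows the difference:

```python
ln = "X:12\n"  # made-up example line

whole_line = ln[0:]                  # slice from index 0: the "X:" is still there
rest_only = ln[2:]                   # slice from index 2: the "X:" is dropped
rest_clean = ln[len("X:"):].strip()  # same, with the trailing newline removed

print(repr(whole_line), repr(rest_only), repr(rest_clean))
```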
    for line in f:
        search = line.split()  # call split() to get the list of words
        if search and search[0].startswith("X:"):  # skip empty lines, test the first word
            storagearray.extend(search)
That should give you an array of all the lines you want, but they’ll be split into separate words. Also, you’ll need to have defined storagearray before it is used in the above block of code. It’s an inelegant solution, as I’m a learner myself, but it should do the job!
Edit: If you want to output the lines, simply use Python’s built-in print function:
    print(storagearray)
RegEx for matching lines starting with [ and ending with ] [duplicate]
I am trying to find the lines that start with [ and end with ] in a file. I am using regex, but I am not able to get the result. I have tried regex with various options, e.g., \s, \S, \w and \W.
    import re
    infile=open("C:\\Users\\Downloads\\Files\\processed.csv","r")
    myregex = re.compile(r'(^\[)(\]$)')
    list=[]
    for groups in myregex.findall(infile.read()):
        item=''.join(groups)
        cleanitem=item.replace('\n','')
        list.append(cleanitem)
    print (list)
    infile.close()
1 Answer
Here, a simple expression with a capturing group would do, something similar to ^(\[.+\])$ used with the multiline flag:
Test
    # coding=utf8
    # the above tag defines encoding for this document and is for Python 2.x compatibility
    import re

    regex = r"^(\[.+\])$"

    test_str = ("[ and ends with ]\n"
        " [ and ends with ]")

    matches = re.finditer(regex, test_str, re.MULTILINE)

    for matchNum, match in enumerate(matches, start=1):
        print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))

        for groupNum in range(0, len(match.groups())):
            groupNum = groupNum + 1
            print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))

    # Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
DEMO
RegEx
If this expression isn’t what you want, it can be modified on regex101.com.
RegEx Circuit
jex.im visualizes regular expressions.
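For reading the matching lines out of a larger text, a shorter sketch with re.findall may be enough. The sample text here is made up, and the pattern drops the capturing group so that findall returns the whole line:

```python
import re

# Made-up sample text standing in for the contents of the file.
text = "[alpha]\nnot this\n[beta] trailing\n[gamma]"

# With re.MULTILINE, ^ and $ match at the start and end of every line,
# not only at the start and end of the whole string.
bracket_lines = re.findall(r"^\[.+\]$", text, re.MULTILINE)

print(bracket_lines)
```

Without the flag, ^ and $ anchor only to the ends of the whole string, so a pattern like this cannot pick out individual lines from the output of read().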
Efficiently get all lines starting with given string for a large text file
I have a large text file with around 700k lines. For a given string, I would like to be able to efficiently find all lines in the file that start with the string. I would like to query it repeatedly and so each query should be fast and I am not so concerned about a larger set up time initially. I’m guessing that I could do this more efficiently by transforming the file so that the lines are already in alphabetical order? If so what’s a good way to do this? Or is there a different data structure I could consider? Once the data has been prepared, what is an efficient way to search? I would be comfortable doing something basic with regular expressions or reading line by line and testing the line start, but both of these solutions seem slack? It seems like there should be a well understood algorithm for this kind of thing?
This is probably a question for Computer Science Stackexchange rather than here. The best data structure and algorithm for this probably doesn’t depend (mainly) on what programming language you’re using.
1 Answer
There are two questions I need to ask before giving you the best solution:
- Is the text in lexicographical order?
- If not, how much accuracy is in the alphabetical order? (how many characters in a line until mistakes can happen in the sorting)
If your file is in lexicographical order, you’re in luck. You’ll be able to use a modification of a binary search to narrow down the lines that start with your given string.
If your file is only in alphabetical order, you can narrow it down like the first solution only until it’s "out of accuracy". After that, you’ll sadly need to search those lines one by one.
Here is my best attempt at fitting code:
    lines = []        # placeholder: the lines of the file, in sorted order
    givenstring = ""  # placeholder: the prefix to search for
    low = 0
    high = len(lines)
    i = 0
    firstinstance = 0  # initialized so the name exists even if no match is found
    lastinstance = len(lines)
    while i < len(givenstring)-1:
        #Finding the first instance:
        while low < high:
            mid = (low+high)//2
            if (mid == 0 or ord(givenstring[i]) >ord(lines[mid-1][i])) and ord(lines[mid][i]) == ord(givenstring[i]):
                firstinstance = mid
                break
            elif ord(givenstring[i]) > ord(lines[mid][i]):
                low = mid + 1
            else:
                high = mid
        #Finding the last instance:
        low = firstinstance
        high = lastinstance
        while low < high:
            mid = (low+high)//2
            if (mid == len(lines)-1 or ord(givenstring[i]) < ord(lines[mid+1][i])) and ord(lines[mid][i]) == ord(givenstring[i]):
                lastinstance = mid
                break
            elif ord(givenstring[i]) >ord(lines[mid][i]):
                low = mid + 1
            else:
                high = mid
        low = firstinstance
        high = lastinstance
        i += 1
    print(firstinstance)
    print(lastinstance)
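As an alternative sketch of the same binary-search idea, the standard library's bisect module can find both ends of the matching range. This assumes the lines have been sorted once with Python's default string ordering (for example with sorted()), and that the last character of the prefix is not the highest possible code point. The helper name lines_with_prefix and the sample data are hypothetical:

```python
from bisect import bisect_left

def lines_with_prefix(sorted_lines, prefix):
    """Return the slice of sorted_lines whose entries start with prefix."""
    if not prefix:
        return list(sorted_lines)
    # First position whose line is >= prefix.
    lo = bisect_left(sorted_lines, prefix)
    # Smallest string that sorts after every string starting with prefix:
    # the prefix with its last character bumped to the next code point.
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    hi = bisect_left(sorted_lines, upper, lo)
    return sorted_lines[lo:hi]

# One-time setup: sort the lines (made-up data standing in for the file).
lines = sorted(["banana", "apple pie", "apricot", "apple", "cherry", "applesauce"])

result = lines_with_prefix(lines, "app")
print(result)
```

Each query costs two binary searches plus the size of the result, so repeated lookups stay fast once the one-time sort is done.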