Python pattern matching file

Pattern matching text in a file?

I have an input file that looks as follows (input file link) and need to create an output file that looks like this (output file link). I started with the code below, but the error handling and pattern matching are messing up the logic (especially the occurrences of : in the URL as well as in the data). Also, the average in the output file is the average across the non-zero or non-null values.

with open("input.txt") as f: next(f) # skips header for line in f: cleanline = re.sub('::',':',line) # handles the two :: case newline = re.split("[\t:]",cleanline) #splits on either tab or : print newline x=0 total=0 for i in range(3,7): if newline[i] <> 0 or newline[i] != None: x+=1 total+=total avg=total/x print avg 

I'd suggest using the csv module to read and write the files. It can handle a lot of the special cases for you. I'd also suggest you edit your question and show a longer example of an input file with all the special cases in it, along with what the expected output should look like.
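For illustration, here is a minimal sketch of reading and writing tab-separated files with csv; the tab delimiter and the pass-through write are assumptions, since the real input layout is only shown in the linked files:

import csv

with open("input.txt") as in_file, open("output.txt", "w") as out_file:
    reader = csv.reader(in_file, delimiter="\t")
    writer = csv.writer(out_file, delimiter="\t")
    next(reader)  # skip the header row
    for row in reader:
        # each row arrives already split on tabs, so colons inside fields are left untouched
        writer.writerow(row)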

I added images of the input and output files to explain what I am trying to do better. Thanks for suggesting that; the post was too confusing.

Actually, what I meant was for you to cut lines from both files and paste them into your question (indented by 4 spaces). That way someone could use the first to run your code, and the second to check the results.

1 Answer

I would suggest you approach this from a different angle. First, split each line along the tabs and then validate each entry individually. This lets you compile a regular expression for each entry and produce more precise error messages. A nice way to do this is with tuple unpacking and the split method:

from __future__ import print_function

with open("input.txt") as in_file, open("output.txt", 'w') as out_file:
    next(in_file)  # skips header
    for line in in_file:
        error_message = []
        # remove line break character and split along the tabs
        id_and_date, user_id, p1, p2, p3, p4, url = line.strip("\n").split("\t")
        # split the first entry at the first :
        split_id_date = id_and_date.split(":", 1)
        if len(split_id_date) == 2:
            order_id, date = split_id_date
        elif len(split_id_date) == 1:
            # assume this is the order id
            # or do something
            order_id, date = (split_id_date[0], "")
            error_message.append("Invalid Date")
        else:
            # set default values if nothing is present
            order_id, date = ("", "")
        # validate order_id and date here using re.match
        # add errors to error_message list:
        # error_message.append("Invalid Date")
        # calculate average price
        # first, compile a list of the non-zero prices
        nonzero_prices = [int(x) for x in (p1, p2, p3, p4) if int(x) > 0]  # this can be done more efficient
        # compute the average price
        avg_price = sum(nonzero_prices) / len(nonzero_prices)
        # validate url using re here
        # handle errors as above
        print("\t".join([order_id, date, user_id, str(avg_price), url, ", ".join(error_message)]), file=out_file)

I did not add the re calls to validate the entries, as I do not know what exactly you expect to see in them. However, I added a comment where a call to re.match or something similar would be reasonable.
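For example, if the date were expected to look like YYYY-MM-DD (an assumption; the real format isn't shown), the validation comment inside the loop could be filled in like this, using the date and error_message names from the code above:

import re

DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}$")  # assumed date format: YYYY-MM-DD

if not DATE_RE.match(date):
    error_message.append("Invalid Date")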


fnmatch — Unix filename pattern matching

This module provides support for Unix shell-style wildcards, which are not the same as regular expressions (which are documented in the re module). The special characters used in shell-style wildcards are:

* matches everything

? matches any single character

[seq] matches any character in seq

[!seq] matches any character not in seq

For a literal match, wrap the meta-characters in brackets. For example, '[?]' matches the character '?'.
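A few quick checks illustrating these wildcards:

import fnmatch

fnmatch.fnmatch('data.txt', '*.txt')            # True: * matches everything before .txt
fnmatch.fnmatch('a.txt', '?.txt')               # True: ? matches the single character 'a'
fnmatch.fnmatch('file1.log', 'file[12].log')    # True: [12] matches '1' or '2'
fnmatch.fnmatch('file3.log', 'file[!12].log')   # True: [!12] matches any character except '1' or '2'
fnmatch.fnmatch('what?.txt', 'what[?].txt')     # True: [?] matches a literal '?'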

Note that the filename separator ( ‘/’ on Unix) is not special to this module. See module glob for pathname expansion ( glob uses filter() to match pathname segments). Similarly, filenames starting with a period are not special for this module, and are matched by the * and ? patterns.

Also note that functools.lru_cache() with the maxsize of 32768 is used to cache the compiled regex patterns in the following functions: fnmatch() , fnmatchcase() , filter() .

fnmatch.fnmatch(filename, pattern)

Test whether the filename string matches the pattern string, returning True or False . Both parameters are case-normalized using os.path.normcase() . fnmatchcase() can be used to perform a case-sensitive comparison, regardless of whether that’s standard for the operating system.

This example will print all file names in the current directory with the extension .txt :

import fnmatch
import os

for file in os.listdir('.'):
    if fnmatch.fnmatch(file, '*.txt'):
        print(file)

fnmatch.fnmatchcase(filename, pattern)

Test whether filename matches pattern, returning True or False; the comparison is case-sensitive and does not apply os.path.normcase().
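For example:

import fnmatch

fnmatch.fnmatchcase('README.TXT', '*.txt')  # False: the comparison is case-sensitive
fnmatch.fnmatchcase('readme.txt', '*.txt')  # True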

fnmatch.filter(names, pattern)

Construct a list from those elements of the iterable names that match pattern. It is the same as [n for n in names if fnmatch(n, pattern)] , but implemented more efficiently.
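For example:

import fnmatch

names = ['readme.txt', 'setup.py', 'notes.txt']
fnmatch.filter(names, '*.txt')  # ['readme.txt', 'notes.txt']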

fnmatch.translate(pattern)

Return the shell-style pattern converted to a regular expression for use with re.match().

>>> import fnmatch, re
>>>
>>> regex = fnmatch.translate('*.txt')
>>> regex
'(?s:.*\\.txt)\\Z'
>>> reobj = re.compile(regex)
>>> reobj.match('foobar.txt')
<re.Match object; span=(0, 10), match='foobar.txt'>

See also: Module glob — Unix shell-style path expansion.


How to match a file name in a file using Python

How do I find out if two files exist with the same pattern inside a file? If all filenames come in two-file sets (csv.new and csv), then go ahead to the next step; otherwise exit with an error message. The prefix "abc_package" will have two files: one with the extension "csv.new" and a second file with the extension "csv". There could be many filenames inside "list_of_files.txt". Ex: List_of_files.txt

abc_package.1406728501.csv.new
abc_package.1406728501.csv
abc_package.1406724901.csv.new
abc_package.1406724901.csv

Are the pairs always adjacent? And by "many" do you mean hundreds of millions (too much to fit in memory), or just like 50000?

Yes, the file will be in the given format. I meant around 15-20 filenames in the file, not just the 4 given in the example.

3 Answers

For matching the file name in Python you can use the fnmatch module. I will provide you with sample code from the documentation.

import fnmatch
import os

for file in os.listdir('.'):
    if fnmatch.fnmatch(file, '*.txt'):
        print file

The syntax would be fnmatch.fnmatchcase(filename, pattern)

Please have a look here for more examples

with open("in.txt","r") as fo: f = fo.readlines() cs_new = set() cs = set() for ele in f: ele = ele.rstrip() if not ele.endswith(".new"): cs.add(ele) else: cs_new.add(ele.split(".new")[0]) diff = cs ^ cs_new for fi in diff: print fi 

As you need either filename you will need to check for the existence against both lists:

with open("in.txt","r") as f: f = [x.rstrip() for x in f] cs, cs_new, diff = [],[],[] for ind, ele in enumerate(f): if ele.endswith(".csv"): cs.append(ele) else: cs_new.append([ele.split(".new")[0],ind]) # keep track of original element in with the ind/index for ele in cs: if not any(ele in x for x in cs_new): diff.append(ele) for ele in cs_new: if not any(ele[0] in x for x in cs): diff.append(f[ele[1]]) # append original element with full extension 

Thanks so much. It worked as expected; the only thing is, how do I print the file names that do not have a matching file in the error message?

Why create an empty set and update it instead of just doing se = set(cs) ? For that matter, why not use a set comprehension in the first place, unless working in pre-2.7 Python is a requirement?

After I updated the code as below it worked, but not in all cases, for example when adding these 2 filenames: abc_package.1416728501.csv.new and abc_package.1426728501.csv.

me = set()
me.update(csn)
print se.symmetric_difference(me)

Assuming the file isn’t so ridiculously huge that you can’t fit it into memory, just create a set of all .csv.new files and a set of all .csv files and verify that they’re identical. For example:

csvfiles = set()
newfiles = set()
with open('List_of_files.txt') as f:
    for line in f:
        line = line.rstrip()
        if line.endswith('.csv.new'):
            newfiles.add(line[:-4])
        elif line.endswith('.csv'):
            csvfiles.add(line)

if csvfiles != newfiles:
    raise ValueError('Mismatched files!')

If you want to know which files were mismatched, csvfiles - newfiles gives you the .csv files without a corresponding .csv.new, and newfiles - csvfiles gives you the opposite.

(There are ways to make this cleaner and more readable, from using os.path.splitext to using a general partition-an-iterable-by-filter function, but I think this should be the easiest for a novice to immediately understand.)
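As a rough sketch of the os.path.splitext variant mentioned above (assuming every relevant name ends in exactly .csv or .csv.new), which also reports the names that are missing a partner:

import os

csvfiles, newfiles = set(), set()
with open('List_of_files.txt') as f:
    for line in f:
        name = line.rstrip()
        base, ext = os.path.splitext(name)  # 'x.csv.new' -> ('x.csv', '.new')
        if ext == '.new':
            newfiles.add(base)               # stored as 'x.csv'
        elif ext == '.csv':
            csvfiles.add(name)

missing_new = csvfiles - newfiles  # .csv files without a .csv.new partner
missing_csv = newfiles - csvfiles  # .csv.new files without a .csv partner
if missing_new or missing_csv:
    raise ValueError('Mismatched files: %s' % sorted(missing_new | missing_csv))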


How do I search directories and find files that match a regex?

I recently started getting into Python and I am having a hard time searching through directories and matching files based on a regex that I have created. Basically I want it to scan through all the directories in another directory, find all the files that end with .zip, .rar or .r01, and then run various commands based on which file it is.

import os, re

rootdir = "/mnt/externa/Torrents/completed"

for subdir, dirs, files in os.walk(rootdir):
    if re.search('(w?.zip)|(w?.rar)|(w?.r01)', files):
        print "match: " . files

w? optionally matches a literal w, and . matches any character, including a dot. And without anchors, you match "a.rar.txt". To match zip or rar at the end, try: r'(\.zip|\.rar)$'

4 Answers

import os
import re

rootdir = "/mnt/externa/Torrents/completed"
regex = re.compile('(.*zip$)|(.*rar$)|(.*r01$)')

for root, dirs, files in os.walk(rootdir):
    for file in files:
        if regex.match(file):
            print(file)

The code below answers the question in the following comment:

That worked really well, is there a way to do this if match is found on regex group 1 and do this if match is found on regex group 2 etc ? – nillenilsson

import os
import re

regex = re.compile('(.*zip$)|(.*rar$)|(.*r01$)')
rx = '(.*zip$)|(.*rar$)|(.*r01$)'

for root, dirs, files in os.walk("../Documents"):
    for file in files:
        res = re.match(rx, file)
        if res:
            if res.group(1):
                print("ZIP", file)
            if res.group(2):
                print("RAR", file)
            if res.group(3):
                print("R01", file)

It might be possible to do this in a nicer way, but this works.
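One possibly tidier variant (a sketch, not a drop-in replacement) maps each extension to its label with os.path.splitext instead of inspecting match groups:

import os

labels = {'.zip': 'ZIP', '.rar': 'RAR', '.r01': 'R01'}
for root, dirs, files in os.walk("../Documents"):
    for name in files:
        ext = os.path.splitext(name)[1].lower()
        if ext in labels:
            print(labels[ext], name)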


Thanks a lot. I tried this pattern: regex = re.compile('(w.*rar)|(s.*txt)|(\.py$)', flags=re.IGNORECASE). It works for the first two groups but doesn't return the py files.

Given that you are a beginner, I would recommend using glob in place of a quickly written file-walking-regex matcher.

Snippets of functions using glob and a file-walking-regex matcher

The below snippet contains two file-regex searching functions (one using glob and the other using a custom file-walking-regex matcher). The snippet also contains a "stopwatch" function to time the two functions.

import os
import re
import glob
from datetime import timedelta
from timeit import time


def stopwatch(method):
    def timed(*args, **kw):
        ts = time.perf_counter()
        result = method(*args, **kw)
        te = time.perf_counter()
        duration = timedelta(seconds=te - ts)
        print(f"{method.__name__}: {duration}")
        return result
    return timed


@stopwatch
def get_filepaths_with_oswalk(root_path: str, file_regex: str):
    files_paths = []
    pattern = re.compile(file_regex)
    for root, directories, files in os.walk(root_path):
        for file in files:
            if pattern.match(file):
                files_paths.append(os.path.join(root, file))
    return files_paths


@stopwatch
def get_filepaths_with_glob(root_path: str, file_regex: str):
    return glob.glob(os.path.join(root_path, file_regex))

Comparing runtimes of the above functions

On using the above two functions to find 5076 files matching the regex filename_*.csv in a dir called root_path (containing 66,948 files):

>>> glob_files = get_filepaths_with_glob(root_path, 'filename_*.csv')
get_filepaths_with_glob: 0:00:00.176400
>>> oswalk_files = get_filepaths_with_oswalk(root_path, 'filename_(.*).csv')
get_filepaths_with_oswalk: 0:03:29.385379

The glob method is much faster and the code for it is shorter.
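Note that glob.glob(os.path.join(root_path, pattern)) only searches root_path itself, while os.walk also descends into subdirectories. If a recursive search is needed, glob supports it on Python 3.5+ via the ** wildcard (root_path here is the same directory variable used above):

import glob
import os

# recursive=True lets '**' match any number of intermediate directories
files = glob.glob(os.path.join(root_path, '**', 'filename_*.csv'), recursive=True)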

For your case

For your case, you can probably use something like the following to get your *.zip , *.rar and *.r01 files:

files = []
for ext in ['*.zip', '*.rar', '*.r01']:
    files += get_filepaths_with_glob(root_path, ext)

