Find files in a directory containing desired string in Python
I’m trying to find a string in the files contained within a directory. I have a string like banana that I know exists in a few of the files.
import os
import sys

user_input = input("What is the name of your directory?")
directory = os.listdir(user_input)

searchString = input("What word are you trying to find?")

for fname in directory: # change directory as needed
    if searchString in fname:
        f = open(fname,'r')
        print('found string in file %s') %fname
    else:
        print('string not found')
When the program runs, it just outputs string not found for every file. There are three files that contain the word banana, so the program isn’t working as it should. Why isn’t it finding the string in the files?
alright, pasted it. i just added a print fname on the for loop, and i got this output:
What is the name of your directory?example
What word are you trying to find?banana
1.txt
string not found
2.txt
string not found
3.txt
string not found
5 Answers
You are searching for the string in the file name, not in the file's contents. Use open(filename, 'r').read() to read the contents:
import os

user_input = input('What is the name of your directory')
directory = os.listdir(user_input)

searchstring = input('What word are you trying to find?')

for fname in directory:
    if os.path.isfile(user_input + os.sep + fname):
        # Full path
        f = open(user_input + os.sep + fname, 'r')

        if searchstring in f.read():
            print('found string in file %s' % fname)
        else:
            print('string not found')
        f.close()
We use user_input + os.sep + fname to get the full path.
os.listdir returns the names of both files and directories, so we use os.path.isfile to keep only the files.
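The two points above (build the full path, skip anything that is not a file) can also be wrapped in a small function that returns the matching names instead of printing them. This is only a sketch under assumptions: the name files_containing is made up for illustration, the files are assumed to be text, and bytes that cannot be decoded are ignored rather than raising an error.

```python
import os


def files_containing(directory, searchstring):
    """Return the names of regular files in `directory` whose text contains `searchstring`.

    Hypothetical helper for illustration; it is not part of the answer above.
    """
    matches = []
    for fname in sorted(os.listdir(directory)):
        # Same idea as user_input + os.sep + fname: build the full path.
        full_path = os.path.join(directory, fname)
        if not os.path.isfile(full_path):
            continue  # os.listdir also returns sub-directory names
        # Assumption: files are text; undecodable bytes are ignored.
        with open(full_path, 'r', errors='ignore') as f:
            if searchstring in f.read():
                matches.append(fname)
    return matches
```

With the directory from the question it would be called as files_containing('example', 'banana'). The with block closes each file even if reading fails, so no explicit f.close() is needed.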
Ohh i see i was searching for string in filename. Thanks zetysz, that makes sense. But i am getting an error: File "C:\Users\XX\Desktop\python exercises\practice.py", line 12, in
i’m not sure. for the user_input part when i ran the program, i typed ‘example’ and then for the searchstring, the string i was looking for.
Again that makes sense. I ran the program now with your edit. But it just asks me for two inputs, then it outputs nothing else.
Here is another version using the Path class from pathlib instead of os.
from pathlib import Path


def search_in_file(path, searchstring):
    with open(path, 'r') as file:
        if searchstring in file.read():
            print(f'found string in file {path}')
        else:
            print('string not found')


user_input = input('What is the name of your directory')
searchstring = input('What word are you trying to find?')

dir_content = sorted(Path(user_input).iterdir())

for path in dir_content:
    if not path.is_dir():
        search_in_file(path, searchstring)
This is my solution for the problem. It comes with the feature of also checking in sub-directories, as well as being able to handle multiple file types. It is also quite easy to add support for other ones. The downside is of course that it’s quite chunky code. But let me know what you think.
import os

import docx2txt
from pptx import Presentation
import pdfplumber


def findFiles(strings, dir, subDirs, fileContent, fileExtensions):
    # Finds all the files in 'dir' that contain one string from 'strings'.
    # Additional parameters:
    # 'subDirs': True/False : Look in sub-directories of your folder
    # 'fileContent': True/False : Also look for the strings in the file content of every file
    # 'fileExtensions': True/False : Look for a specific file extension -> 'fileContent' is ignored
    filesInDir = []
    foundFiles = []
    filesFound = 0

    if not subDirs:
        for filename in os.listdir(dir):
            if os.path.isfile(os.path.join(dir, filename).replace("\\", "/")):
                filesInDir.append(os.path.join(dir, filename).replace("\\", "/"))
    else:
        for root, subdirs, files in os.walk(dir):
            for f in files:
                if not os.path.isdir(os.path.join(root, f).replace("\\", "/")):
                    filesInDir.append(os.path.join(root, f).replace("\\", "/"))
    print(filesInDir)

    # Find files that contain the keyword
    if filesInDir:
        for file in filesInDir:
            print("Current file: "+file)
            # Define what is to be searched in
            filename, extension = os.path.splitext(file)
            if fileExtensions:
                fileText = extension
            else:
                fileText = os.path.basename(filename).lower()
                if fileContent:
                    fileText += getFileContent(file).lower()
            # Check for translations
            for string in strings:
                print(string)
                if string in fileText:
                    foundFiles.append(file)
                    filesFound += 1
                    break
    return foundFiles


def getFileContent(filename):
    '''Returns the content of a file of a supported type (list: supportedTypes)'''
    # splitext looks only at the final extension, so dots elsewhere in the path are harmless
    if os.path.splitext(filename)[1].lstrip(".") in supportedTypes:
        if filename.endswith(".pdf"):
            content = ""
            with pdfplumber.open(filename) as pdf:
                for page in pdf.pages:
                    # Guard in case extract_text() yields None for a page without text
                    content = content + (page.extract_text() or "")
            return content
        elif filename.endswith(".txt"):
            # The with block closes the file, so no explicit close() is needed
            with open(filename, 'r') as f:
                content = f.read()
            return content
        elif filename.endswith(".docx"):
            content = docx2txt.process(filename)
            return content
        elif filename.endswith(".pptx"):
            content = ""
            prs = Presentation(filename)
            for slide in prs.slides:
                for shape in slide.shapes:
                    if hasattr(shape, "text"):
                        content = content+shape.text
            return content
    # Unsupported type: return an empty string so the caller can always call .lower()
    return ""


supportedTypes = ["txt", "docx", "pdf", "pptx"]

print(findFiles(strings=["buch"], dir="C:/Users/User/Desktop/", subDirs=True, fileContent=True, fileExtensions=False))
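One possible way to make adding further file types easier is a lookup table from extension to reader function, instead of a chain of elif branches. The sketch below is not the answer's code: READERS, read_text and get_file_content are hypothetical names, and only plain text is registered so that it runs with the standard library alone. Readers for .pdf, .docx or .pptx would be registered the same way.

```python
import os


def read_text(path):
    """Read a plain-text file, ignoring bytes that cannot be decoded."""
    with open(path, 'r', errors='ignore') as f:
        return f.read()


# Hypothetical registry: maps a lowercase extension to a reader function.
# Supporting another format would mean adding one entry here.
READERS = {
    '.txt': read_text,
}


def get_file_content(path):
    """Return the text of `path`, or '' if no reader is registered for its extension."""
    extension = os.path.splitext(path)[1].lower()
    reader = READERS.get(extension)
    return reader(path) if reader else ''
```

Because unsupported extensions always yield an empty string, the caller can lower-case and search the result without checking for None first.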
Check if a given directory contains any directory in python
Essentially, I’m wondering if the top answer given to this question can be implemented in Python. I am reviewing the modules os, os.path, and shutil and I haven’t yet been able to find an easy equivalent, though I assume I’m just missing something simple. More specifically, say I have a directory A, and inside directory A is any other directory. I can call os.walk('path/to/A') and check if dirnames is empty, but I don’t want to make the program go through the entire tree rooted at A; i.e. what I’m looking for should stop and return true as soon as it finds a subdirectory. For clarity, on a directory containing files but no directories an acceptable solution will return False.
4 Answers
def folders_in(path_to_parent):
    for fname in os.listdir(path_to_parent):
        if os.path.isdir(os.path.join(path_to_parent,fname)):
            yield os.path.join(path_to_parent,fname)

print(list(folders_in("/path/to/parent")))
This prints a list of all subdirectories. If the list is empty, there are no subdirectories.
set([os.path.dirname(p) for p in glob.glob("/path/to/parent/*/*")])
Note that for a subdirectory to be counted with this method, it must contain at least one non-hidden entry.
def subfolders(path_to_parent):
    try:
        return next(os.walk(path_to_parent))[1]
    except StopIteration:
        return []
I would just do as follows:
# for example
dir_of_interest = "/tmp/a/b/c"
print(dir_of_interest in (v[0] for v in os.walk("/tmp/")))
This prints True or False, depending on whether dir_of_interest is among the directories produced by the generator. Because a generator is used, the directories to check are produced one at a time.
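To make the one-at-a-time behaviour visible, here is a rough sketch that spells out as an explicit loop what the `in` test does on the generator, and counts how many directories os.walk produced before the search stopped. The function name find_dir and the visited counter are made up for illustration. It compares path strings exactly, so the target has to be written the same way os.walk builds its paths.

```python
import os


def find_dir(target, top):
    """Walk `top` and stop at the first directory whose path equals `target`.

    Returns (found, visited), where `visited` counts the directories
    os.walk produced before the search stopped.
    """
    visited = 0
    for dirpath, _dirnames, _filenames in os.walk(top):
        visited += 1
        if dirpath == target:
            # Returning here abandons the walk; deeper directories are never listed.
            return True, visited
    return False, visited
```

If the target is found early, visited stays small; only a missing target forces the walk through the whole tree.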
You can break from the walk anytime you want. For example, this breaks as soon as the folder currently being walked has no subdirectories:
for root, dirs, files in os.walk("/tmp/"):
    print(root, len(dirs))
    if not len(dirs):
        break
Maybe this is in line with what you are after.
That's why I asked for the code you have, or the example input data (like a dummy directory structure) and the output. Give me a few minutes to think about the edited question.
@JoranBeasley Possible. It's not very clear. And looking at the other answers, it seems a guessing game is starting about what the OP wants.
@Nick When you see so many different answers and comments about a question, then either the question is not clear, or it is too broad, or there is no example in the question showing that the OP tried anything to solve it, so that we can correct the specific issue with the code.
#!/usr/local/cpython-3.4/bin/python

import glob
import os

top_of_hierarchy = '/tmp/'
#top_of_hierarchy = '/tmp/orbit-dstromberg'

pattern = os.path.join(top_of_hierarchy, '*')

for candidate in glob.glob(pattern):
    if os.path.isdir(candidate):
        print("{0} is a directory".format(candidate))
        break
else:
    # The else branch of a for loop runs only if the loop was not left via break
    print('No directories found')

# Tested on 2.6, 2.7 and 3.4
I apparently can’t comment yet; however, I wanted to update part of the answer https://stackoverflow.com/users/541038/joran-beasley gave, or at least what worked for me.
Using python3 (3.7.3), I had to modify his first code snippet as follows:
import os

def has_folders(path_to_parent):
    for fname in os.listdir(path_to_parent):
        if os.path.isdir(os.path.join(path_to_parent, fname)):
            yield os.path.join(path_to_parent, fname)

print(list(has_folders("/repo/output")))
Further progress on narrowing to "does the given directory contain any directory" results in code like:
import os

def folders_in(path_to_parent):
    for fname in os.listdir(path_to_parent):
        if os.path.isdir(os.path.join(path_to_parent, fname)):
            yield os.path.join(path_to_parent, fname)

def has_folders(path_to_parent):
    folders = list(folders_in(path_to_parent))
    return len(folders) != 0

print(has_folders("the/path/to/parent"))
The result of this code should be True or False
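Since the question asks for a check that stops at the first subdirectory, one possible variant avoids building the whole list. The sketch below swaps in os.scandir together with any(); the name has_subdirectory is made up for illustration, and the with form of os.scandir assumes Python 3.6 or later.

```python
import os


def has_subdirectory(path_to_parent):
    """Return True as soon as one sub-directory of `path_to_parent` is found."""
    with os.scandir(path_to_parent) as entries:
        # any() stops at the first entry that is a directory,
        # so the remaining entries are never examined.
        return any(entry.is_dir() for entry in entries)
```

Only the top level of the directory is scanned, so a directory that contains files but no directories yields False without anything below it being walked.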