Python: checking whether two files are the same

See if two files are the same file

The requirement: to know when two files are the same file. This is not the same as testing their contents / size / checksum for equality; rather it is seeing if, in spite of their apparently being on different drives or in different directories, they are in fact the same file, obscured by drive mappings or hard links.

This was surprisingly difficult to work out: you’d have thought there would have been a FilesAreIdentical API call of some sort. But searching turned nothing up. (This is where I get 25 emails from knowledgeable people telling me that there is one, after all). The technique I’m going to use here, though, comes with a Microsoft seal of approval. It uses the GetFileInformationByHandle API call to return a volume serial number and a file index number, valid while the file is open. Note that last point: if you open one file, get its index, and close it, and then open a completely unrelated file, to get its index, you might get the same index both times even though the files are not the same.

I’ve broken the code down slightly into three functions just because of the long-windedness of the two API calls used. It could all be done in one place, obviously. This code has been tested on SUBST drives, network-mapped drives, hard links and mounted volumes.


Update: Richard Philips points out that if you specify the FILE_FLAG_BACKUP_SEMANTICS option to CreateFile, you can use the same code to check identity for directories as well as for files.

    import os, sys
    import tempfile
    import win32file


    def get_read_handle(filename):
        # A directory handle can only be obtained with FILE_FLAG_BACKUP_SEMANTICS.
        if os.path.isdir(filename):
            dwFlagsAndAttributes = win32file.FILE_FLAG_BACKUP_SEMANTICS
        else:
            dwFlagsAndAttributes = 0
        return win32file.CreateFile(
            filename,
            win32file.GENERIC_READ,
            win32file.FILE_SHARE_READ,
            None,
            win32file.OPEN_EXISTING,
            dwFlagsAndAttributes,
            None,
        )


    def get_unique_id(hFile):
        (
            attributes,
            created_at, accessed_at, written_at,
            volume,
            file_hi, file_lo,
            n_links,
            index_hi, index_lo,
        ) = win32file.GetFileInformationByHandle(hFile)
        return volume, index_hi, index_lo


    def files_are_equal(filename1, filename2):
        # Both handles stay open while the ids are compared; the file
        # index is only guaranteed to be valid while the file is open.
        hFile1 = get_read_handle(filename1)
        hFile2 = get_read_handle(filename2)
        are_equal = (get_unique_id(hFile1) == get_unique_id(hFile2))
        hFile2.Close()
        hFile1.Close()
        return are_equal


    #
    # This bit of the example will only work on Win2k+; it
    # was the only way I could reasonably produce two different
    # files which were the same file, without knowing anything
    # about your drives, network etc.
    #
    filename1 = sys.executable
    filename2 = tempfile.mktemp(".exe")
    win32file.CreateHardLink(filename2, filename1, None)
    print(filename1, filename2, files_are_equal(filename1, filename2))
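For readers on Python 3.2 or later, the standard library's os.path.samefile is documented to work on Windows as well as Unix, and it answers the same question by comparing the device and file-index numbers that os.stat() reports. What follows is a minimal sketch rather than a replacement for the code above: it assumes a filesystem that supports hard links through os.link, and the helper name same_file and the file names are made up for illustration.

```python
import os
import shutil
import tempfile


def same_file(path1, path2):
    """Return True if both paths refer to the same underlying file.

    Hypothetical helper: a thin wrapper around os.path.samefile, which
    compares the device and inode (file index) numbers from os.stat().
    """
    return os.path.samefile(path1, path2)


# Demonstration in a throwaway directory (file names are illustrative).
with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "original.txt")
    hard_link = os.path.join(tmp, "hard_link.txt")
    copy = os.path.join(tmp, "copy.txt")

    with open(original, "w") as f:
        f.write("some content")

    os.link(original, hard_link)  # a second name for the same file
    shutil.copy(original, copy)   # a separate file with equal contents

    link_is_same = same_file(original, hard_link)
    copy_is_same = same_file(original, copy)

print("hard link is the same file:", link_is_same)
print("copy is the same file:", copy_is_same)
```

The hard link should be reported as the same file and the copy as a different one, even though the copy's contents are identical.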


How to compare files in Python

The filecmp module in Python can be used to compare files and directories.

cmp(f1, f2[, shallow])

Compares the files f1 and f2 and returns True if they appear identical, False if not. If shallow is not provided (or is True), files that have the same os.stat() signature (file type, size, and modification time) are considered equal without their contents being read.

cmpfiles(dir1, dir2, common[, shallow])

Compares the files named in the list common in the two directories dir1 and dir2. cmpfiles returns a tuple containing three lists of filenames: match, mismatch, errors.

  • match: lists the files that are the same in both directories.
  • mismatch: lists the files that don't match.
  • errors: lists the files that could not be compared for some reason.

dircmp(dir1, dir2 [, ignore[, hide]])

Creates a directory comparison object that can be used to perform various comparison operations on the directories dir1 and dir2.

  • ignore: a list of filenames to ignore; defaults to filecmp.DEFAULT_IGNORES, which includes 'RCS', 'CVS' and 'tags'.
  • hide: a list of filenames to hide; defaults to [os.curdir, os.pardir] (['.', '..'] on UNIX).

Instances of filecmp.dircmp implement the following methods, which print detailed reports to sys.stdout:

  • report(): Prints a comparison between the two directories.
  • report_partial_closure(): Prints a comparison of the two directories as well as of the immediate subdirectories of the two directories.
  • report_full_closure(): Prints a comparison of the two directories, all of their subdirectories, all the subdirectories of those subdirectories, and so on (i.e., recursively).

They also provide the following attributes:

  • left_list: files and subdirectories found in dir1, not including names filtered out by hide and ignore.
  • right_list: files and subdirectories found in dir2, not including names filtered out by hide and ignore.
  • common: files and subdirectories that are in both dir1 and dir2.
  • left_only: files and subdirectories that are in dir1 only.
  • right_only: files and subdirectories that are in dir2 only.
  • common_dirs: subdirectories that are in both dir1 and dir2.
  • common_files: files that are in both dir1 and dir2.
  • same_files: files in both dir1 and dir2 that compare as identical.
  • diff_files: files that are in both dir1 and dir2 but whose contents differ.
  • funny_files: files that are in both dir1 and dir2 but could not be compared for some reason.
  • subdirs: a dictionary that maps names in common_dirs to dircmp objects.

Preparing test data for comparison.

    import os


    # prepare test data
    def makefile(filename, text=None):
        """Create a file; its content defaults to its own name."""
        with open(filename, 'w') as f:
            f.write(text or filename)


    # prepare test data
    def makedirectory(directory_name):
        """Create directory_name (if missing) and fill it with test data."""
        if not os.path.exists(directory_name):
            os.mkdir(directory_name)

        # Remember the current working directory
        present_directory = os.getcwd()

        # Change to the directory provided
        os.chdir(directory_name)

        # Make two directories
        os.mkdir('dir1')
        os.mkdir('dir2')

        # Make a subdirectory with the same name in each directory
        os.mkdir('dir1/common_dir')
        os.mkdir('dir2/common_dir')

        # Make a subdirectory that exists in only one directory
        os.mkdir('dir1/dir_only_in_dir1')
        os.mkdir('dir2/dir_only_in_dir2')

        # Make a unique file in each directory
        makefile('dir1/file_only_in_dir1')
        makefile('dir2/file_only_in_dir2')

        # Make a file with the same name and the same content in each directory
        makefile('dir1/common_file', 'Hello, Writing Same Content')
        makefile('dir2/common_file', 'Hello, Writing Same Content')

        # Make a file with the same name but different content in each directory
        makefile('dir1/not_the_same')
        makefile('dir2/not_the_same')

        # Same name, but a file in dir1 and a directory in dir2
        makefile('dir1/file_in_dir1', 'This is a file in dir1')
        os.mkdir('dir2/file_in_dir1')

        os.chdir(present_directory)


    if __name__ == '__main__':
        makedirectory('example')
        makedirectory('example/dir1/common_dir')
        makedirectory('example/dir2/common_dir')

Running the filecmp.cmp() example. The shallow argument tells cmp() whether to look at the contents of the file, in addition to its metadata.

The default is to perform a shallow comparison using the information available from os.stat(). If the stat signatures (file type, size, and modification time) are the same, the files are considered the same. Thus, files of the same size that were last modified at the same time are reported as the same, even if their contents differ.

When shallow is False, the contents of the file are always compared.

    import filecmp

    print('Output \n *** Common File :', end=' ')
    print(filecmp.cmp('example/dir1/common_file',
                      'example/dir2/common_file'), end=' ')
    print(filecmp.cmp('example/dir1/common_file',
                      'example/dir2/common_file', shallow=False))

    print(' *** Different Files :', end=' ')
    print(filecmp.cmp('example/dir1/not_the_same',
                      'example/dir2/not_the_same'), end=' ')
    print(filecmp.cmp('example/dir1/not_the_same',
                      'example/dir2/not_the_same', shallow=False))

    print(' *** Identical Files :', end=' ')
    print(filecmp.cmp('example/dir1/file_only_in_dir1',
                      'example/dir1/file_only_in_dir1'), end=' ')
    print(filecmp.cmp('example/dir1/file_only_in_dir1',
                      'example/dir1/file_only_in_dir1', shallow=False))

Output

    *** Common File : True True
    *** Different Files : False False
    *** Identical Files : True True
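The case where a shallow comparison reports two different files as equal can be reproduced deliberately. The sketch below is only an illustration under stated assumptions: the file names and the fixed timestamp are arbitrary, and os.utime is used to force both files to the same modification time so that their stat signatures match.

```python
import filecmp
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path_a = os.path.join(tmp, "a.txt")
    path_b = os.path.join(tmp, "b.txt")

    # Same length, different content.
    with open(path_a, "w") as f:
        f.write("aaaa")
    with open(path_b, "w") as f:
        f.write("bbbb")

    # Force identical access and modification times on both files
    # (the timestamp value itself is arbitrary).
    fixed_time = 1_000_000_000
    os.utime(path_a, (fixed_time, fixed_time))
    os.utime(path_b, (fixed_time, fixed_time))

    # Default (shallow) comparison: only the stat signature is checked.
    shallow_result = filecmp.cmp(path_a, path_b)
    # Deep comparison: the file contents are read and compared.
    deep_result = filecmp.cmp(path_a, path_b, shallow=False)

print("shallow:", shallow_result)
print("deep:   ", deep_result)
```

Because both files are regular files with the same size and modification time, the shallow comparison treats them as equal, while shallow=False detects the difference. When correctness matters more than speed, passing shallow=False is the safer choice.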

Use cmpfiles() to compare a set of files in two directories without recursing.

    import filecmp
    import os

    # Determine the items that exist in both directories.
    dir1_contents = set(os.listdir('example/dir1'))
    dir2_contents = set(os.listdir('example/dir2'))
    common = list(dir1_contents & dir2_contents)
    common_files = [
        f for f in common
        if os.path.isfile(os.path.join('example/dir1', f))
    ]
    print(f' *** Common files are : {common_files}')

    # Now, let us compare the directories
    match, mismatch, errors = filecmp.cmpfiles(
        'example/dir1',
        'example/dir2',
        common_files,
    )
    print(f' *** Matched files are : {match}')
    print(f' *** mismatch files are : {mismatch}')
    print(f' *** errors files are : {errors}')

Output

    *** Common files are : ['file_in_dir1', 'not_the_same', 'common_file']
    *** Matched files are : ['common_file']
    *** mismatch files are : ['file_in_dir1', 'not_the_same']
    *** errors files are : []

Use dircmp() to compare two directories and print reports.

    import filecmp

    dc = filecmp.dircmp('example/dir1', 'example/dir2')
    print("output \n *** Printing detaile report: \n ")
    # report() prints directly and returns None, so wrapping it in
    # print() also emits a trailing "None".
    print(dc.report())
    print("\n")
    print(dc.report_full_closure())

Output

    *** Printing detaile report:

    diff example/dir1 example/dir2
    Only in example/dir1 : ['dir_only_in_dir1', 'file_only_in_dir1']
    Only in example/dir2 : ['dir_only_in_dir2', 'file_only_in_dir2']
    Identical files : ['common_file']
    Differing files : ['not_the_same']
    Common subdirectories : ['common_dir']
    Common funny cases : ['file_in_dir1']
    None

    diff example/dir1 example/dir2
    Only in example/dir1 : ['dir_only_in_dir1', 'file_only_in_dir1']
    Only in example/dir2 : ['dir_only_in_dir2', 'file_only_in_dir2']
    Identical files : ['common_file']
    Differing files : ['not_the_same']
    Common subdirectories : ['common_dir']
    Common funny cases : ['file_in_dir1']

    diff example/dir1\common_dir example/dir2\common_dir
    Common subdirectories : ['dir1', 'dir2']

    diff example/dir1\common_dir\dir1 example/dir2\common_dir\dir1
    Identical files : ['common_file', 'file_in_dir1', 'file_only_in_dir1', 'not_the_same']
    Common subdirectories : ['common_dir', 'dir_only_in_dir1']

    diff example/dir1\common_dir\dir1\common_dir example/dir2\common_dir\dir1\common_dir

    diff example/dir1\common_dir\dir1\dir_only_in_dir1 example/dir2\common_dir\dir1\dir_only_in_dir1

    diff example/dir1\common_dir\dir2 example/dir2\common_dir\dir2
    Identical files : ['common_file', 'file_only_in_dir2', 'not_the_same']
    Common subdirectories : ['common_dir', 'dir_only_in_dir2', 'file_in_dir1']

    diff example/dir1\common_dir\dir2\common_dir example/dir2\common_dir\dir2\common_dir

    diff example/dir1\common_dir\dir2\dir_only_in_dir2 example/dir2\common_dir\dir2\dir_only_in_dir2

    diff example/dir1\common_dir\dir2\file_in_dir1 example/dir2\common_dir\dir2\file_in_dir1
    None

You can also try the other dircmp methods and attributes listed earlier to see how each one behaves.
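As one way of doing that, the sketch below reads the dircmp attributes directly instead of printing a report, and follows subdirs to walk the whole tree. It builds its own small test tree in a temporary directory; the helper names write and collect_differences and all file names are hypothetical, chosen only for this illustration.

```python
import filecmp
import os
import tempfile


def write(path, text):
    """Create parent directories as needed and write text to path."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write(text)


def collect_differences(dc, prefix=""):
    """Recursively gather (left_only, right_only, differing) relative paths."""
    left = [os.path.join(prefix, name) for name in dc.left_only]
    right = [os.path.join(prefix, name) for name in dc.right_only]
    differing = [os.path.join(prefix, name) for name in dc.diff_files]
    # subdirs maps each common subdirectory name to its own dircmp object.
    for name, sub_dc in dc.subdirs.items():
        sub_left, sub_right, sub_diff = collect_differences(
            sub_dc, os.path.join(prefix, name))
        left += sub_left
        right += sub_right
        differing += sub_diff
    return left, right, differing


with tempfile.TemporaryDirectory() as tmp:
    left_dir = os.path.join(tmp, "left")
    right_dir = os.path.join(tmp, "right")

    write(os.path.join(left_dir, "same.txt"), "same")
    write(os.path.join(right_dir, "same.txt"), "same")
    # Differing files get different sizes so that the default shallow
    # comparison cannot mistake them for equal.
    write(os.path.join(left_dir, "changed.txt"), "short")
    write(os.path.join(right_dir, "changed.txt"), "a much longer text")
    write(os.path.join(left_dir, "only_left.txt"), "x")
    write(os.path.join(left_dir, "sub", "nested.txt"), "one")
    write(os.path.join(right_dir, "sub", "nested.txt"), "three!")
    write(os.path.join(right_dir, "sub", "only_right.txt"), "y")

    dc = filecmp.dircmp(left_dir, right_dir)
    identical_at_top = sorted(dc.same_files)
    left_only, right_only, differing = (
        sorted(paths) for paths in collect_differences(dc))

print("identical at top level:", identical_at_top)
print("only in left:", left_only)
print("only in right:", right_only)
print("differing:", differing)
```

Working from the attributes rather than the printed report makes the result easy to reuse, for example to decide which files to copy when synchronising two directories.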


Compare two files using Hashing in Python

In this article, we create a program that determines whether two files provided to it are the same. Here, "the same" means that their contents are identical (ignoring any metadata). We use cryptographic hashes for this purpose. A cryptographic hash function takes input data and produces a fixed-size output that is, for practical purposes, unique to that data. We use this property to compute a fingerprint of each file's contents, and then compare the two fingerprints to decide whether the files are the same.

Note: The probability of two different data sets producing the same hash is extremely low. In addition, good cryptographic hash functions are designed so that a collision can in practice only happen by accident: deliberately constructing one is computationally infeasible.

We use SHA256 (Secure Hash Algorithm, 256-bit) as the hash function in this program. SHA256 is very resistant to collisions. We use the sha256() function from the hashlib library, which provides an implementation of the algorithm in Python.
The hashlib module is part of the Python standard library, so it does not need to be installed separately.

Below is the implementation.
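A minimal sketch of the approach described above, not a definitive implementation. The function names sha256_of_file and files_have_same_content, the chunk size, and the temporary test files are all assumptions made for this example. The files are read in binary mode, in chunks, so that large files do not have to fit in memory.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA256 hex digest of the file's contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_have_same_content(path1, path2):
    """Compare two files by the SHA256 digests of their contents."""
    return sha256_of_file(path1) == sha256_of_file(path2)


# Demonstration with throwaway files (names and contents are illustrative).
with tempfile.TemporaryDirectory() as tmp:
    file1 = os.path.join(tmp, "file1.txt")
    file2 = os.path.join(tmp, "file2.txt")
    file3 = os.path.join(tmp, "file3.txt")

    for path, text in [(file1, "hello world\n"),
                       (file2, "hello world\n"),
                       (file3, "hello there\n")]:
        with open(path, "w") as f:
            f.write(text)

    same_result = files_have_same_content(file1, file2)
    different_result = files_have_same_content(file1, file3)

print("file1 vs file2:", same_result)
print("file1 vs file3:", different_result)
```

Only the contents are hashed, so two files with different names, locations or timestamps still compare as the same when their bytes are identical.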