Python hash file in directory


Python module and CLI for hashing of file system directories based on the Dirhash Standard.


andhus/dirhash-python


A lightweight Python module and CLI for computing the hash of any directory based on its files' structure and content.

  • Supports all hashing algorithms of Python's built-in hashlib module.
  • Glob/wildcard (".gitignore style") path matching for expressive filtering of files to include/exclude.
  • Multiprocessing for up to 6x speed-up.

The hash is computed according to the Dirhash Standard, which is designed to allow for consistent and collision resistant generation/verification of directory hashes across implementations.

Installation:

```
git clone git@github.com:andhus/dirhash-python.git
pip install dirhash/
```

Python usage:

```python
from dirhash import dirhash

dirpath = "path/to/directory"
dir_md5 = dirhash(dirpath, "md5")
pyfiles_md5 = dirhash(dirpath, "md5", match=["*.py"])
no_hidden_sha1 = dirhash(dirpath, "sha1", ignore=[".*", ".*/"])
```

CLI usage:

```
dirhash path/to/directory -a md5
dirhash path/to/directory -a md5 --match "*.py"
dirhash path/to/directory -a sha1 --ignore ".*" ".*/"
```

If you (or your application) need to verify the integrity of a set of files as well as their name and location, you might find this useful. Use cases range from verification of your image classification dataset (before spending GPU-$$$ on training your fancy Deep Learning model) to validation of generated files in regression testing.

There isn't really a standard way of doing this. There are plenty of recipes out there (see e.g. these SO questions for Linux and Python), but I couldn't find one that is properly tested (there are some gotchas to cover!) and documented with a compelling user interface. dirhash was created with this as the goal.

checksumdir is another Python module/tool with similar intent (that inspired this project), but it lacks much of the functionality offered here (most notably including file names/structure in the hash) and lacks tests.

The Python hashlib implementations of common hashing algorithms are highly optimised. dirhash mainly parses the file tree, pipes data to hashlib and combines the output. Reasonable measures have been taken to minimize the overhead, and for common use-cases the majority of time is spent reading data from disk and executing hashlib code.
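As an illustration of that pattern (a minimal sketch, not dirhash's actual source), hashing a single file's content amounts to streaming chunks into hashlib:

```python
import hashlib

def file_content_hash(path, algorithm="md5", chunk_size=2**20):
    """Stream a file through hashlib in fixed-size chunks; return the hex digest."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

dirhash then combines such per-file digests, together with file names and structure, according to the Dirhash Standard.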

The main effort to boost performance is support for multiprocessing, where the reading and hashing are parallelized over individual files.
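In the Python API this is exposed as a worker-count argument; the exact parameter name below is an assumption, so check `dirhash -h` and your installed version's docs:

```python
from dirhash import dirhash

# `jobs` is assumed to be the worker-count parameter; verify against your version.
dir_md5 = dirhash("path/to/directory", "md5", jobs=8)
```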

As a reference, let’s compare the performance of the dirhash CLI with the shell command:

```
find path/to/folder -type f -print0 | sort -z | xargs -0 md5 | md5
```

which is the top answer for the SO question: Linux: compute a single hash for a given folder & contents? Results for two test cases are shown below. Both have 1 GiB of random data: in "flat_1k_1MB", split into 1k files (1 MiB each) in a flat structure, and in "nested_32k_32kB", into 32k files (32 KiB each) spread over the 256 leaf directories in a binary tree of depth 8.

| Implementation | Test Case | Time (s) | Speed-up |
| --- | --- | --- | --- |
| shell reference | flat_1k_1MB | 2.29 | 1.0 |
| dirhash | flat_1k_1MB | 1.67 | 1.36 |
| dirhash (8 workers) | flat_1k_1MB | 0.48 | 4.73 |
| shell reference | nested_32k_32kB | 6.82 | 1.0 |
| dirhash | nested_32k_32kB | 3.43 | 2.00 |
| dirhash (8 workers) | nested_32k_32kB | 1.14 | 6.00 |

The benchmark was run on a MacBook Pro (2018); further details and source code here.

For further documentation, please refer to dirhash -h, the Python source code and the Dirhash Standard.


filehash 0.2.dev1

Module and command-line tool that wraps around hashlib and zlib to facilitate generating checksums / hashes of files and directories.


Python module to facilitate calculating the checksum or hash of a file. Tested against Python 2.7.x, Python 3.6.x, Python 3.7.x, Python 3.8.x, Python 3.9.x, Python 3.10.x, PyPy 2.7.x and PyPy3 3.7.x. Currently supports Adler-32, BLAKE2b, BLAKE2s, CRC32, MD5, SHA-1, SHA-224, SHA-256, SHA-384 and SHA-512.

(Note: BLAKE2b and BLAKE2s are only supported on Python 3.6.x and later.)

FileHash class

The FileHash class wraps around the hashlib (provides hashing for MD5, SHA-1, SHA-224, SHA-256, SHA-384 and SHA-512) and zlib (provides checksums for Adler-32 and CRC32) modules and contains the following methods:

  • hash_file(filename) — Calculate the file hash for a single file. Returns a string with the hex digest.
  • hash_files(filenames) — Calculate the file hash for multiple files. Returns a list of tuples where each tuple contains the filename and the calculated hash.
  • hash_dir(path, pattern='*') — Calculate the file hashes for an entire directory. Returns a list of tuples where each tuple contains the filename and the calculated hash.
  • cathash_files(filenames) — Calculate a single hash for multiple files. Files are sorted by their individual hash values and then traversed in that order to generate a combined hash value. Returns a string with the hex digest. (A usage sketch appears after the example usage section below.)
  • cathash_dir(path, pattern='*') — Calculate a single hash for an entire directory of files. Files are sorted by their individual hash values and then traversed in that order to generate a combined hash value. Returns a string with the hex digest.
  • verify_sfv(sfv_filename) — Reads the specified SFV (Simple File Verification) file and calculates the CRC32 checksums for the files listed, comparing them against the expected checksums. Returns a list of tuples where each tuple contains the filename and a boolean indicating whether the calculated CRC32 checksum matches the expected one. To find out more about SFV files, see the Simple file verification entry in Wikipedia.
  • verify_checksums(checksum_filename) — Reads the specified checksum file and calculates the hashes for the files listed, comparing them against the expected hashes. Returns a list of tuples where each tuple contains the filename and a boolean indicating whether the calculated hash matches the expected one.

For the checksum file, the file is expected to be a plain text file where each line has an entry formatted as follows:

```
<hash> *<filename>
```

This is the format used by programs such as the sha1sum family of tools for generating checksum files. Here is an example generated by sha1sum:

```
f7ef3b7afaf1518032da1b832436ef3bbfd4e6f0 *lorem_ipsum.txt
03da86258449317e8834a54cf8c4d5b41e7c7128 *lorem_ipsum.zip
```
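Such a file can be produced with sha1sum itself, e.g. `sha1sum -b lorem_ipsum.txt lorem_ipsum.zip > hashes.sha1` (the `-b` binary-mode flag yields the `*`-prefixed entries shown above).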

The FileHash constructor has two optional arguments:

  • hash_algorithm='sha256' — Specifies the hashing algorithm to use. See filehash.SUPPORTED_ALGORITHMS for the list of supported hash / checksum algorithms. Defaults to SHA-256.
  • chunk_size=4096 — Integer specifying the chunk size to use (in bytes) when reading the file. This comes in useful when processing very large files to avoid having to read the entire file into memory all at once. Default chunk size is 4096 bytes. (A short sketch follows this list.)
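For example, to keep the default SHA-256 algorithm but read in larger chunks when hashing big files (a usage sketch; the file name is hypothetical):

```python
from filehash import FileHash

# Default algorithm (SHA-256), 64 KiB reads instead of the 4 KiB default.
hasher = FileHash(chunk_size=65536)
print(hasher.hash_file("big_archive.zip"))  # hypothetical file
```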

Example usage

The library can be used as follows:

```python
>>> import os
>>> from filehash import FileHash
>>> md5hasher = FileHash('md5')
>>> md5hasher.hash_file("./testdata/lorem_ipsum.txt")
'72f5d9e3a5fa2f2e591487ae02489388'
>>> sha1hasher = FileHash('sha1')
>>> sha1hasher.hash_dir("./testdata", "*.zip")
[FileHashResult(filename='lorem_ipsum.zip', hash='03da86258449317e8834a54cf8c4d5b41e7c7128')]
>>> sha512hasher = FileHash('sha512')
>>> os.chdir("./testdata")
>>> sha512hasher.verify_checksums("./hashes.sha512")
[VerifyHashResult(filename='lorem_ipsum.txt', hashes_match=True), VerifyHashResult(filename='lorem_ipsum.zip', hashes_match=True)]
>>> crc32hasher = FileHash('crc32')
>>> crc32hasher.verify_sfv("./lorem_ipsum.sfv")
[VerifyHashResult(filename='lorem_ipsum.txt', hashes_match=True), VerifyHashResult(filename='lorem_ipsum.zip', hashes_match=True)]
```
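The session above does not exercise the cathash methods; here is a hedged sketch based on the method descriptions earlier (same testdata files assumed):

```python
from filehash import FileHash

sha1hasher = FileHash('sha1')
# One combined hex digest for a set of files (sorted internally by per-file hash).
combined = sha1hasher.cathash_files(["./testdata/lorem_ipsum.txt",
                                     "./testdata/lorem_ipsum.zip"])
# One combined hex digest for all *.txt files in a directory.
dir_digest = sha1hasher.cathash_dir("./testdata", "*.txt")
```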

chkfilehash command line tool

A command-line tool called chkfilehash is also included with the filehash package. Here is an example of how the tool can be used:

```
$ chkfilehash -a sha512 -c hashes.sha512
lorem_ipsum.txt: OK
lorem_ipsum.zip: OK
$ chkfilehash -a crc32 lorem_ipsum.zip
7425D3BE *lorem_ipsum.zip
$
```

Run the tool without any parameters or with the -h / --help switch to get a usage screen.

License

This is released under an MIT license. See the LICENSE file in this repository for more information.



swacad/Hash_Dir


This repository holds my hash_dir project. There are currently three Python programs in the repository:

  1. hash_file will hash a single file and print the hex digest to the screen.
  2. hash_dir imports the hash_file function from hash_file and recursively hashes all of the files in a directory, including hidden files and files in its subdirectories. The results are written to a CSV file whose name includes a timestamp and the hostname the program was run on. The CSV file is saved to the same directory hash_dir.py is run from.
  3. hash_diff takes two CSV files generated by the hash_dir program and outputs a diff report as a text file. The diff file reports which files are new, modified, deleted, and unchanged (a sketch of that classification step follows this list).
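The repository's source is not reproduced here; the following is a minimal sketch of a hash_diff-style classification, assuming two-column CSV rows of the form path,hash (the column layout is an assumption):

```python
# Minimal sketch of hash_diff-style classification; not the repository's actual code.
import csv

def load_hashes(csv_path):
    """Read a hash_dir-style CSV into a {path: hash} dict (assumed two-column rows)."""
    with open(csv_path, newline="") as f:
        return {row[0]: row[1] for row in csv.reader(f) if row}

def diff_hashes(old_csv, new_csv):
    old, new = load_hashes(old_csv), load_hashes(new_csv)
    common = set(old) & set(new)
    return {
        "new": sorted(set(new) - set(old)),
        "deleted": sorted(set(old) - set(new)),
        "modified": sorted(p for p in common if old[p] != new[p]),
        "unchanged": sorted(p for p in common if old[p] == new[p]),
    }
```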

This repository contains a standalone executable version of both hash_dir and hash_file for Microsoft Windows. You can use the executable to run the program without installing Python on the host machine.

To run directly from the command line, use the following format:

```
python hash_file.py FILE_PATH ALGORITHM
python hash_dir.py DIR_PATH ALGORITHM
```

FILE_PATH is the path to the file. DIR_PATH is the directory path. ALGORITHM is the hashing algorithm to be used and must be one of the following: 'md5', 'sha1', 'sha224', 'sha256', 'sha384', 'sha512'. As the examples show, hash_dir also accepts a maximum file size in MB and an optional flag to echo hashed files to the console.

Examples:

```
python hash_dir.py / md5 1
    (hash all files 1 MB or less in the root directory using MD5)
python hash_dir.py c:\ sha256 100 True
    (hash all files 100 MB or less in c:\ using SHA-256 and output hashed files to the console)
python hash_dir.py "c:\program files" sha1 1000
    (hash all files 1000 MB or less in "c:\program files" using SHA-1)
python hash_diff.py
```

Hashing small directories will be nearly instantaneous. Hashing large directories such as C:\Windows\System32 will probably take 10-20 seconds, depending on how fast your machine is. Hashing an entire root directory can take 20 minutes or longer.

The size of the CSV output is proportional to the number of files hashed. This is not a problem for small directories, but hashing large directories will create large CSV files. Make sure you have enough space for your files.

Some files cannot be read because they are protected. This can occur even when running with administrator credentials. These files are recorded with a "bad hash" value: the string "bad_hash" hashed with the chosen algorithm.
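For example (assuming SHA-256 was the chosen algorithm and ASCII encoding of the literal string), the recorded placeholder would be:

```python
import hashlib

# Sentinel recorded for unreadable files: the literal string "bad_hash", hashed.
print(hashlib.sha256(b"bad_hash").hexdigest())
```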
