- Generating an MD5 checksum of a file
- 9 Answers
- Create MD5 Hash of a file in Python
- Create MD5 hash of a file in Python
- Incorrect Way to create MD5 Hash of a file in Python
- Correct Way to create MD5 Hash of a file in Python
- MD5 Hash of Large Files in Python
- Compare and Verify MD5 hash of a file using python
Generating an MD5 checksum of a file
Is there any simple way of generating (and checking) MD5 checksums of a list of files in Python? (I have a small program I’m working on, and I’d like to confirm the checksums of the files).
@kennytm The link you provided says this in the second paragraph: "The underlying MD5 algorithm is no longer deemed secure" while describing md5sum. That is why security-conscious programmers should not use it, in my opinion.
@Debug255 Good and valid point. Both md5sum and the technique described in this SO question should be avoided — it’s better to use SHA-2 or SHA-3, if possible: en.wikipedia.org/wiki/Secure_Hash_Algorithms
Might be worth mentioning there are still valid reasons to use md5 that are not affected by its brokenness for security purposes (e.g. checking for bit rot in a system that uses baked-in md5 creation during archival).
9 Answers
Note that sometimes you won’t be able to fit the whole file in memory. In that case, you’ll have to read chunks of 4096 bytes sequentially and feed them to the md5 method:
```python
import hashlib

def md5(fname):
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        # Read fixed-size chunks until read() returns an empty bytes object
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()
```

Note: hash_md5.hexdigest() returns the digest as a hex string. If you just need the packed bytes, use return hash_md5.digest() instead, so you don't have to convert back.
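The relationship between the two return forms can be checked directly. This is a minimal sketch; the byte string `data` is a made-up stand-in for the contents of a file, not something from the answer above.

```python
import hashlib

# Hypothetical sample data standing in for bytes read from a file
data = b"example file contents"

h = hashlib.md5(data)
raw = h.digest()       # packed bytes: 16 of them for MD5
text = h.hexdigest()   # the same digest as 32 hexadecimal characters

# The hex string is just the packed bytes written out in hexadecimal,
# so either form can be recovered from the other.
assert raw.hex() == text
assert bytes.fromhex(text) == raw
print(len(raw), len(text))
```

Storing the packed bytes halves the space needed, while the hex string is the form that tools such as md5sum print.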
@alper no it doesn’t — sorry to put it so flippantly-sounding, but there is no way that md5 differs for the same input — if you’re reading binary (not line-ending-agnostic) input, then this algorithm is deterministic — md5’s famous problem is that it might FAIL TO DIFFER for two different inputs
@rsandwick3 As I understand it, MD5 may end up generating the same output for two different inputs?
There is a way that’s pretty memory inefficient.
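```python
import hashlib

def file_as_bytes(file):
    with file:
        return file.read()

# Single file (full_path is the path of the file to hash)
print(hashlib.md5(file_as_bytes(open(full_path, 'rb'))).hexdigest())

# List of files (fnamelst is a list of file names)
[(fname, hashlib.md5(file_as_bytes(open(fname, 'rb'))).digest()) for fname in fnamelst]
```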
Recall though, that MD5 is known broken and should not be used for any purpose since vulnerability analysis can be really tricky, and analyzing any possible future use your code might be put to for security issues is impossible. IMHO, it should be flat out removed from the library so everybody who uses it is forced to update. So, here’s what you should do instead:
```python
[(fname, hashlib.sha256(file_as_bytes(open(fname, 'rb'))).digest()) for fname in fnamelst]
```

If you only want 128 bits worth of digest you can do .digest()[:16].
This will give you a list of tuples, each tuple containing the name of its file and its hash.
Again I strongly question your use of MD5. You should be at least using SHA1, and given recent flaws discovered in SHA1, probably not even that. Some people think that as long as you’re not using MD5 for ‘cryptographic’ purposes, you’re fine. But stuff has a tendency to end up being broader in scope than you initially expect, and your casual vulnerability analysis may prove completely flawed. It’s best to just get in the habit of using the right algorithm out of the gate. It’s just typing a different bunch of letters is all. It’s not that hard.
Here is a way that is more complex, but memory efficient:
```python
import hashlib

def hash_bytestr_iter(bytesiter, hasher, ashexstr=False):
    # Feed every block to the hasher, then return the digest in the requested form
    for block in bytesiter:
        hasher.update(block)
    return hasher.hexdigest() if ashexstr else hasher.digest()

def file_as_blockiter(afile, blocksize=65536):
    # Yield the file's contents one block at a time, closing the file at the end
    with afile:
        block = afile.read(blocksize)
        while len(block) > 0:
            yield block
            block = afile.read(blocksize)

[(fname, hash_bytestr_iter(file_as_blockiter(open(fname, 'rb')), hashlib.md5()))
    for fname in fnamelst]
```
And, again, since MD5 is broken and should not really ever be used anymore:
```python
[(fname, hash_bytestr_iter(file_as_blockiter(open(fname, 'rb')), hashlib.sha256()))
    for fname in fnamelst]
```

Again, you can put [:16] after the call to hash_bytestr_iter(...) if you only want 128 bits worth of digest.
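Put together for a single file, the truncation might look like the sketch below. The function name `sha256_128` is hypothetical and only for illustration; it reads the file in blocks and keeps the first 16 bytes of the SHA-256 digest.

```python
import hashlib

def sha256_128(fname, blocksize=65536):
    """Return the first 16 bytes (128 bits) of the file's SHA-256 digest.

    Hypothetical helper for illustration; not part of hashlib.
    """
    hasher = hashlib.sha256()
    with open(fname, "rb") as f:
        # Feed the file to the hasher one block at a time
        for block in iter(lambda: f.read(blocksize), b""):
            hasher.update(block)
    # digest() is 32 bytes for SHA-256; slicing keeps the leading 128 bits
    return hasher.digest()[:16]
```

The result has the same length as an MD5 digest, so it fits wherever a 128-bit value is expected, but it is not interchangeable with an actual MD5 value.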
@GregS, @TheLifelessOne — Yeah, and next thing you know someone finds a way to use this fact about your application to cause a file to be accepted as uncorrupted when it isn’t the file you’re expecting at all. No, I stand by my scary warnings. I think MD5 should be removed or come with deprecation warnings.
I’d probably use .hexdigest() instead of .digest() — it’s easier for humans to read — which is the purpose of OP.
I used this solution but it incorrectly gave the same hash for two different pdf files. The solution was to open the files by specifying binary mode, that is: [(fname, hashlib.md5(open(fname, 'rb').read()).hexdigest()) for fname in fnamelst] This is more related to the open function than md5 but I thought it might be useful to report it given the requirement for cross-platform compatibility stated above (see also: docs.python.org/2/tutorial/…).
I’m clearly not adding anything fundamentally new, but added this answer before I was up to commenting status, plus the code regions make things more clear — anyway, specifically to answer @Nemo’s question from Omnifarious’s answer:
I happened to be thinking about checksums a bit (came here looking for suggestions on block sizes, specifically), and have found that this method may be faster than you’d expect. Taking the fastest (but pretty typical) timeit.timeit or /usr/bin/time result from each of several methods of checksumming a file of approx. 11MB:
```
$ ./sum_methods.py
crc32_mmap(filename) 0.0241742134094
crc32_read(filename) 0.0219960212708
subprocess.check_output(['cksum', filename]) 0.0553209781647
md5sum_mmap(filename) 0.0286180973053
md5sum_read(filename) 0.0311000347137
subprocess.check_output(['md5sum', filename]) 0.0332629680634
$ time md5sum /tmp/test.data.300k
d3fe3d5d4c2460b5daacc30c6efbc77f /tmp/test.data.300k

real 0m0.043s
user 0m0.032s
sys 0m0.010s
$ stat -c '%s' /tmp/test.data.300k
11890400
```
So, looks like both Python and /usr/bin/md5sum take about 30ms for an 11MB file. The relevant md5sum function ( md5sum_read in the above listing) is pretty similar to Omnifarious’s:
```python
import hashlib

def md5sum(filename, blocksize=65536):
    hash = hashlib.md5()
    with open(filename, "rb") as f:
        # Keep reading blocks until read() returns empty bytes at end of file
        for block in iter(lambda: f.read(blocksize), b""):
            hash.update(block)
    return hash.hexdigest()
```
Granted, these are from single runs (the mmap ones are always a smidge faster when at least a few dozen runs are made), and mine’s usually got an extra f.read(blocksize) after the buffer is exhausted, but it’s reasonably repeatable and shows that md5sum on the command line is not necessarily faster than a Python implementation.
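The mmap variant in the listing is not shown, so the following is only a guess at what such a function could look like, not the code that produced the timings. It assumes Python 3, where `hashlib.md5()` accepts any bytes-like object, including a read-only memory map.

```python
import hashlib
import mmap
import os

def md5sum_mmap(filename):
    """MD5 of a file through a read-only memory map (illustrative sketch)."""
    with open(filename, "rb") as f:
        # mmap cannot map a zero-length file, so hash the empty input directly
        if os.fstat(f.fileno()).st_size == 0:
            return hashlib.md5(b"").hexdigest()
        # Length 0 maps the whole file; ACCESS_READ keeps the mapping read-only
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return hashlib.md5(mapped).hexdigest()
```

The operating system pages the file in on demand, so the whole file is not copied into a Python bytes object first.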
EDIT: Sorry for the long delay, haven’t looked at this in some time, but to answer @EdRandall’s question, I’ll write down an Adler32 implementation. However, I haven’t run the benchmarks for it. It’s basically the same as the CRC32 would have been: instead of the init, update, and digest calls, everything is a zlib.adler32() call:
```python
import zlib

def adler32sum(filename, blocksize=65536):
    # Start from the Adler-32 of empty input (must be bytes in Python 3)
    checksum = zlib.adler32(b"")
    with open(filename, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            # Pass the running value so the checksum carries across blocks
            checksum = zlib.adler32(block, checksum)
    return checksum & 0xffffffff
```

Note that this must start off with the empty byte string, since Adler sums do indeed differ when starting from zero versus their sum for b"", which is 1. CRC can start with 0 instead. The AND-ing is needed to make it a 32-bit unsigned integer, which ensures it returns the same value across Python versions.
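For comparison, a CRC32 counterpart could be sketched as follows. This is not the benchmarked `crc32_read` function, just an assumption about its general shape, using `zlib.crc32()` with a running value in the same way.

```python
import zlib

def crc32sum(filename, blocksize=65536):
    """CRC-32 of a file, computed block by block (illustrative sketch)."""
    checksum = 0  # unlike Adler-32, CRC-32 can start from 0
    with open(filename, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            # Pass the running value so the checksum carries across blocks
            checksum = zlib.crc32(block, checksum)
    return checksum & 0xffffffff

# The two starting values mentioned above:
print(zlib.adler32(b""))  # 1
print(zlib.crc32(b""))    # 0
```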
Create MD5 Hash of a file in Python
As a Python enthusiast, I’m always on the lookout for handy tools and techniques that can streamline my development process. One such technique that I find particularly useful is generating MD5 hashes of files. Whether you’re ensuring data integrity or verifying that a file survived transmission intact, MD5 hashes can be invaluable. In this blog post, I’m excited to guide you through the process of creating an MD5 hash of a file in Python. So, let’s dive in and unlock the power of file hashing!
MD5 is (at least when it was created) a standardized one-way function that takes input data of any size and maps it to a fixed-size output: a 128-bit digest, usually written as 32 hexadecimal characters, irrespective of the size of the input.
Though it was designed as a cryptographic hash function, it has been found to suffer from serious vulnerabilities.
The hash function generates the same output hash for the same input. This means that you can use the hash to validate files, text, or any other data when you pass it across the network or store it. MD5 can act as a stamp for checking whether the data is still intact.
Input String | Output Hash |
---|---|
hi | 49f68a5c8493ec2c0bf489821c21fc3b |
debugpointer | d16220bc73b8c7176a3971c7f73ac8aa |
computer science is amazing! I love it. | f3c5a497380310d828cdfc1737e8e2a3 |
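Rows like these can be generated with a few lines of Python. This sketch assumes the input strings were encoded as UTF-8 before hashing, which the table does not state; the helper name `md5_of_text` is made up for illustration.

```python
import hashlib

def md5_of_text(text):
    """Hex MD5 digest of a string, encoded as UTF-8 before hashing."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

# Print one "input | hash" row per string
for s in ["hi", "debugpointer", "computer science is amazing! I love it."]:
    print(s, "|", md5_of_text(s))
```

Running it twice prints the same hashes both times, which is the determinism described above.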
Create MD5 hash of a file in Python
An MD5 hash can be created with hashlib, a module in Python’s standard library.
Incorrect Way to create MD5 Hash of a file in Python
Note, however, that you cannot create the hash of a file just by passing the file's name, like this:

```python
# this is NOT correct
import hashlib

print(hashlib.md5("filename.jpg".encode('UTF-8')).hexdigest())
```
03e6eda992afdeda6b2acaed17722515
The above value is NOT the MD5 hash of the file. It is the MD5 hash of the string filename.jpg itself.
Correct Way to create MD5 Hash of a file in Python
You have to read the contents of the file to create MD5 hash of the file itself. It’s simple, we can just read the contents of the file and create the hash.
The process of creating an MD5 hash of a file in Python is very simple. First import hashlib, then open the file in binary mode ('rb') and read its contents as bytes, and pass those bytes to the hashlib.md5() function. Finally, call hexdigest() on the resulting hash object to get the digest as a hexadecimal string.

```python
import hashlib

file_name = 'filename.jpg'

# Open in binary mode so that hashlib receives bytes, not text
with open(file_name, 'rb') as f:
    data = f.read()

md5hash = hashlib.md5(data).hexdigest()
```
MD5 Hash of Large Files in Python
The above code has one problem: it reads the whole file into memory at once. If the file is 10 GB, say a large log file, a dump of network traffic, or a game like FIFA, computing its MD5 hash this way would probably chew up your memory.
Here is a memory-optimised way of computing the MD5 hash, where we read chunks of 4096 bytes (the chunk size can be customised to suit your system and file sizes). We process the chunks sequentially and update the hash with each one. If the file consists of 1000 such chunks, hash_md5 is updated 1000 times.
At the end we return the hexdigest value of hash_md5, which is the digest encoded as a hexadecimal string.

```python
import hashlib

# A utility function that can be used in your code
def compute_md5(file_name):
    hash_md5 = hashlib.md5()
    with open(file_name, "rb") as f:
        # Read 4096-byte chunks until read() returns empty bytes at end of file
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()
```
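On Python 3.11 and later, the standard library also offers `hashlib.file_digest()`, which does the chunked reading for you. The sketch below uses it when available and otherwise falls back to the manual loop; the function name `compute_md5_compat` is hypothetical.

```python
import hashlib

def compute_md5_compat(file_name):
    """MD5 of a file, using hashlib.file_digest() where it exists (3.11+)."""
    with open(file_name, "rb") as f:
        if hasattr(hashlib, "file_digest"):
            # file_digest reads the binary file object in chunks internally
            return hashlib.file_digest(f, "md5").hexdigest()
        # Fallback for older Python versions: the same manual chunk loop
        hash_md5 = hashlib.md5()
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
        return hash_md5.hexdigest()
```

Both branches should return the same hex string for the same file, so callers do not need to know which one ran.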
Compare and Verify MD5 hash of a file using python
You need to verify the MD5 hash at the server or at some other point or logic in your code.
To verify the MD5 hash you will have to create the MD5 hash of the original file again.
Then compare the original MD5 value that the source has generated and MD5 that you generate.
```python
import hashlib

file_name = 'filename.jpg'
original_md5 = '5d41402abc4b2a76b9719d911017c592'

# Open in binary mode so that hashlib receives bytes, not text
with open(file_name, 'rb') as f:
    data = f.read()

md5_returned = hashlib.md5(data).hexdigest()

if original_md5 == md5_returned:
    print("MD5 verified.")
else:
    print("MD5 verification failed.")
```
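The same check can be extended to several files while reading each one in chunks. This is a minimal sketch under stated assumptions: the expected hashes are supplied as a dictionary mapping file path to hex digest, and the names `file_md5` and `verify_files` are hypothetical.

```python
import hashlib

def file_md5(path, chunk_size=4096):
    """Hex MD5 of a file, read in chunks so large files fit in memory."""
    hash_md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

def verify_files(expected):
    """Map each path in `expected` to True if its MD5 matches, else False.

    `expected` is assumed to be a dict of {path: hex digest string}.
    """
    results = {}
    for path, want in expected.items():
        # hexdigest() is lowercase, so normalise the expected value first
        results[path] = file_md5(path) == want.strip().lower()
    return results
```

A caller could then report every path whose value is False instead of stopping at the first mismatch.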
The process of MD5 creation and verification is easy as we discussed above. Happy Coding!
NOTE: Please do not use MD5 to hash passwords for storage in your databases. Use a dedicated password-hashing function such as bcrypt, scrypt, or Argon2 instead; a single fast hash, even SHA-256 or SHA-512, is not designed for that job.
We’ve reached the end of our journey through the world of file hashing using MD5 in Python. I hope this exploration has empowered you with the knowledge and skills to incorporate this technique into your own projects. The ability to generate MD5 hashes of files gives you a simple way to validate the integrity of files. As you continue your Python coding adventures, remember the importance of data integrity and the role that MD5 hashes can play in achieving it. Keep coding, keep exploring, and keep harnessing the power of Python!