Thursday, 18 April 2024

Python MD5: incremental vs naive

Incremental hashing is much easier on RAM, which matters especially when hundreds of threads are computing MD5 in parallel. But how much slower is it?
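(Both timings below iterate over files collected by a private _scan_files helper that isn't shown here. Purely as a hypothetical stand-in, assume it walks root and yields up to max regular files:)

        from pathlib import Path
        from typing import Iterator

        def _scan_files(self, root: Path, max: int | None = None) -> Iterator[Path]:
            # Hypothetical stand-in: yield up to `max` regular files under `root`
            count = 0
            for path in root.rglob('*'):
                if path.is_file():
                    yield path
                    count += 1
                    if max is not None and count >= max:
                        return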

Naive

        import hashlib
        import timeit

        from tqdm import tqdm

        paths = list(self._scan_files(self.root, max=1000))

        start = timeit.default_timer()

        for path in tqdm(paths):
            # One-shot MD5: read_bytes() loads the whole file into memory
            file_hash = hashlib.md5(path.read_bytes()).hexdigest()

        end = timeit.default_timer()
        print(f'Elapsed time: {end - start:.2f} seconds')
Elapsed time: 24.46 seconds

Incremental


        start = timeit.default_timer()

        for path in tqdm(paths):
            # Incremental MD5: read the file in 4 KiB chunks
            h = hashlib.md5()
            with path.open("rb") as file:
                for chunk in iter(lambda: file.read(4096), b""):
                    h.update(chunk)
            file_hash = h.hexdigest()

        end = timeit.default_timer()
        print(f'Elapsed time: {end - start:.2f} seconds')
Elapsed time: 19.49 seconds
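Side note: on Python 3.11+, hashlib.file_digest does exactly this chunked loop for you, so the body of the loop shrinks to:

        # Python 3.11+: file_digest does the buffered reads internally
        with path.open("rb") as file:
            file_hash = hashlib.file_digest(file, "md5").hexdigest()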

I can hardly believe that it is actually faster... Maybe _scan_files returned different files? But I ran it twice and got the same result. One plausible explanation: read_bytes() allocates a buffer the size of the whole file, while the chunked loop reuses one small buffer, avoiding large allocations. Either way, incremental seems clearly better:

  • It is far easier on memory, especially with many parallel threads
  • It may even be faster
  • More control: I could stream the bytes up to a remote server while hashing, reading each file only once (see the sketch below)

When in doubt, take the incremental one; it is only a little more code to maintain.
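Here is a rough sketch of that streaming idea. The URL and endpoint are assumptions; I'm only relying on the fact that requests sends a generator body using chunked transfer encoding:

        import hashlib
        from pathlib import Path

        import requests

        def hash_and_upload(path: Path, url: str) -> str:
            # Read the file once: each chunk feeds both the hash and the upload
            h = hashlib.md5()

            def chunks():
                with path.open('rb') as file:
                    for chunk in iter(lambda: file.read(4096), b''):
                        h.update(chunk)
                        yield chunk

            # requests streams a generator body as a chunked request
            requests.post(url, data=chunks())
            return h.hexdigest()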


Parse Wikipedia dump

""" This module processes Wikipedia dump files by extracting individual articles and parsing them into a structured format, ...