Thursday, 18 April 2024

Python MD5: incremental vs naive

Incremental hashing is much easier on RAM, which matters especially when hundreds of threads are computing MD5 in parallel. But how much slower is it?
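(Both timings below iterate over files collected by a private _scan_files helper that isn't shown here. Purely as a hypothetical stand-in, assume it walks root and yields up to max regular files:)

        from pathlib import Path
        from typing import Iterator

        def _scan_files(self, root: Path, max: int | None = None) -> Iterator[Path]:
            # Hypothetical stand-in: yield up to `max` regular files under `root`
            count = 0
            for path in root.rglob('*'):
                if path.is_file():
                    yield path
                    count += 1
                    if max is not None and count >= max:
                        return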

Naive

        import hashlib
        import timeit

        from tqdm import tqdm

        paths = list(self._scan_files(self.root, max=1000))

        start = timeit.default_timer()

        for path in tqdm(paths):
            # One-shot MD5: read_bytes() loads the whole file into memory
            file_hash = hashlib.md5(path.read_bytes()).hexdigest()

        end = timeit.default_timer()
        print(f'Elapsed time: {end - start:.2f} seconds')
Elapsed time: 24.46 seconds

Incremental


        start = timeit.default_timer()

        for path in tqdm(paths):
            # Incremental MD5: read the file in 4 KiB chunks
            h = hashlib.md5()
            with path.open("rb") as file:
                for chunk in iter(lambda: file.read(4096), b""):
                    h.update(chunk)
            file_hash = h.hexdigest()

        end = timeit.default_timer()
        print(f'Elapsed time: {end - start:.2f} seconds')
Elapsed time: 19.49 seconds
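Side note: on Python 3.11+, hashlib.file_digest does exactly this chunked loop for you, so the body of the loop shrinks to:

        # Python 3.11+: file_digest does the buffered reads internally
        with path.open("rb") as file:
            file_hash = hashlib.file_digest(file, "md5").hexdigest()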

I can hardly believe that it is actually faster... Maybe _scan_files returned different files? But I ran it twice and got the same result. One plausible explanation: read_bytes() allocates a buffer the size of the whole file, while the chunked loop reuses one small buffer, avoiding large allocations. Either way, incremental seems clearly better:

  • It is far easier on memory, especially with many parallel threads
  • It may even be faster
  • More control: I could stream the bytes up to a remote server while hashing, reading each file only once (see the sketch below)

When in doubt, take the incremental one; it is only a little more code to maintain.
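Here is a rough sketch of that streaming idea. The URL and endpoint are assumptions; I'm only relying on the fact that requests sends a generator body using chunked transfer encoding:

        import hashlib
        from pathlib import Path

        import requests

        def hash_and_upload(path: Path, url: str) -> str:
            # Read the file once: each chunk feeds both the hash and the upload
            h = hashlib.md5()

            def chunks():
                with path.open('rb') as file:
                    for chunk in iter(lambda: file.read(4096), b''):
                        h.update(chunk)
                        yield chunk

            # requests streams a generator body as a chunked request
            requests.post(url, data=chunks())
            return h.hexdigest()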


Parse Wikipedia dump

""" This module processes Wikipedia dump files by extracting individual articles and parsing them into a structured format, ...