Incremental hashing is much easier on RAM, which is especially relevant when you have hundreds of threads computing MD5 hashes in parallel. But how much slower is it?
Naive
import hashlib
import timeit
from tqdm import tqdm

paths = list(self._scan_files(self.root, max=1000))

start = timeit.default_timer()
for path in tqdm(paths):
    # md5 of path: read_bytes() loads the whole file into memory at once
    file_hash = hashlib.md5(path.read_bytes()).hexdigest()
end = timeit.default_timer()
print(f'Elapsed time: {end - start:.2f} seconds')
Elapsed time: 24.46 seconds
Incremental
start = timeit.default_timer()
for path in tqdm(paths):
    # incremental md5 of path: hash in fixed 4 KiB chunks
    h = hashlib.md5()
    with path.open("rb") as file:
        for chunk in iter(lambda: file.read(4096), b""):
            h.update(chunk)
    file_hash = h.hexdigest()
end = timeit.default_timer()
print(f'Elapsed time: {end - start:.2f} seconds')
Elapsed time: 19.49 seconds
I can hardly believe that it is actually faster... Maybe _scan_files returned different files? But I ran it twice and got the same result. One plausible explanation: read_bytes() allocates a buffer the size of the whole file, while the incremental version reuses one small 4 KiB buffer. Either way, incremental hashing seems to be the better approach, because:
- It is far easier on memory, especially with many parallel threads (see the first sketch below)
- It may even be a bit faster
- More control: I could stream the bytes I read to a remote server and thus read each file only once (see the second sketch below)
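To illustrate the memory point, here is a minimal sketch of incremental hashing across a thread pool. The helper names (md5_incremental, hash_all) and the worker count are my own assumptions, not part of the benchmarks above; the point is that each worker only ever holds one 4 KiB chunk in memory, regardless of file size.

import hashlib
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

CHUNK_SIZE = 4096  # same buffer size as in the benchmark above

def md5_incremental(path: Path) -> str:
    # Hash one file in fixed-size chunks; peak memory is one chunk per worker.
    h = hashlib.md5()
    with path.open("rb") as file:
        for chunk in iter(lambda: file.read(CHUNK_SIZE), b""):
            h.update(chunk)
    return h.hexdigest()

def hash_all(paths, workers=32):  # workers=32 is an arbitrary example value
    # paths is assumed to be the same list of Path objects as above
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(paths, pool.map(md5_incremental, paths)))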
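And a minimal sketch of the hash-while-uploading idea, assuming a hypothetical HTTP endpoint. It uses requests with a generator body, which requests sends with chunked transfer encoding; the function name and the url parameter are illustrative, not from the code above.

import hashlib
from pathlib import Path

import requests  # any HTTP client that accepts a streaming body would do

def hash_and_upload(path: Path, url: str) -> str:
    # Read the file once: every chunk feeds both the hash and the upload stream.
    h = hashlib.md5()

    def chunks():
        with path.open("rb") as file:
            for chunk in iter(lambda: file.read(4096), b""):
                h.update(chunk)  # hash the chunk ...
                yield chunk      # ... and hand the same bytes to the uploader

    # requests sends a generator body with chunked transfer encoding,
    # so the file is never fully in memory; url is a hypothetical endpoint
    requests.post(url, data=chunks())
    return h.hexdigest()  # the generator is exhausted once post() returns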