Tuesday, 16 April 2024

xxhash is not faster than md5 on my files

The xxHash project describes itself like this: "xxHash is an Extremely fast Hash algorithm, processing at RAM speed limits."

Benchmark: 100 passes over 181 documents.

MD5:
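    import hashlib
    import timeit

    from tqdm import tqdm

    # `paths` holds the pathlib.Path entries to hash (defined elsewhere).
    start = timeit.default_timer()
    for _ in tqdm(range(100)):  # 100 passes over the same files
        total = 0
        for path in paths:
            if not path.is_file():
                continue
            digest = hashlib.md5(path.read_bytes()).hexdigest()
            total += 1
    print(total)  # number of files hashed in the last pass
    end = timeit.default_timer()
    print(f"Time: {end - start}")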
181
Time: 10.780842124004266
xxhash:
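    import timeit

    import xxhash
    from tqdm import tqdm

    # `paths` holds the pathlib.Path entries to hash (defined elsewhere).
    start = timeit.default_timer()
    for _ in tqdm(range(100)):  # 100 passes over the same files
        total = 0
        for path in paths:
            if not path.is_file():
                continue
            # The 64-bit hash in the xxhash package is named xxh64.
            digest = xxhash.xxh64(path.read_bytes()).hexdigest()
            total += 1
    print(total)  # number of files hashed in the last pass
    end = timeit.default_timer()
    print(f"Time: {end - start}")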
181
Time: 10.775027380004758

Not significantly faster. I guess most of the time is spent in I/O. I'll go with MD5 because it's more common, familiar to others, and implemented everywhere.
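One way to check the I/O guess is to time the file read and the hash call separately. The following is only a rough sketch, not the benchmark that was run above: the helper name `time_read_and_hash` and the temporary stand-in files are made up for illustration, and it uses `hashlib.md5` so that it runs without extra packages.

```python
import hashlib
import tempfile
import timeit
from pathlib import Path


def time_read_and_hash(paths, hasher, repeats=100):
    """Return (read_seconds, hash_seconds) summed over `repeats` passes."""
    read_seconds = 0.0
    hash_seconds = 0.0
    for _ in range(repeats):
        for path in paths:
            if not path.is_file():
                continue
            t0 = timeit.default_timer()
            data = path.read_bytes()   # I/O only
            t1 = timeit.default_timer()
            hasher(data).hexdigest()   # hashing only, data is already in memory
            t2 = timeit.default_timer()
            read_seconds += t1 - t0
            hash_seconds += t2 - t1
    return read_seconds, hash_seconds


if __name__ == "__main__":
    # Hypothetical stand-in files; replace `paths` with the real documents.
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i in range(5):
            p = Path(tmp) / f"doc_{i}.bin"
            p.write_bytes(bytes([i]) * 100_000)
            paths.append(p)
        read_s, hash_s = time_read_and_hash(paths, hashlib.md5, repeats=10)
        print(f"read: {read_s:.4f}s  hash: {hash_s:.4f}s")
```

If the xxhash package is installed, passing `xxhash.xxh64` instead of `hashlib.md5` should give the matching split for xxHash. If the read time dominates in both cases, the choice of hash hardly matters for files like these.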

