Saturday, 20 April 2024

IO bottleneck when hashing: How many threads are best?

I want to hash all files in a directory. IO (to an external SSD) is the bottleneck. How many threads should I use, and does it even matter?

import hashlib

def hash_path(path: str):
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(4096), b''):
            h.update(chunk)
    return h.hexdigest()
from pathlib import Path
path = Path('/media/usb/dynamic_docs')
paths = [p for p in path.glob('**/*') if p.is_file()]
paths = paths[:1000]
import timeit
from multiprocessing.pool import ThreadPool
from tqdm.notebook import tqdm

x = [1, 2, 3, 4, 6, 8, 10, 15, 20, 25, 30, 40, 50, 70, 80, 120, 140]
y = []
for n in tqdm(x):
    with ThreadPool(n) as pool:
        start = timeit.default_timer()
        pool.map(hash_path, paths)
        end = timeit.default_timer()
        duration = end - start
        print(n, duration)
        y.append(duration)
1 5.8160336340006324
2 3.622440821993223
3 3.1050608700024895
4 2.9156927020012517
6 2.775160079996567
8 2.7484407580050174
10 2.7938896600026055
15 2.8651001560065197
20 2.8214209449943155
25 2.7490319029966486
30 2.8025731050001923
40 2.8451450540014775
50 2.6865460660046665
70 2.8644705260012415
80 2.8269265680064564
120 2.73746527700132
140 2.775528302998282
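The durations flatten out around 4 to 8 threads, so beyond that the SSD appears saturated. A quick sanity check on a few of the measured durations (values copied from the run above):

```python
# Speedup over the single-threaded run, using the durations measured above.
durations = {1: 5.816, 2: 3.622, 4: 2.916, 8: 2.748, 50: 2.687}
for n, d in durations.items():
    print(f"{n:>2} threads: {durations[1] / d:.2f}x speedup")
```

So roughly 2x speedup is all the drive gives, and 8 threads already reach it.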







Thursday, 18 April 2024

Python MD5: incremental vs naive

Incremental hashing is much easier on RAM, which is especially relevant when you have hundreds of threads doing MD5 in parallel. How much slower is it?

Naive

        paths = list(self._scan_files(self.root, max=1000))

        start = timeit.default_timer()

        for path in tqdm(paths):
            # md5 of path
            hash = hashlib.md5(path.read_bytes()).hexdigest()

        end = timeit.default_timer()
        print(f'Elapsed time: {end - start:.2f} seconds')
Elapsed time: 24.46 seconds

Incremental


        for path in tqdm(paths):
            # incremental md5 of path
            h = hashlib.md5()
            with path.open("rb") as file:
                for chunk in iter(lambda: file.read(4096), b""):
                    h.update(chunk)
            file_hash = h.hexdigest()
Elapsed time: 19.49 seconds

I can hardly believe that it is actually faster... Maybe _scan_files returned different files? But I ran it twice and got the same result. Incremental seems to be the way better choice, as

  • It is far easier on memory, especially with many parallel threads
  • It may even be faster
  • More control: I could stream the read bytes up to a remote server and therefore read the file only once
When in doubt, take the incremental one; it is only a little more code to maintain.
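The last bullet can be sketched as a hash-while-streaming pass: update the digest and forward each chunk to the destination in one read. The destination here is a hypothetical stand-in (a BytesIO); in real code it would be a socket or an HTTP upload:

```python
import hashlib
import os
import tempfile
from io import BytesIO
from pathlib import Path

def hash_and_stream(path: Path, dest) -> str:
    """Read the file once: update the MD5 and forward each chunk to dest."""
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            h.update(chunk)
            dest.write(chunk)  # hypothetical upload target
    return h.hexdigest()

# Demo with a throwaway file and a BytesIO standing in for the remote server
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello world")
    name = tmp.name
buf = BytesIO()
digest = hash_and_stream(Path(name), buf)
os.unlink(name)
assert buf.getvalue() == b"hello world"
assert digest == hashlib.md5(b"hello world").hexdigest()
```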

Python: glob is even slower than pure python

Not as expected: glob takes twice as long as a pure Python implementation. The latter even gives you more control.

Here is the glob version:

import timeit
from pathlib import Path

class Crawler:

    def __init__(self):
        self.root = Path('/run/media/tom/external')
        assert self.root.exists(), f'{self.root} does not exist'

    def work(self):
        start = timeit.default_timer()
        list(self.root.glob('**/*'))
        end = timeit.default_timer()
        print(f'Elapsed time: {end - start}')

This outputs:

Elapsed time: 5.392456786998082

A pure python implementation:

import os
import timeit
from collections import deque
from pathlib import Path
from typing import Iterator

class Crawler:

    def __init__(self):
        self.root = Path('/run/media/tom/external')
        assert self.root.exists(), f'{self.root} does not exist'

    def work(self):
        start = timeit.default_timer()
        list(self._scan_files(self.root))
        end = timeit.default_timer()
        print(f'Elapsed time: {end - start}')

    def _scan_files(self, directory, pbar=None) -> Iterator[Path]:
        '''
        Instead of glob('**/*'). They say it's faster, but I didn't check
        '''
        queue = deque([directory])  # Start with the root directory
        while queue:
            current_dir = queue.popleft()  # Get one directory from the queue
            with os.scandir(current_dir) as scanner:
                for entry in scanner:
                    if entry.is_dir(follow_symlinks=False):
                        # Add sub-directories to the queue
                        queue.append(entry.path)
                    elif entry.is_file():
                        if pbar is not None:
                            pbar.update(1)
                        yield Path(entry.path)  # Yield each file path

This gives:

Elapsed time: 2.4762056120016496

More than twice as fast! I did not expect that at all!

It even gives you more flexibility, like adding a progress bar or a max-elements limit (e.g. for prototyping):

class Crawler:

    def __init__(self):
        self.root = Path('/run/media/tom/external')
        assert self.root.exists(), f'{self.root} does not exist'

    def work(self):
        start = timeit.default_timer()
        list(self._scan_files(self.root))
        end = timeit.default_timer()
        print(f'Elapsed time: {end - start}')

    def _scan_files(self, directory, pbar=None, max=None) -> Iterator[Path]:
        '''
        Instead of glob('**/*'). They say it's faster, but I didn't check
        '''
        queue = deque([directory])  # Start with the root directory
        count = 0
        while queue:
            current_dir = queue.popleft()  # Get one directory from the queue
            with os.scandir(current_dir) as scanner:
                for entry in scanner:
                    if entry.is_dir(follow_symlinks=False):
                        # Add sub-directories to the queue
                        queue.append(entry.path)
                    elif entry.is_file():
                        if pbar is not None:
                            pbar.update(1)
                        yield Path(entry.path)  # Yield each file path
                        count += 1
                        if max is not None and count >= max:
                            return
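As a usage sketch: the crawler's root is machine-specific, so this demo builds a throwaway tree, and a tiny `Counter` class stands in for a tqdm progress bar (anything with an `update(int)` method works):

```python
import os
import tempfile
from collections import deque
from pathlib import Path
from typing import Iterator

def scan_files(directory, pbar=None, max=None) -> Iterator[Path]:
    # Standalone version of Crawler._scan_files from above.
    queue = deque([directory])
    count = 0
    while queue:
        current_dir = queue.popleft()
        with os.scandir(current_dir) as scanner:
            for entry in scanner:
                if entry.is_dir(follow_symlinks=False):
                    queue.append(entry.path)
                elif entry.is_file():
                    if pbar is not None:
                        pbar.update(1)
                    yield Path(entry.path)
                    count += 1
                    if max is not None and count >= max:
                        return

class Counter:
    """Stand-in for a tqdm progress bar."""
    def __init__(self):
        self.n = 0
    def update(self, k):
        self.n += k

with tempfile.TemporaryDirectory() as root:
    (Path(root) / "sub").mkdir()
    for i in range(5):
        (Path(root) / "sub" / f"f{i}.txt").write_text("x")
    pbar = Counter()
    files = list(scan_files(root, pbar=pbar, max=3))
    print(len(files), pbar.n)  # stops after 3 files
```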

Wednesday, 17 April 2024

Python: OSError: [Errno 28] No space left on device (nixos)

Your /tmp is full. By default, its size is half of the RAM https://superuser.com/questions/1288308/why-am-i-getting-no-space-left-on-device-when-it-appears-that-theres-lots-of

systemctl mask tmp.mount
or
export TMPDIR=/var/tmp
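You can verify from Python that tempfile honors TMPDIR. Note that tempfile caches the answer, so reset it after changing the environment:

```python
import os
import tempfile

os.environ["TMPDIR"] = "/var/tmp"
tempfile.tempdir = None  # drop the cached value so gettempdir() re-reads TMPDIR
print(tempfile.gettempdir())
```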

shell.nix: tmux, zsh, poetry, cuda, make(serve,env)

{ pkgs ? import <nixpkgs> {
  # Enable the installation of unfree packages
  config = {
    allowUnfree = true;
  };
} }:

pkgs.mkShell {
  buildInputs = with pkgs; [
    gcc
    zsh
    tmux
    cudaPackages_12_2.cudatoolkit
  ];

  shellHook = ''

    # Set the shell =================================================
    export SHELL=${pkgs.zsh}/bin/zsh
    export TMPDIR=/var/tmp # to avoid python and pip oom errors: https://superuser.com/questions/1288308/why-am-i-getting-no-space-left-on-device-when-it-appears-that-theres-lots-of

    source $(poetry env info --path)/bin/activate # instead of poetry shell

    # Setup the environment =========================================

    ## gcc
    export LD_LIBRARY_PATH=${pkgs.gcc}/lib64:${pkgs.stdenv.cc.cc.lib}/lib:$LD_LIBRARY_PATH

    ## cuda
    export CUDA_PATH=${pkgs.cudaPackages_12_2.cudatoolkit}
    export LD_LIBRARY_PATH=${pkgs.cudaPackages_12_2.cudatoolkit}/lib64:$LD_LIBRARY_PATH
    export LD_LIBRARY_PATH=/run/opengl-driver/lib:$LD_LIBRARY_PATH

    # Initialize tmux ===============================================

    # Prepare the tmux project layout ==============================

    # Get the name of the current directory to use as the tmux session name
    SESSION_NAME=$(basename "$(pwd)")

    if [ -z "$TMUX" ]; then

      # Window: Code editing
      tmux new-session -d -s "$SESSION_NAME" -n 'code'
      tmux send-keys -t "$SESSION_NAME:code" 'nvim .' C-m

      # Window: Shell with the env
      tmux new-window -t "$SESSION_NAME" -n 'shell'
      tmux send-keys -t "$SESSION_NAME:shell" 'make env' C-m

      # Window: Start server
      tmux new-window -t "$SESSION_NAME" -n 'server'
      tmux send-keys -t "$SESSION_NAME:server" 'make serve' C-m

      # Attach to code editing
      tmux select-window -t "$SESSION_NAME:code"
      tmux attach-session -t "$SESSION_NAME"
    fi

    exec ${pkgs.zsh}/bin/zsh
  '';
}

Next step

poetry add torch torchvision torchaudio

nixos, poetry: Install pytorch with cuda

https://github.com/tom-010/minimal-nixos-pytorch-example
/etc/nixos/configuration.nix:
environment.systemPackages = with pkgs; [
    cudatoolkit
    cudaPackages_12_2.cudatoolkit
  ];
shell.nix (thanks: https://discourse.nixos.org/t/installing-pytorch-into-a-virtual-python-environment/34720)
{ pkgs ? import <nixpkgs> {} }:

# add unstable
pkgs.mkShell {
  buildInputs = with pkgs; [
    gcc
    cudaPackages_12_2.cudatoolkit
  ];

  shellHook = ''
    export LD_LIBRARY_PATH=${pkgs.gcc}/lib64:${pkgs.stdenv.cc.cc.lib}/lib:$LD_LIBRARY_PATH
    # export SHELL=${pkgs.zsh}/bin/zsh
    export CUDA_PATH=${pkgs.cudaPackages_12_2.cudatoolkit}
    export LD_LIBRARY_PATH=${pkgs.cudaPackages_12_2.cudatoolkit}/lib64:$LD_LIBRARY_PATH
    export LD_LIBRARY_PATH=/run/opengl-driver/lib:$LD_LIBRARY_PATH
    source $(poetry env info --path)/bin/activate # instead of poetry shell
    poetry run python3 gpu/main.py
  '';
}
poetry:
poetry add torch torchaudio torchvision

Tuesday, 16 April 2024

unstructured.io API

docker pull quay.io/unstructured-io/unstructured-api
A 20 GB image. After the docker pull:
docker image inspect --format '{{json .}}' "fd79888e10ea" | jq -r '. | {Id: .Id, Digest: .Digest, RepoDigests: .RepoDigests, Labels: .Config.Labels}'
To get a fixed tag for the docker-compose:
  unstructured:
    image: quay.io/unstructured-io/unstructured-api@sha256:612d85e7a8d4816b1c71119a285238fc3bbb822f78cb00d8b47e32ef08c08031
    ports:
      - "8123:8000"

And use it:
curl -X 'POST' \
  'http://localhost:8123/general/v0/general' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'files=@example-docs/english-and-korean.png' \
  -F 'strategy=ocr_only' \
  -F 'ocr_languages=eng'  \
  -F 'ocr_languages=kor'  \
  | jq -C . | less -R
Image from here: https://github.com/Unstructured-IO/unstructured/blob/main/example-docs/english-and-korean.png (you may want to use the whole example-docs dir).

xxhash is not faster than md5 on my files

xxHash claims to be "an extremely fast hash algorithm, processing at RAM speed limits".

100 passes over 181 documents

MD5:
    start = timeit.default_timer()
    for i in tqdm(range(100)):
        total = 0
        for path in paths:
            if not path.is_file():
                continue
            hash = hashlib.md5(path.read_bytes()).hexdigest()
            total += 1
    print(total)
    end = timeit.default_timer()
    print(f"Time: {end - start}")
181
Time: 10.780842124004266
xxhash:
    start = timeit.default_timer()
    for i in tqdm(range(100)):
        total = 0
        for path in paths:
            if not path.is_file():
                continue
            hash = xxhash.xxh64(path.read_bytes()).hexdigest()
            total += 1
    print(total)
    end = timeit.default_timer()
    print(f"Time: {end - start}")
181
Time: 10.775027380004758

Not significantly faster. I guess most of the time is spent in IO. I'll go with MD5 because it's more common, familiar to others, and implemented everywhere.
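A quick way to back up the IO guess: hash bytes that are already in RAM and see how fast MD5 alone runs. The exact MiB/s is machine-dependent; the point is only that it is far above SSD read speeds:

```python
import hashlib
import timeit

data = b"\x00" * (1 << 20)  # 1 MiB, already in RAM: no disk IO involved
runs = 100
t = timeit.timeit(lambda: hashlib.md5(data).digest(), number=runs)
print(f"MD5 over {runs} MiB took {t:.3f}s -> {runs / t:.0f} MiB/s")
```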

Monday, 15 April 2024

My poetry shell.nix

https://gist.github.com/tom-010/ddb82ddcbbad2892ab60837af9dc3e18
{ pkgs ? import <nixpkgs> {} }:

pkgs.mkShell {
  buildInputs = [
    pkgs.gcc
    pkgs.zsh
    pkgs.tmux
  ];

  shellHook = ''

    # Set the shell =================================================
    export SHELL=${pkgs.zsh}/bin/zsh
    source $(poetry env info --path)/bin/activate # instead of poetry shell

    # Setup the environment =========================================

    export LD_LIBRARY_PATH=${pkgs.gcc}/lib64:${pkgs.stdenv.cc.cc.lib}/lib:$LD_LIBRARY_PATH

    # Initialize tmux ===============================================

    # Prepare the tmux project layout ==============================

    # Get the name of the current directory to use as the tmux session name
    SESSION_NAME=$(basename "$(pwd)")

    if [ -z "$TMUX" ]; then
      # Primary: Start nvim
      tmux new-session -d -s "$SESSION_NAME" -n nvim
      tmux send-keys -t "$SESSION_NAME:nvim" 'nvim .' C-m
      # Secondary: Start server
      tmux new-window -t "$SESSION_NAME" -n 'server'
      tmux send-keys -t "$SESSION_NAME:server" 'make serve' C-m
      # Attach to nvim
      tmux select-window -t "$SESSION_NAME:nvim"
      tmux attach-session -t "$SESSION_NAME"
    fi

    exec ${pkgs.zsh}/bin/zsh
  '';
}

shell.nix for zsh, tmux, poetry

{ pkgs ? import <nixpkgs> {} }:

pkgs.mkShell {
  buildInputs = [
    pkgs.gcc
    pkgs.zsh
    pkgs.tmux
  ];

  shellHook = ''
    export SHELL=${pkgs.zsh}/bin/zsh
    export LD_LIBRARY_PATH=${pkgs.gcc}/lib64:${pkgs.stdenv.cc.cc.lib}/lib:$LD_LIBRARY_PATH
    echo "LD_LIBRARY_PATH adjusted to include libstdc++.so.6"

    # Get the name of the current directory to use as the tmux session name
    SESSION_NAME=$(basename "$(pwd)")

    # Check if tmux is already running, if not start it with the session name as the current directory name
    if [ -z "$TMUX" ]; then
      tmux new-session -A -s "$SESSION_NAME" || tmux attach-session -t "$SESSION_NAME"
    fi

    poetry shell
    exec ${pkgs.zsh}/bin/zsh
  '';
}

Parse Wikipedia dump

""" This module processes Wikipedia dump files by extracting individual articles and parsing them into a structured format, ...