Tuesday, 14 May 2024

Git clone with config

Got some errors while cloning:

Cloning into 'tika'...
remote: Enumerating objects: 5533, done.
remote: Counting objects: 100% (5533/5533), done.
remote: Compressing objects: 100% (3492/3492), done.
error: RPC failed; curl 92 HTTP/2 stream 5 was not closed cleanly: CANCEL (err 8)
error: 4042 bytes of body are still expected
fetch-pack: unexpected disconnect while reading sideband packet
fatal: early EOF
fatal: fetch-pack: invalid index-pack output

Solved it via:

git -c http.postBuffer=524288000 -c http.version=HTTP/1.1 -c core.compression=0 clone https://github.com/apache/tika --depth=1

No idea which params are actually required; the important thing is that it gives me levers.
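If the one-off flags fix the clone, the same levers can be persisted so future fetches use them too. A sketch using standard git config keys (with `--global` they apply everywhere; drop `--global` inside a repo to scope them there — setting `core.compression 0` globally is a blunt instrument, so per-repo is probably saner):

```shell
# Persist the levers so later clones/fetches pick them up
git config --global http.postBuffer 524288000   # bigger buffer for HTTP transfers
git config --global http.version HTTP/1.1       # sidestep flaky HTTP/2 streams
git config --global core.compression 0          # disable compression on the wire
git config --global --get http.version          # prints: HTTP/1.1
```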

Saturday, 20 April 2024

IO bottleneck when hashing: How many threads are best?

I want to hash all files in a directory. IO (to external SSD drive) is the bottleneck. How many threads should I use? Does it even matter?

import hashlib
import timeit
from multiprocessing.pool import ThreadPool
from pathlib import Path

from tqdm.notebook import tqdm

def hash_path(path: str) -> str:
    # Incremental MD5 in 4 KiB chunks
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(4096), b''):
            h.update(chunk)
    return h.hexdigest()

path = Path('/media/usb/dynamic_docs')
paths = [p for p in path.glob('**/*') if p.is_file()]
paths = paths[:1000]

x = [1, 2, 3, 4, 6, 8, 10, 15, 20, 25, 30, 40, 50, 70, 80, 120, 140]
y = []
for n in tqdm(x):
    with ThreadPool(n) as pool:
        start = timeit.default_timer()
        pool.map(hash_path, paths)
        end = timeit.default_timer()
        duration = end - start
        print(n, duration)
        y.append(duration)
1 5.8160336340006324
2 3.622440821993223
3 3.1050608700024895
4 2.9156927020012517
6 2.775160079996567
8 2.7484407580050174
10 2.7938896600026055
15 2.8651001560065197
20 2.8214209449943155
25 2.7490319029966486
30 2.8025731050001923
40 2.8451450540014775
50 2.6865460660046665
70 2.8644705260012415
80 2.8269265680064564
120 2.73746527700132
140 2.775528302998282
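For picking a default, the plateau above lines up with what the stdlib assumes for I/O-bound work: `concurrent.futures.ThreadPoolExecutor` defaults to `min(32, cpu_count + 4)` workers on CPython 3.8+. A self-contained sketch (hashing synthetic temp files instead of the SSD paths):

```python
import hashlib
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def hash_path(path: Path) -> str:
    # Incremental MD5 in 64 KiB chunks
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    paths = [Path(tmp) / f"file_{i}.bin" for i in range(8)]
    for p in paths:
        p.write_bytes(os.urandom(4096))
    # No max_workers given: defaults to min(32, cpu_count + 4)
    with ThreadPoolExecutor() as pool:
        digests = list(pool.map(hash_path, paths))
    print(len(digests))  # 8
```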

Takeaway: throughput saturates around 6-8 threads at ~2.7-2.8 s; past that, the exact thread count barely matters.

Thursday, 18 April 2024

Python MD5: incremental vs naive

Incremental hashing is much easier on RAM, which is especially relevant when you have hundreds of threads doing MD5 in parallel. But how much slower is it?

Naive

        paths = list(self._scan_files(self.root, max=1000))

        start = timeit.default_timer()

        for path in tqdm(paths):
            # md5 of path
            file_hash = hashlib.md5(path.read_bytes()).hexdigest()

        end = timeit.default_timer()
        print(f'Elapsed time: {end - start:.2f} seconds')
Elapsed time: 24.46 seconds

Incremental


        for path in tqdm(paths):
            # incremental md5 of path
            h = hashlib.md5()
            with path.open("rb") as file:
                for chunk in iter(lambda: file.read(4096), b""):
                    h.update(chunk)
            file_hash = h.hexdigest()
Elapsed time: 19.49 seconds

I can hardly believe that it is actually faster... Maybe _scan_files returned different files? But I ran it twice and got the same result. Incremental seems to be way better, as:

  • It is far easier on memory, especially with many parallel threads
  • Maybe it is even faster?
  • More control: I could stream the read bytes to a remote server and therefore read the file only once

When in doubt, take the incremental one; it is only a little bit more code to maintain.
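The "read the file only once" point can be sketched like this: one pass over the file feeds each chunk to both the MD5 state and an arbitrary consumer (`sink` is a hypothetical callback here, e.g. an upload function):

```python
import hashlib
from pathlib import Path
from typing import Callable

def hash_and_forward(path: Path, sink: Callable[[bytes], None]) -> str:
    """Read the file once: every chunk goes to the MD5 state and to sink()."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            h.update(chunk)
            sink(chunk)  # e.g. stream the chunk to a remote server
    return h.hexdigest()
```

With `sink=upload_chunk` the hash comes for free on top of the upload, without ever buffering the whole file.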

Python: glob is even slower than pure python

Not as expected: glob takes twice as long as a pure Python implementation. The latter even gives you more control.

Here the glob version

import timeit
from pathlib import Path

class Crawler:

    def __init__(self):
        self.root = Path('/run/media/tom/external')
        assert self.root.exists(), f'{self.root} does not exist'

    def work(self):
        start = timeit.default_timer()
        list(self.root.glob('**/*'))
        end = timeit.default_timer()
        print(f'Elapsed time: {end - start}')

This outputs:

Elapsed time: 5.392456786998082

A pure python implementation:

import os
import timeit
from collections import deque
from pathlib import Path
from typing import Iterator

class Crawler:

    def __init__(self):
        self.root = Path('/run/media/tom/external')
        assert self.root.exists(), f'{self.root} does not exist'

    def work(self):
        start = timeit.default_timer()
        list(self._scan_files(self.root))
        end = timeit.default_timer()
        print(f'Elapsed time: {end - start}')

    def _scan_files(self, directory, pbar=None) -> Iterator[Path]:
        '''
        Instead of glob('**/*'). They say it's faster, but I didn't check.
        '''
        queue = deque([directory])  # Start with the root directory
        while queue:
            current_dir = queue.popleft()  # Get one directory from the queue
            with os.scandir(current_dir) as scanner:
                for entry in scanner:
                    if entry.is_dir(follow_symlinks=False):
                        # Add sub-directories to the queue
                        queue.append(entry.path)
                    elif entry.is_file():
                        if pbar is not None:
                            pbar.update(1)
                        yield Path(entry.path)  # Yield each file path

Gives

Elapsed time: 2.4762056120016496

More than twice as fast! I did not expect that at all!

It even gives you more flexibility, like a progress bar or a max-elements cap (e.g. for prototyping):

import os
import timeit
from collections import deque
from pathlib import Path
from typing import Iterator

class Crawler:

    def __init__(self):
        self.root = Path('/run/media/tom/external')
        assert self.root.exists(), f'{self.root} does not exist'

    def work(self):
        start = timeit.default_timer()
        list(self._scan_files(self.root))
        end = timeit.default_timer()
        print(f'Elapsed time: {end - start}')

    def _scan_files(self, directory, pbar=None, max=None) -> Iterator[Path]:
        '''
        Instead of glob('**/*'). They say it's faster, but I didn't check.
        '''
        queue = deque([directory])  # Start with the root directory
        count = 0
        while queue:
            current_dir = queue.popleft()  # Get one directory from the queue
            with os.scandir(current_dir) as scanner:
                for entry in scanner:
                    if entry.is_dir(follow_symlinks=False):
                        # Add sub-directories to the queue
                        queue.append(entry.path)
                    elif entry.is_file():
                        if pbar is not None:
                            pbar.update(1)
                        yield Path(entry.path)  # Yield each file path
                        count += 1
                        if max is not None and count >= max:
                            return
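To show the cap in action, here is a quick, self-contained demo against a synthetic directory. The free function scan_files mirrors Crawler._scan_files above (the max parameter name is kept from the original, even though it shadows the builtin):

```python
import os
import tempfile
from collections import deque
from pathlib import Path
from typing import Iterator, Optional

def scan_files(directory, max: Optional[int] = None) -> Iterator[Path]:
    # Breadth-first os.scandir walk, optionally capped at max files
    queue = deque([directory])
    count = 0
    while queue:
        current_dir = queue.popleft()
        with os.scandir(current_dir) as scanner:
            for entry in scanner:
                if entry.is_dir(follow_symlinks=False):
                    queue.append(entry.path)
                elif entry.is_file():
                    yield Path(entry.path)
                    count += 1
                    if max is not None and count >= max:
                        return

with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sub").mkdir()
    for i in range(10):
        (Path(tmp) / "sub" / f"f{i}.txt").write_text("x")
    print(len(list(scan_files(tmp, max=3))))  # 3
```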

Wednesday, 17 April 2024

Python: OSError: [Errno 28] No space left on device (nixos)

Your /tmp is full. By default it is a tmpfs whose size is half of your RAM: https://superuser.com/questions/1288308/why-am-i-getting-no-space-left-on-device-when-it-appears-that-theres-lots-of

systemctl mask tmp.mount
or
export TMPDIR=/var/tmp
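To confirm the diagnosis before changing anything, check whether /tmp really is a RAM-backed tmpfs and how full it is:

```shell
df -h /tmp             # size and usage; on a tmpfs the size is typically RAM/2
findmnt -n /tmp || true  # fs type (tmpfs) and options; empty if not a separate mount
```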

shell.nix: tmux, zsh, poetry, cuda, make(serve,env)

{ pkgs ? import <nixpkgs> {
  # Enable the installation of unfree packages
  config = {
    allowUnfree = true;
  };
} }:

pkgs.mkShell {
  buildInputs = with pkgs; [
    gcc
    zsh
    tmux
    cudaPackages_12_2.cudatoolkit
  ];

  shellHook = ''

    # Set the shell =================================================
    export SHELL=${pkgs.zsh}/bin/zsh
    export TMPDIR=/var/tmp # to avoid python and pip oom errors: https://superuser.com/questions/1288308/why-am-i-getting-no-space-left-on-device-when-it-appears-that-theres-lots-of

    source $(poetry env info --path)/bin/activate # instead of poetry shell

    # Setup the environment =========================================

    ## gcc
    export LD_LIBRARY_PATH=${pkgs.gcc}/lib64:${pkgs.stdenv.cc.cc.lib}/lib:$LD_LIBRARY_PATH

    ## cuda
    export CUDA_PATH=${pkgs.cudaPackages_12_2.cudatoolkit}
    export LD_LIBRARY_PATH=${pkgs.cudaPackages_12_2.cudatoolkit}/lib64:$LD_LIBRARY_PATH
    export LD_LIBRARY_PATH=/run/opengl-driver/lib:$LD_LIBRARY_PATH

    # Initialize tmux ===============================================

    # Prepare the tmux project layout ==============================

    # Get the name of the current directory to use as the tmux session name
    SESSION_NAME=$(basename "$(pwd)")

    if [ -z "$TMUX" ]; then

      # Window: Code editing
      tmux new-session -d -s "$SESSION_NAME" -n 'code'
      tmux send-keys -t "$SESSION_NAME:code" 'nvim .' C-m

      # Window: Shell with the env
      tmux new-window -t "$SESSION_NAME" -n 'shell'
      tmux send-keys -t "$SESSION_NAME:shell" 'make env' C-m

      # Window: Start server
      tmux new-window -t "$SESSION_NAME" -n 'server'
      tmux send-keys -t "$SESSION_NAME:server" 'make serve' C-m

      # Attach to code editing
      tmux select-window -t "$SESSION_NAME:code"
      tmux attach-session -t "$SESSION_NAME"
    fi

    exec ${pkgs.zsh}/bin/zsh
  '';
}

Next step

poetry add torch torchvision torchaudio

nixos, poetry: Install pytorch with cuda

https://github.com/tom-010/minimal-nixos-pytorch-example
/etc/nixos/configuration.nix:
environment.systemPackages = with pkgs; [
    cudatoolkit
    cudaPackages_12_2.cudatoolkit
  ];
shell.nix (thanks: https://discourse.nixos.org/t/installing-pytorch-into-a-virtual-python-environment/34720)
{ pkgs ? import <nixpkgs> {} }:

# add unstable
pkgs.mkShell {
  buildInputs = with pkgs; [
    gcc
    cudaPackages_12_2.cudatoolkit
  ];

  shellHook = ''
    export LD_LIBRARY_PATH=${pkgs.gcc}/lib64:${pkgs.stdenv.cc.cc.lib}/lib:$LD_LIBRARY_PATH
    # export SHELL=${pkgs.zsh}/bin/zsh
    export CUDA_PATH=${pkgs.cudaPackages_12_2.cudatoolkit}
    export LD_LIBRARY_PATH=${pkgs.cudaPackages_12_2.cudatoolkit}/lib64:$LD_LIBRARY_PATH
    export LD_LIBRARY_PATH=/run/opengl-driver/lib:$LD_LIBRARY_PATH
    source $(poetry env info --path)/bin/activate # instead of poetry shell
    poetry run python3 gpu/main.py
  '';
}
poetry:
poetry add torch torchaudio torchvision

Parse Wikipedia dump

""" This module processes Wikipedia dump files by extracting individual articles and parsing them into a structured format, ...