Tuesday, 14 May 2024

Git clone with config

Got some errors while cloning:

Cloning into 'tika'...
remote: Enumerating objects: 5533, done.
remote: Counting objects: 100% (5533/5533), done.
remote: Compressing objects: 100% (3492/3492), done.
error: RPC failed; curl 92 HTTP/2 stream 5 was not closed cleanly: CANCEL (err 8)
error: 4042 bytes of body are still expected
fetch-pack: unexpected disconnect while reading sideband packet
fatal: early EOF
fatal: fetch-pack: invalid index-pack output

Solved it via:

git -c http.postBuffer=524288000 -c http.version=HTTP/1.1 -c core.compression=0 clone https://github.com/apache/tika --depth=1

No idea which of these params are actually required; the important thing is that they give me levers to pull.
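
To find out which lever actually matters, one option is to retry the clone with every subset of the three overrides, smallest first. Below is a rough sketch of that idea, not something I ran against the original failure. The helper names (`build_clone_command`, `lever_subsets`, `find_minimal_config`) are made up for illustration; only the `git -c key=value clone` form and the three settings come from the command above.

```python
import subprocess
import tempfile
from itertools import combinations

# The three overrides from the command above; which ones are needed is unknown.
LEVERS = {
    "http.postBuffer": "524288000",
    "http.version": "HTTP/1.1",
    "core.compression": "0",
}


def build_clone_command(url, config, depth=1):
    """Return the argv list for a shallow `git clone` with one-off -c overrides."""
    cmd = ["git"]
    for key, value in config.items():
        cmd += ["-c", f"{key}={value}"]
    cmd += ["clone", f"--depth={depth}", url]
    return cmd


def lever_subsets(levers):
    """Yield every subset of the levers as a dict, smallest subsets first."""
    keys = list(levers)
    for size in range(len(keys) + 1):
        for combo in combinations(keys, size):
            yield {key: levers[key] for key in combo}


def find_minimal_config(url, levers=LEVERS):
    """Return the first (smallest) subset whose clone succeeds, or None.

    Hypothetical helper: each attempt clones into a throwaway directory.
    """
    for config in lever_subsets(levers):
        with tempfile.TemporaryDirectory() as tmp:
            cmd = build_clone_command(url, config) + [f"{tmp}/repo"]
            result = subprocess.run(cmd, capture_output=True)
            if result.returncode == 0:
                return config
    return None
```

Calling `find_minimal_config("https://github.com/apache/tika")` would run up to eight clones. If the original failure was an intermittent network problem, a single success proves little, so each subset would need a few repeats before drawing conclusions.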

Parse Wikipedia dump

""" This module processes Wikipedia dump files by extracting individual articles and parsing them into a structured format, ...