gh-121109: Fix performance of tarfile reading with "r|*"#121296
gh-121109: Fix performance of tarfile reading with "r|*"#121296TomiBelan wants to merge 3 commits into
Conversation
| if len(t) > size: | ||
| raise ReadError("decompress() returned too much data") |
There was a problem hiding this comment.
Do you see any scenario where this would be triggered? Looking at the zlib, bz2 and lzma decompressor docs for max_length, it looks like this shouldn't occur?
I have no issue with it being here but just checking if I'm missing something :)
There was a problem hiding this comment.
Right - it can only happen if there is a bug in the zlib, bz2 or lzma decompressor. I haven't checked their C source, but their docs say it should not occur.
It's not a necessary check, but I figured I'd add it just in case.
|
This PR is stale because it has been open for 30 days with no activity. |
|
My dearest stale bot, I wish it was only 30 days! 😢 |
|
Re @serhiy-storchaka #121109 (comment)
It's not needed. Small reads are already well-exercised by test_tarfile.py, especially StreamReadTest. When I add a print() to _Stream._read(), it shows a variety of This becomes clearer when you realize _Stream is just the outer shell, and the tar format parser itself also needs to read small chunks sometimes. But all right. I also tested it with this script, which succeeded. rm -rf data; mkdir data; for i in 1 2 3; do head -c1M /dev/zero | tr '\0' 'x' > data/$i.dat; done
tar caf test1M.tar.gz data ; tar caf test1M.tar.xz data ; tar caf test1M.tar.bz2 data ; tar caf test1M.tar.zst data
rm -rf data; mkdir data; for i in 1 2 3; do head -c100M /dev/zero | tr '\0' 'x' > data/$i.dat; done
tar caf test100M.tar.gz data ; tar caf test100M.tar.xz data ; tar caf test100M.tar.bz2 data ; tar caf test100M.tar.zst dataimport sys
import tarfile
for filename in ('test1M.tar.gz', 'test1M.tar.xz', 'test1M.tar.bz2', 'test1M.tar.zst'):
for mode in ('r|*', 'r:*'):
for chunk_size in (1, 10000, 500000):
print('running:', filename, mode, chunk_size, file=sys.stderr)
with tarfile.open(filename, mode) as tf:
for tarinfo in tf:
if tarinfo.isreg():
with tf.extractfile(tarinfo) as extractf:
total = 0
while True:
buf = extractf.read(chunk_size)
if not buf: break
total += len(buf)
assert buf == b'x' * len(buf)
assert len(buf) == chunk_size or total == tarinfo.sizeFull disclosure: this script does what you asked for, but it actually isn't a very good test. extractfile() returns a io.BufferedReader. So the 1 byte read and the 10000 byte read both become 131072 byte reads. And a benchmark: import sys
import time
import tarfile
for filename in ('test100M.tar.gz', 'test100M.tar.xz', 'test100M.tar.bz2', 'test100M.tar.zst'):
for mode in ('r|*', 'r:*'):
print('running:', filename, mode, file=sys.stderr)
start = time.time()
with tarfile.open(filename, mode) as tf:
tf.list()
print('took', time.time() - start, file=sys.stderr)I got 1.3, 1.2, 1.9, 1.5, 1.1, 1.1, 0.2, 0.2 seconds. (This is a different machine than last time.) |
|
I made some changes:
|
This PR fixes #121109.
Using the test files and test script described in the issue:
test.tar.gzr:*test.tar.gzr|*test.tar.xzr:*test.tar.xzr|*test.tar.bz2r:*test.tar.bz2r|*After this PR, tf.list() of
r|*is the same speed asr:*, as expected. Not orders of magnitude slower.