Python zlib decompress error

Contents

zlib.error: Error -3 while decompressing data: incorrect data check #422
Comments
zlib.error: Error -3 while decompressing: incorrect header check
choosing windowBits
examples
automatic header detection (zlib or gzip)
using gzip instead


zlib.error: Error -3 while decompressing data: incorrect data check #422

Labels: Has MCVE, is-bug, is-robustness-issue

Comments

We have to deal with a huge number of broken PDF files. The creator is «jsPDF 1.x-master». These files are not totally corrupted, so it would be nice to get the readable content.
I found a solution on stackoverflow and it works fine for our needs.

PyPDF2/filters.py

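    import zlib
    from StringIO import StringIO  # Python 2; on Python 3 use io.BytesIO instead

    def decompress(data):
        try:
            return zlib.decompress(data)
        except zlib.error:
            return decompress_corrupted(data)

    def decompress_corrupted(data):
        # wbits = MAX_WBITS | 32 auto-detects a zlib or gzip header
        d = zlib.decompressobj(zlib.MAX_WBITS | 32)
        f = StringIO(data)
        result_str = b''
        buffer = f.read(1)
        try:
            # feed one byte at a time so everything decoded before the error is kept
            while buffer:
                result_str += d.decompress(buffer)
                buffer = f.read(1)
        except zlib.error:
            pass
        return result_str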


This fix helped me deal with an "Error -5 while decompressing data: incomplete or truncated stream", caused by what I think is improper end-of-line handling of a byte stream (Windows \r\n vs. Linux \n). I only had to replace f = StringIO(data) with f = BytesIO(data).
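For readers on Python 3, the workaround from this thread might look roughly like the sketch below. It keeps the function names from the snippet above and uses BytesIO as suggested in the comment; the sample payload and the truncation are made up purely for illustration.

```python
import zlib
from io import BytesIO


def decompress_corrupted(data: bytes) -> bytes:
    """Return whatever can be decoded before zlib reports an error."""
    d = zlib.decompressobj(zlib.MAX_WBITS | 32)  # auto-detect zlib or gzip header
    f = BytesIO(data)
    result = b""
    buffer = f.read(1)
    try:
        while buffer:
            result += d.decompress(buffer)
            buffer = f.read(1)
    except zlib.error:
        pass  # keep the bytes recovered so far
    return result


def decompress(data: bytes) -> bytes:
    try:
        return zlib.decompress(data)
    except zlib.error:
        return decompress_corrupted(data)


# Illustrative input: a valid stream cut off inside its trailing checksum.
payload = b"hello world " * 50
truncated = zlib.compress(payload)[:-3]
recovered = decompress(truncated)
```

Feeding single bytes is slow on large streams, but it guarantees that no output produced before the failing byte is lost.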

From time to time we come across files where no text is detected, although readers like evince display the PDF correctly. After some digging, it turned out that the PDF contains badly compressed data. This can be worked around by decompressing byte by byte and ignoring the error on the final check bytes (empirically found to be the last 3). This was largely inspired by py-pdf/pypdf#422, and the test file was taken from there, so credit goes to @zegrep.

From time to time we come across files where no text is detected, although readers like evince display the PDF correctly. After some digging, it turned out that the PDF contains badly compressed data (an incorrect checksum). This can be worked around by decompressing byte by byte and ignoring the error on the checksum bytes (empirically found to be the last 4, which is consistent with a 32-bit checksum). This was largely inspired by py-pdf/pypdf#422, and the test file was taken from there, so credit goes to @zegrep.
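The "last 4 bytes" observation matches the zlib container format (RFC 1950), which ends with a big-endian Adler-32 checksum of the uncompressed data. The sketch below is not the code from the referenced change; it is a hypothetical helper (`split_zlib_stream`) that decodes the deflate body in raw mode and compares the trailer by hand, assuming a plain 2-byte header with no preset dictionary.

```python
import zlib


def split_zlib_stream(data: bytes):
    """Decode the deflate body of a zlib stream and check the trailer by hand.

    Assumes a plain 2-byte zlib header (no preset dictionary).
    Returns (payload, checksum_ok).
    """
    d = zlib.decompressobj(-zlib.MAX_WBITS)   # raw deflate: no header, no checksum
    payload = d.decompress(data[2:])          # skip the 2-byte zlib header
    payload += d.flush()
    trailer = d.unused_data[:4]               # bytes left after the deflate body
    stored = int.from_bytes(trailer, "big")   # Adler-32 is stored big-endian
    checksum_ok = len(trailer) == 4 and stored == zlib.adler32(payload)
    return payload, checksum_ok


good = zlib.compress(b"some page content " * 20)
bad = good[:-1] + bytes([good[-1] ^ 0xFF])    # flip bits in the last checksum byte

print(split_zlib_stream(good)[1])   # True
print(split_zlib_stream(bad)[1])    # False
```

In both cases the payload is returned in full; only the checksum verdict differs, which is why skipping the check recovers the text.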


In some cases errors occur during zlib decompression (e.g. I have a PDF with a text overlay; it is the same issue as the one documented in py-pdf#422). With this change, decompression works without errors.

* Attempt to handle decompression error on some broken PDF files
* Use a warning instead of raising an exception when a zlib error is detected before the CRC checksum.
* Add line to CHANGELOG.md
* Only try decompressing if not in strict mode
* Change error into warning because warnings.warn needs a subclass of Warning

Co-authored-by: Sylvain Thénault
Co-authored-by: Pieter Marsman
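The behaviour described in these commit messages (warn instead of raise, and only attempt recovery outside strict mode) could be sketched as follows. The `strict` parameter and the `CorruptStreamWarning` class are hypothetical names chosen for this illustration, not the project's actual API.

```python
import warnings
import zlib


class CorruptStreamWarning(UserWarning):
    """Hypothetical warning class; warnings.warn() needs a Warning subclass."""


def decompress(data: bytes, strict: bool = False) -> bytes:
    try:
        return zlib.decompress(data)
    except zlib.error as exc:
        if strict:
            raise                      # strict mode: surface the original error
        d = zlib.decompressobj()
        recovered = b""
        try:
            for i in range(len(data)):
                recovered += d.decompress(data[i:i + 1])
        except zlib.error:
            pass                       # keep what was decoded before the failure
        warnings.warn(
            f"corrupt zlib stream, recovered {len(recovered)} bytes: {exc}",
            CorruptStreamWarning,
        )
        return recovered
```

Callers that need hard guarantees pass strict=True and get the original zlib.error; everyone else gets the salvaged bytes plus a warning they can filter or log.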

MartinThoma added the is-bug, Has MCVE and is-robustness-issue labels on Apr 7, 2022.

Source

zlib.error: Error -3 while decompressing: incorrect header check

Update: dnozay's answer explains the problem and should be the accepted answer. Try the gzip module; the code below is straight from the Python docs.

    import gzip
    f = gzip.open('/home/joe/file.txt.gz', 'rb')
    file_content = f.read()
    f.close()

It gives the same error:

    Traceback (most recent call last):
      File "", line 1, in
      File "/usr/lib/python2.6/gzip.py", line 212, in read
        self._read(readsize)
      File "/usr/lib/python2.6/gzip.py", line 271, in _read
        uncompress = self.decompress.decompress(buf)
    zlib.error: Error -3 while decompressing: invalid code lengths set

@VarunVyas, sorry, I can't reproduce your error. It must have something to do with your input data. Was your input file generated with gzip? Does gunzip on the command line decompress it correctly?

zlib.error: Error -3 while decompressing: incorrect header check

You most likely get this error because zlib is checking for a header that is not there, e.g. your data follows RFC 1951 (deflate compressed format) rather than RFC 1950 (zlib compressed format) or RFC 1952 (gzip compressed format).
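To see which of the three formats a blob is likely to be, one can look at its first two bytes. The helper below (`guess_format`, a hypothetical name) is only a heuristic sketch: gzip has a fixed magic number, a zlib header satisfies a simple divisibility rule from RFC 1950, and anything else is assumed to be raw deflate, which has no signature at all.

```python
import zlib


def guess_format(data: bytes) -> str:
    """Heuristic guess: 'gzip', 'zlib', or 'deflate' (raw)."""
    if data[:2] == b"\x1f\x8b":
        return "gzip"                      # RFC 1952 magic number
    if len(data) >= 2:
        cmf, flg = data[0], data[1]
        # RFC 1950: compression method 8 and a header that is a multiple of 31
        if cmf & 0x0F == 8 and (cmf * 256 + flg) % 31 == 0:
            return "zlib"
    return "deflate"                       # no recognizable header: assume RFC 1951


print(guess_format(zlib.compress(b"test")))   # zlib
```

Because raw deflate data can start with almost any byte, a "zlib" verdict can occasionally be a false positive; treat the result as a hint, not a proof.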

choosing windowBits

But zlib can decompress all of these formats:

to (de-)compress deflate format, use wbits = -zlib.MAX_WBITS
to (de-)compress zlib format, use wbits = zlib.MAX_WBITS
to (de-)compress gzip format, use wbits = zlib.MAX_WBITS | 16
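On Python 3 the same three settings can be wrapped in a small lookup table. This is a minimal sketch; the `WBITS` dictionary and the two helper functions are names invented here, and the input must be bytes rather than str.

```python
import zlib

WBITS = {
    "deflate": -zlib.MAX_WBITS,      # raw stream, RFC 1951
    "zlib": zlib.MAX_WBITS,          # RFC 1950
    "gzip": zlib.MAX_WBITS | 16,     # RFC 1952
}


def compress_as(fmt: str, data: bytes) -> bytes:
    c = zlib.compressobj(9, zlib.DEFLATED, WBITS[fmt])
    return c.compress(data) + c.flush()


def decompress_as(fmt: str, data: bytes) -> bytes:
    return zlib.decompress(data, WBITS[fmt])


for fmt in WBITS:
    blob = compress_as(fmt, b"test")
    print(fmt, decompress_as(fmt, blob))
```

Using the wrong entry for a given blob reproduces the "incorrect header check" error discussed above.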

examples

    >>> import zlib
    >>> # Python 2 session; on Python 3, text must be bytes (b'test')
    >>> deflate_compress = zlib.compressobj(9, zlib.DEFLATED, -zlib.MAX_WBITS)
    >>> zlib_compress = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS)
    >>> gzip_compress = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS | 16)
    >>>
    >>> text = '''test'''
    >>> deflate_data = deflate_compress.compress(text) + deflate_compress.flush()
    >>> zlib_data = zlib_compress.compress(text) + zlib_compress.flush()
    >>> gzip_data = gzip_compress.compress(text) + gzip_compress.flush()

    >>> zlib.decompress(zlib_data)
    'test'

    >>> zlib.decompress(deflate_data)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    zlib.error: Error -3 while decompressing data: incorrect header check
    >>> zlib.decompress(deflate_data, -zlib.MAX_WBITS)
    'test'

    >>> zlib.decompress(gzip_data)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    zlib.error: Error -3 while decompressing data: incorrect header check
    >>> zlib.decompress(gzip_data, zlib.MAX_WBITS|16)
    'test'

The data is also compatible with the gzip module:

    >>> import gzip
    >>> import StringIO
    >>> fio = StringIO.StringIO(gzip_data)
    >>> f = gzip.GzipFile(fileobj=fio)
    >>> f.read()
    'test'
    >>> f.close()
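The session above is Python 2. On Python 3 the StringIO module no longer exists, so an equivalent check would use io.BytesIO, as in this sketch (variable names follow the session above):

```python
import gzip
import io
import zlib

# Build gzip-framed data with zlib, exactly as in the examples above.
c = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS | 16)
gzip_data = c.compress(b"test") + c.flush()

# Read it back through the gzip module from an in-memory file object.
with gzip.GzipFile(fileobj=io.BytesIO(gzip_data)) as f:
    content = f.read()

print(content)                       # b'test'
print(gzip.decompress(gzip_data))    # b'test'
```

gzip.decompress is the shorter route when the whole payload is already in memory.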

automatic header detection (zlib or gzip)

adding 32 to windowBits will trigger header detection

    >>> zlib.decompress(gzip_data, zlib.MAX_WBITS|32)
    'test'
    >>> zlib.decompress(zlib_data, zlib.MAX_WBITS|32)
    'test'
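Header detection covers zlib and gzip only; raw deflate data has no header to detect. One possible way to handle all three formats is to try detection first and fall back to raw mode, as in this Python 3 sketch (`decompress_any` is a hypothetical helper name):

```python
import zlib


def decompress_any(data: bytes) -> bytes:
    try:
        # zlib or gzip: let zlib pick based on the header
        return zlib.decompress(data, zlib.MAX_WBITS | 32)
    except zlib.error:
        # no valid header found: assume a raw deflate stream
        return zlib.decompress(data, -zlib.MAX_WBITS)
```

If the data is not deflate at all, the second call raises zlib.error, so genuinely invalid input is still reported.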

using gzip instead

Or you can ignore zlib and use the gzip module directly; but remember that under the hood, gzip uses zlib.

    fh = gzip.open('abc.gz', 'rb')
    cdata = fh.read()
    fh.close()
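On Python 3 the same read is usually written with a with block so the file is closed even if reading fails. The sketch below first writes a small demo file to a temporary directory (the path is made up for the example) and then reads it back:

```python
import gzip
import os
import tempfile

# Hypothetical demo file; in practice this would be your existing .gz file.
path = os.path.join(tempfile.mkdtemp(), "abc.gz")

with gzip.open(path, "wb") as fh:
    fh.write(b"test")

with gzip.open(path, "rb") as fh:    # closed automatically, even on error
    cdata = fh.read()                # holds the decompressed bytes

print(cdata)                         # b'test'
```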

Source