'utf8' codec can't decode byte 0x8b in position 1: invalid start byte #288

Open
opened 2026-02-20 21:08:10 -05:00 by deekerman · 12 comments
Owner

Originally created by @jherazob on GitHub (Jun 28, 2012).

Output of youtube-dl 5gVYfDCgYxk:

[youtube] Setting language
[youtube] 5gVYfDCgYxk: Downloading video webpage
[youtube] 5gVYfDCgYxk: Downloading video info webpage
[youtube] 5gVYfDCgYxk: Extracting video information
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/jherazob/bin/youtube-dl/__main__.py", line 7, in <module>
  File "/home/jherazob/bin/youtube-dl/__init__.py", line 535, in main
  File "/home/jherazob/bin/youtube-dl/__init__.py", line 519, in _real_main
  File "/home/jherazob/bin/youtube-dl/FileDownloader.py", line 475, in download
  File "/home/jherazob/bin/youtube-dl/InfoExtractors.py", line 80, in extract
  File "/home/jherazob/bin/youtube-dl/InfoExtractors.py", line 350, in _real_extract
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 1: invalid start byte

The title is normal English text, not exotic characters. I imagine at some point it's assuming the text is Unicode but it isn't.


@FiloSottile commented on GitHub (Jun 30, 2012):

The issue is related to the age verification. I think the whole age verification and login processes need to be rewritten.


@jherazob commented on GitHub (Jul 2, 2012):

I have tested on many videos that require age verification, and yes, that seems to be exactly the problem.


@zoredache commented on GitHub (Jul 6, 2012):

This seems to work as a temporary fix. There may be better solutions though that actually fix the problem. I didn't really dig into it.

diff --git a/youtube_dl/InfoExtractors.py b/youtube_dl/InfoExtractors.py
index baf859e..4a43b46 100644
--- a/youtube_dl/InfoExtractors.py
+++ b/youtube_dl/InfoExtractors.py
@@ -352,9 +352,12 @@ class YoutubeIE(InfoExtractor):
                                        pass

                # description
-               video_description = get_element_by_id("eow-description", video_webpage.decode('utf8'))
-               if video_description: video_description = clean_html(video_description)
-               else: video_description = ''
+               try:
+                       video_description = get_element_by_id("eow-description", video_webpage.decode('utf8'))
+                       if video_description: video_description = clean_html(video_description)
+                       else: video_description = ''
+               except UnicodeDecodeError, err:
+                       video_description = ''

                # closed captions
                video_subtitles = None

Updated to reflect error mentioned by GaelicGrime


@GaelicGrime commented on GitHub (Jul 10, 2012):

There is a typo in the above patch

+               video_description = get_element_by_id("eow-description", video_webpage.decode('utf8')

should read

+               video_description = get_element_by_id("eow-description", video_webpage.decode('utf8'))


@Cybjit commented on GitHub (Aug 5, 2012):

For me the issue was that gzip is not handled

diff --git a/youtube_dl/InfoExtractors.py b/youtube_dl/InfoExtractors.py
index ddb4aa1..cf1b95b 100644
--- a/youtube_dl/InfoExtractors.py
+++ b/youtube_dl/InfoExtractors.py
@@ -278,7 +278,13 @@ class YoutubeIE(InfoExtractor):
        self.report_video_webpage_download(video_id)
        request = urllib2.Request('http://www.youtube.com/watch?v=%s&gl=US&hl=en&has_verified=1' % video_id)
        try:
-           video_webpage = urllib2.urlopen(request).read()
+           response = urllib2.urlopen(request)
+           if response.info().get('Content-Encoding') == 'gzip':
+               buf = StringIO.StringIO(response.read())
+               f = gzip.GzipFile(fileobj=buf)
+               video_webpage = f.read()
+           else:
+               video_webpage = response.read()
        except (urllib2.URLError, httplib.HTTPException, socket.error), err:
            self._downloader.trouble(u'ERROR: unable to download video webpage: %s' % str(err))
            return

@Cybjit commented on GitHub (Aug 5, 2012):

Hmm, YouTube seems to send out that header regardless of whether the body is actually gzip.

diff --git a/youtube_dl/InfoExtractors.py b/youtube_dl/InfoExtractors.py
index ddb4aa1..2ee8bb2 100644
--- a/youtube_dl/InfoExtractors.py
+++ b/youtube_dl/InfoExtractors.py
@@ -279,6 +279,12 @@ class YoutubeIE(InfoExtractor):
        request = urllib2.Request('http://www.youtube.com/watch?v=%s&gl=US&hl=en&has_verified=1' % video_id)
        try:
            video_webpage = urllib2.urlopen(request).read()
+           try:
+               buf = StringIO.StringIO(video_webpage)
+               f = gzip.GzipFile(fileobj=buf)
+               video_webpage = f.read()
+           except IOError:
+               pass
        except (urllib2.URLError, httplib.HTTPException, socket.error), err:
            self._downloader.trouble(u'ERROR: unable to download video webpage: %s' % str(err))
            return
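
For readers on Python 3, the same "try to gunzip, fall back to the raw bytes" idea from the patch above can be sketched roughly as follows. This is only an illustration, not youtube-dl code: `gunzip_or_passthrough` is a hypothetical helper name, and the set of exceptions caught is an assumption about what a non-gzip or damaged body can raise.

```python
import gzip
import zlib


def gunzip_or_passthrough(body):
    """Hypothetical helper: decompress body if it is gzip, else return it unchanged."""
    try:
        return gzip.decompress(body)
    except (OSError, EOFError, zlib.error):
        # Not a gzip stream (or a damaged one): keep the bytes as received.
        return body


page = b"<html>plain page</html>"
# Both calls are expected to give back the same page bytes.
from_gzip = gunzip_or_passthrough(gzip.compress(page))
from_plain = gunzip_or_passthrough(page)
```

The drawback is the same as in the patch: every response goes through the decompressor, even when it is obviously plain HTML.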

@DopefishJustin commented on GitHub (Aug 10, 2012):

I worked around this by replacing video_webpage.decode('utf8') with video_webpage.decode('utf8','replace') which replaces invalid characters rather than just bombing out.
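
A small Python 3 sketch of what this workaround does, assuming (as the gzip comments above suggest) that the page body is really a gzip stream. Only the first three bytes of such a stream are used here, and the variable names are illustrative. Strict decoding fails on the second byte, which matches the error in the title; `'replace'` avoids the exception but leaves a replacement character in place of the compressed data.

```python
# Assumption: the failing body starts like a gzip stream, i.e. the magic
# bytes 0x1f 0x8b followed by the compression method byte 0x08.
raw = b"\x1f\x8b\x08"

try:
    raw.decode("utf8")
    strict_error = None
except UnicodeDecodeError as err:
    # 0x8b cannot start a UTF-8 sequence, so decoding stops at index 1.
    strict_error = err

# With 'replace', each undecodable byte becomes U+FFFD instead of raising.
lenient = raw.decode("utf8", "replace")
```

So the crash goes away, but if the body is compressed the description is still not recoverable from the decoded text.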


@rifter commented on GitHub (Oct 26, 2012):

Any chance this fix will be included in the main program?


@rifter commented on GitHub (Oct 27, 2012):

Anyway I do get this but it's intermittent. Like eventually downloading the same video works, I just have to keep trying.


@phihag commented on GitHub (Oct 27, 2012):

I'll have a look at it, but the intermittent nature and no clear diagnosis (and the fact that I could never reproduce this issue) make it hard to decide. And instead of blindly decoding gzip, we should really detect it.


@Cybjit commented on GitHub (Oct 30, 2012):

The problem is not occurring for me anymore. But I looked into detecting gzip, and the second byte is indeed 0x8b.
Second attempt, untested:

diff --git a/youtube_dl/InfoExtractors.py b/youtube_dl/InfoExtractors.py
index 9df521d..29886c3 100644
--- a/youtube_dl/InfoExtractors.py
+++ b/youtube_dl/InfoExtractors.py
@@ -304,6 +304,10 @@ class YoutubeIE(InfoExtractor):
        request = urllib2.Request('http://www.youtube.com/watch?v=%s&gl=US&hl=en&has_verified=1' % video_id)
        try:
            video_webpage = urllib2.urlopen(request).read()
+           if len(video_webpage) > 2 and video_webpage[0] == '\x1f' and video_webpage[1] == '\x8b':
+               buf = StringIO.StringIO(video_webpage)
+               f = gzip.GzipFile(fileobj=buf)
+               video_webpage = f.read()
        except (urllib2.URLError, httplib.HTTPException, socket.error), err:
            self._downloader.trouble(u'ERROR: unable to download video webpage: %s' % str(err))
            return
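
A rough Python 3 equivalent of the magic-byte check in the patch above, as a sketch only. `maybe_gunzip` and `GZIP_MAGIC` are hypothetical names, not part of youtube-dl. The comparison uses a slice because indexing a `bytes` object in Python 3 yields integers, so the `video_webpage[0] == '\x1f'` form from the Python 2 patch would never match there.

```python
import gzip

# A gzip stream starts with the two magic bytes 0x1f 0x8b.
GZIP_MAGIC = b"\x1f\x8b"


def maybe_gunzip(body):
    """Hypothetical helper: decompress only when body starts with the gzip magic."""
    if body[:2] == GZIP_MAGIC:
        return gzip.decompress(body)
    return body


page = b"<html>plain page</html>"
# Compressed input is unpacked; plain input is returned untouched.
unpacked = maybe_gunzip(gzip.compress(page))
untouched = maybe_gunzip(page)
```

Unlike the try/except variant, this never hands plain HTML to the decompressor, which is closer to "detect it" than "blindly decode it".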

@ly695908698 commented on GitHub (Jun 26, 2017):

Because the web page is gzip-compressed, you should unpack it:

import gzip
import io

data = res.read()  # res is the response object returned by urlopen
if res.info().get('Content-Encoding') == 'gzip':
    buf = io.BytesIO(data)  # on Python 2, use StringIO.StringIO instead
    gzip_f = gzip.GzipFile(fileobj=buf)
    content = gzip_f.read()
else:
    content = data

Reference
starred/youtube-dl#288