Unable to download the YouTube channel member video #25281

Open
opened 2026-02-21 13:44:13 -05:00 by deekerman · 14 comments
Owner

Originally created by @fairfaxhshw on GitHub (May 29, 2022).

Checklist

  • I'm reporting a broken site support
  • I've verified that I'm running youtube-dl version 2021.12.17
  • I've checked that all provided URLs are alive and playable in a browser
  • I've checked that all URLs and arguments with special characters are properly quoted or escaped
  • I've searched the bugtracker for similar issues including closed ones

Verbose log

C:\Users\OWNER\Music\youtube-dl>youtube-dl -v --cookies cookies.txt -f best --external-downloader aria2c --external-downloader-args "-j 16 -x 16 -s 16 -k 1M" https://www.youtube.com/watch?v=MCy7s-c5xAw
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['-v', '--cookies', 'cookies.txt', '-f', 'best', '--external-downloader', 'aria2c', '--external-downloader-args', '-j 16 -x 16 -s 16 -k 1M', 'https://www.youtube.com/watch?v=MCy7s-c5xAw']
[debug] Encodings: locale cp949, fs mbcs, out cp949, pref cp949
[debug] youtube-dl version 2021.12.17
[debug] Python version 3.4.4 (CPython) - Windows-10-10.0.19041
[debug] exe versions: ffmpeg 4.4-full_build-www.gyan.dev, ffprobe 4.4-full_build-www.gyan.dev
[debug] Proxy map: {}
[youtube] MCy7s-c5xAw: Downloading webpage
WARNING: [youtube] MCy7s-c5xAw: Failed to parse JSON Unterminated string starting at: line 1 column 60361 (char 60360)
[youtube] MCy7s-c5xAw: Downloading API JSON
ERROR: This video is available to this channel's members on level: TEAM AAAA (or any higher level). Join this channel to get access to members-only content and other exclusive perks.
Traceback (most recent call last):
  File "C:\Users\dst\AppData\Roaming\Build archive\youtube-dl\ytdl-org\tmpupik7c6w\build\youtube_dl\YoutubeDL.py", line 815, in wrapper
  File "C:\Users\dst\AppData\Roaming\Build archive\youtube-dl\ytdl-org\tmpupik7c6w\build\youtube_dl\YoutubeDL.py", line 836, in __extract_info
  File "C:\Users\dst\AppData\Roaming\Build archive\youtube-dl\ytdl-org\tmpupik7c6w\build\youtube_dl\extractor\common.py", line 534, in extract
  File "C:\Users\dst\AppData\Roaming\Build archive\youtube-dl\ytdl-org\tmpupik7c6w\build\youtube_dl\extractor\youtube.py", line 1731, in _real_extract
youtube_dl.utils.ExtractorError: This video is available to this channel's members on level: TEAM AAAA (or any higher level). Join this channel to get access to members-only content and other exclusive perks.

Description

I'm trying to download a YouTube members-only video and am unable to do so.
I'm currently a member of the channel and can watch the video on the website.
I exported the most recent cookies and passed the cookie file to the command.

I'm able to download other videos that don't require membership using youtube-dl.
I was able to download members-only videos about two weeks ago without a problem.

@dirkf commented on GitHub (May 29, 2022):

Your output is the same as that for non-members except for this:

WARNING: [youtube] MCy7s-c5xAw: Failed to parse JSON Unterminated string starting at: line 1 column 60361 (char 60360)

So the cookie is having some effect. Apparently the cookie-driven page has a different structure such that the embedded hydration JSON (probably) isn't correctly extracted. The extractor then POSTs a query to
https://www.youtube.com/youtubei/v1/player, which gives JSON containing the logged error message.

If you use --write-pages and attach the resulting files, it should be possible to analyse the page and find the problem.

See also #29928. Other historical issues seem to have been due to the cookie file being incorrect or not specified correctly.

@coletdjnz commented on GitHub (May 29, 2022):

this is fixed in yt-dlp by github.com/yt-dlp/yt-dlp@ee27297f82

test video: https://www.youtube.com/watch?v=tjjjtzRLHvA

@coletdjnz commented on GitHub (May 29, 2022):

So the cookie is having some effect. Apparently the cookie-driven page has a different structure such that the embedded hydration JSON (probably) isn't correctly extracted. The extractor then POSTs a query to https://www.youtube.com/youtubei/v1/player, which gives JSON containing the logged error message.

Yeah, youtube-dl has no auth support with Innertube, hence this error (it was one of the early things fixed in yt-dlp). The player request itself is also lacking many parameters, so it doesn't always work; youtube-dl is therefore reliant on extracting the data from the webpage (which is failing here).

@dirkf commented on GitHub (May 30, 2022):

this is fixed in yt-dlp by yt-dlp/yt-dlp@ee27297

Well, I could run the test video, but what are these unparseable sham JSON strings?

@dirkf commented on GitHub (May 30, 2022):

this is fixed in yt-dlp by yt-dlp/yt-dlp@ee27297

Really, that seems like a bit of a hack, unless there is a use case for fatal=True, lenient=True. Don't we want to know when the extraction is going wrong?

Well, I could run the test video, but what are these unparseable sham JSON strings?

I have run the test video. Aha! The test video's title is the rather antagonistic ハッシュタグ無し };if\n window.ytcsi (apparently "no hashtag};..."), which breaks the pattern used to extract the YT initial data, as \n doesn't match .+? without re.DOTALL. Also, we're looking for a block that's terminated by ; var meta = whereas YT is now setting var head = first. The fallback pattern then returns an initial substring of the JSON that crashes the parser.
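The failure mode described above can be reproduced with a minimal, hypothetical page snippet (the title text and variable names here are illustrative, not taken from the actual page):

```python
import re
import json

# Hypothetical page: the embedded JSON contains '};' inside a string value,
# mimicking the troll title '};if\n window.ytcsi' from the test video.
page = 'ytInitialData = {"title": "no hashtag };if"};var head = {};'

# The lazy fallback pattern stops at the FIRST '};' it can match, which here
# is inside the string literal, so only a prefix of the JSON is captured.
m = re.search(r'ytInitialData\s*=\s*({.+?})\s*;', page)
print(repr(m.group(1)))  # a truncated prefix of the JSON

try:
    json.loads(m.group(1))
except json.JSONDecodeError as e:
    # Same class of error as the reported WARNING:
    # "Unterminated string starting at: ..."
    print('parse fails:', e)
```

The truncated capture is exactly "an initial substring of the JSON that crashes the parser" as described above.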

The initial hydration data may also contain a potentially confusing chunk of JS as the value of its attestation.playerAttestationRenderer.interpreterSafeScript.botguardData.privateDoNotAccessOrElseSafeScriptWrappedValue member. As it's minified with fewer than 3339 variables, its variables are at most 2 characters.

Finally, yt-dl has two largely identical methods, YoutubeIE._extract_yt_initial_variable() and YoutubeBaseInfoExtractor._extract_yt_initial_data(), that should be unified as YoutubeBaseInfoExtractor._extract_yt_initial_variable() (yt-dlp has YoutubeBaseInfoExtractor.extract_yt_initial_data(), but it's apparently not used outside the YT extractor, so the same could apply there).

If we strip the trailing ; from the main pattern and make this _YT_INITIAL_BOUNDARY_RE

r'(?:;\s*var\s+[\w$]{3,}|;?\s*</script|;\s*?\n)'

the JSON can be correctly extracted.
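As a sketch of this fix against the same kind of hypothetical page snippet (title text and variable names are illustrative):

```python
import re
import json

_YT_INITIAL_DATA_RE = r'(?:window\s*\[\s*["\']ytInitialData["\']\s*\]|ytInitialData)\s*=\s*({.+?})\s*;'
_YT_INITIAL_BOUNDARY_RE = r'(?:;\s*var\s+[\w$]{3,}|;?\s*</script|;\s*?\n)'

page = 'ytInitialData = {"title": "no hashtag };if"};var head = {};'

# Strip the trailing ';' from the main pattern, enable DOTALL with (?s), and
# require a real statement boundary (';var xyz', '</script' or ';\n') after
# the closing brace instead of any bare ';'.
pattern = r'(?s)%s\s*%s' % (_YT_INITIAL_DATA_RE.rstrip(';'), _YT_INITIAL_BOUNDARY_RE)
m = re.search(pattern, page)

# The '};' inside the string no longer terminates the match, because it is
# not followed by a statement boundary; '};var head' is.
print(json.loads(m.group(1)))
```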

--- old/youtube-dl/youtube_dl/extractor/youtube.py
+++ new/youtube-dl/youtube_dl/extractor/youtube.py
@@ -284,7 +284,7 @@
 
     _YT_INITIAL_DATA_RE = r'(?:window\s*\[\s*["\']ytInitialData["\']\s*\]|ytInitialData)\s*=\s*({.+?})\s*;'
     _YT_INITIAL_PLAYER_RESPONSE_RE = r'ytInitialPlayerResponse\s*=\s*({.+?})\s*;'
-    _YT_INITIAL_BOUNDARY_RE = r'(?:var\s+meta|</script|\n)'
+    _YT_INITIAL_BOUNDARY_RE = r'(?:;\s*var\s+[\w$]{3,}|;?\s*</script|;\s*?\n)'
 
     def _call_api(self, ep, query, video_id, fatal=True):
         data = self._DEFAULT_API_DATA.copy()
@@ -297,12 +297,10 @@
             headers={'content-type': 'application/json'},
             query={'key': 'AIzaSyAO_FJ2SlqU8Q4STEHLGCilw_Y9_11qcW8'})
 
-    def _extract_yt_initial_data(self, video_id, webpage):
-        return self._parse_json(
-            self._search_regex(
-                (r'%s\s*%s' % (self._YT_INITIAL_DATA_RE, self._YT_INITIAL_BOUNDARY_RE),
-                 self._YT_INITIAL_DATA_RE), webpage, 'yt initial data'),
-            video_id)
+    def _extract_yt_initial_variable(self, webpage, regex, video_id, name):
+        return self._parse_json(self._search_regex(
+            (r'(?s)%s\s*%s' % (regex.rstrip(';'), self._YT_INITIAL_BOUNDARY_RE),
+             regex), webpage, name, default='{}'), video_id, fatal=False)
 
     def _extract_ytcfg(self, video_id, webpage):
         return self._parse_json(
@@ -1654,11 +1652,6 @@
             })
         return chapters
 
-    def _extract_yt_initial_variable(self, webpage, regex, video_id, name):
-        return self._parse_json(self._search_regex(
-            (r'%s\s*%s' % (regex, self._YT_INITIAL_BOUNDARY_RE),
-             regex), webpage, name, default='{}'), video_id, fatal=False)
-
     def _real_extract(self, url):
         url, smuggled_data = unsmuggle_url(url, {})
         video_id = self._match_id(url)
@@ -3026,7 +3019,7 @@
                 return self.url_result(video_id, ie=YoutubeIE.ie_key(), video_id=video_id)
             self.to_screen('Downloading playlist %s - add --no-playlist to just download video %s' % (playlist_id, video_id))
         webpage = self._download_webpage(url, item_id)
-        data = self._extract_yt_initial_data(item_id, webpage)
+        data = self._extract_yt_initial_variable(webpage, self._YT_INITIAL_DATA_RE, video_id, 'yt initial data')
         tabs = try_get(
             data, lambda x: x['contents']['twoColumnBrowseResultsRenderer']['tabs'], list)
         if tabs:

And the test video tjjjtzRLHvA:

$ python -m youtube_dl -v -F --ignore-config 'https://www.youtube.com/watch?v=tjjjtzRLHvA'

[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: [u'-v', u'-F', u'--ignore-config', u'https://www.youtube.com/watch?v=tjjjtzRLHvA']
[debug] Encodings: locale UTF-8, fs UTF-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2021.12.17
[debug] Git HEAD: 04fd3289d
[debug] Python version 2.7.17 (CPython) - Linux-4.4.0-210-generic-i686-with-Ubuntu-16.04-xenial
[debug] exe versions: avconv 4.3, avprobe 4.3, ffmpeg 4.3, ffprobe 4.3
[debug] Proxy map: {}
[youtube] tjjjtzRLHvA: Downloading webpage
[debug] [youtube] Decrypted nsig MZKNNaj5qnOtL1kDxc-q => WhWpMgo90a-uUQ
[debug] [youtube] Decrypted nsig lxmrUIylnXPO25AvJzZk => lRCxh6n1geqddw
[info] Available formats for tjjjtzRLHvA:
format code  extension  resolution note
249          webm       audio only tiny   41k , webm_dash container, opus @ 41k (48000Hz), 27.83KiB
250          webm       audio only tiny   42k , webm_dash container, opus @ 42k (48000Hz), 28.56KiB
251          webm       audio only tiny   84k , webm_dash container, opus @ 84k (48000Hz), 56.94KiB
140          m4a        audio only tiny  130k , m4a_dash container, mp4a.40.2@130k (44100Hz), 88.41KiB
160          mp4        82x144     144p   20k , mp4_dash container, avc1.4d400b@  20k, 30fps, video only, 14.04KiB
133          mp4        136x240    144p   40k , mp4_dash container, avc1.4d400c@  40k, 30fps, video only, 26.99KiB
278          webm       144x256    144p   45k , webm_dash container, vp9@  45k, 30fps, video only, 30.65KiB
242          webm       240x426    240p   58k , webm_dash container, vp9@  58k, 30fps, video only, 39.12KiB
134          mp4        202x360    240p   75k , mp4_dash container, avc1.4d400d@  75k, 30fps, video only, 50.54KiB
135          mp4        270x480    240p  143k , mp4_dash container, avc1.4d4015@ 143k, 30fps, video only, 96.55KiB
243          webm       360x640    360p  115k , webm_dash container, vp9@ 115k, 30fps, video only, 77.29KiB
136          mp4        406x720    360p  305k , mp4_dash container, avc1.64001e@ 305k, 30fps, video only, 205.08KiB
244          webm       480x854    480p  210k , webm_dash container, vp9@ 210k, 30fps, video only, 141.36KiB
137          mp4        608x1080   480p  610k , mp4_dash container, avc1.64001f@ 610k, 30fps, video only, 410.21KiB
247          webm       720x1280   720p  549k , webm_dash container, vp9@ 549k, 30fps, video only, 368.78KiB
18           mp4        360x640    360p  426k , avc1.42001E, 30fps, mp4a.40.2 (48000Hz), 288.73KiB
22           mp4        406x720    360p  435k , avc1.64001F, 30fps, mp4a.40.2 (44100Hz) (best)
$
@coletdjnz commented on GitHub (May 30, 2022):

@pukkandan (since you were the one that wrote it)

@jim60105 commented on GitHub (May 30, 2022):

ytarchive had a similar issue a few days ago, FYI
https://github.com/Kethsar/ytarchive/issues/93#issuecomment-1140275153

@fairfaxhshw commented on GitHub (May 30, 2022):

Alright. I just ran it with --write-pages to attach the resulting files. I was unable to attach the dump files directly, so I have attached a compressed archive instead.

@pukkandan commented on GitHub (May 31, 2022):

Maybe lenient is not a very good keyword. What it actually does is parse the JSON until an error is reached. In other words, it can parse JSON content embedded in a larger text (like {...}<..>).

Originally, I attempted to fix this issue with just regex. But since Python's regex engine does not support recursive groups or even possessive quantifiers, it is impossible to write a foolproof regex to capture JSON without risking catastrophic backtracking. E.g. r'ytInitialPlayerResponse\s*=\s*({(?:"(?:\\"|[^"])+"|[^"])+});' works, but hangs indefinitely if the pattern is not found on the page.

Actually, this is not the first time I have encountered this issue. The same problem existed when trying to isolate {...} code blocks for jsinterp; I had written JSInterpreter._separate_at_paren for that reason. So I could add quoting support to it (and move it to utils) to address this use case. (Note that the regex must then be changed to greedy, since we can handle over-capturing, but not under-capturing.)
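A quote-aware brace matcher of the kind described is, roughly (a standalone sketch for illustration, not the actual _separate_at_paren code; it tracks only braces, not all paren types):

```python
def split_at_closing_brace(s):
    """Return (inner, rest) for a string starting at '{', tracking nesting
    depth while skipping braces that occur inside quoted strings."""
    assert s[0] == '{'
    depth, in_quote, escaping = 0, None, False
    for idx, ch in enumerate(s):
        if in_quote:
            if escaping:
                escaping = False      # this char was escaped; consume it
            elif ch == '\\':
                escaping = True
            elif ch == in_quote:
                in_quote = None       # closing quote
        elif ch in '\'"':
            in_quote = ch             # opening quote
        elif ch == '{':
            depth += 1
        elif ch == '}':
            depth -= 1
            if depth == 0:
                return s[:idx + 1], s[idx + 1:]
    raise ValueError('unbalanced braces')

# The '}' inside the string value does not end the block.
inner, rest = split_at_closing_brace('{"a": "}", "b": {"c": 1}}; var head = 1;')
print(inner)  # {"a": "}", "b": {"c": 1}}
print(rest)   # ; var head = 1;
```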

diff --git a/yt_dlp/extractor/common.py b/yt_dlp/extractor/common.py
index b24599d5f..7b74a4b64 100644
--- a/yt_dlp/extractor/common.py
+++ b/yt_dlp/extractor/common.py
@@ -1034,8 +1034,13 @@ def _download_json(
         return res if res is False else res[0]

     def _parse_json(self, json_string, video_id, transform_source=None, fatal=True):
-        if transform_source:
-            json_string = transform_source(json_string)
+        try:
+            if transform_source:
+                json_string = transform_source(json_string)
+        except ExtractorError as e:
+            if not fatal:
+                self.report_warning(f'{video_id}: Failed to transform JSON: {e}')
+            raise
         try:
             return json.loads(json_string, strict=False)
         except ValueError as ve:
diff --git a/yt_dlp/extractor/youtube.py b/yt_dlp/extractor/youtube.py
index 69b58088d..bf02f3d88 100644
--- a/yt_dlp/extractor/youtube.py
+++ b/yt_dlp/extractor/youtube.py
@@ -397,8 +397,8 @@ def _check_login_required(self):
         if self._LOGIN_REQUIRED and not self._cookies_passed:
             self.raise_login_required('Login details are needed to download this content', method='cookies')

-    _YT_INITIAL_DATA_RE = r'(?:window\s*\[\s*["\']ytInitialData["\']\s*\]|ytInitialData)\s*=\s*({.+?})\s*;'
-    _YT_INITIAL_PLAYER_RESPONSE_RE = r'ytInitialPlayerResponse\s*=\s*({.+?})\s*;'
+    _YT_INITIAL_DATA_RE = r'(?:window\s*\[\s*["\']ytInitialData["\']\s*\]|ytInitialData)\s*=\s*({.+})\s*;'
+    _YT_INITIAL_PLAYER_RESPONSE_RE = r'ytInitialPlayerResponse\s*=\s*({.+})\s*;'
     _YT_INITIAL_BOUNDARY_RE = r'(?:var\s+meta|</script|\n)'

     def _get_default_ytcfg(self, client='web'):
@@ -2743,9 +2743,10 @@ def _extract_chapters(self, chapter_list, chapter_time, chapter_title, duration)
         return chapters

     def _extract_yt_initial_variable(self, webpage, regex, video_id, name):
-        return self._parse_json(self._search_regex(
-            (fr'{regex}\s*{self._YT_INITIAL_BOUNDARY_RE}',
-             regex), webpage, name, default='{}'), video_id, fatal=False)
+        return self._parse_json(
+            self._search_regex(regex, webpage, name, default='{}'),
+            video_id, fatal=False,
+            transform_source=lambda x: '{%s}' % JSInterpreter._separate_at_paren(x, '}')[0])

     def _extract_comment(self, comment_renderer, parent=None):
         comment_id = comment_renderer.get('commentId')
diff --git a/yt_dlp/jsinterp.py b/yt_dlp/jsinterp.py
index 70857b798..56229cd99 100644
--- a/yt_dlp/jsinterp.py
+++ b/yt_dlp/jsinterp.py
@@ -24,6 +24,7 @@
 _NAME_RE = r'[a-zA-Z_$][a-zA-Z_$0-9]*'

 _MATCHING_PARENS = dict(zip('({[', ')}]'))
+_QUOTES = '\'"'


 class JS_Break(ExtractorError):
@@ -69,12 +70,17 @@ def _separate(expr, delim=',', max_split=None):
             return
         counters = {k: 0 for k in _MATCHING_PARENS.values()}
         start, splits, pos, delim_len = 0, 0, 0, len(delim) - 1
+        in_quote, escaping = None, False
         for idx, char in enumerate(expr):
             if char in _MATCHING_PARENS:
                 counters[_MATCHING_PARENS[char]] += 1
             elif char in counters:
                 counters[char] -= 1
-            if char != delim[pos] or any(counters.values()):
+            elif not escaping and char in _QUOTES and in_quote in (char, None):
+                in_quote = None if in_quote else char
+            escaping = not escaping and in_quote and char == '\\'
+
+            if char != delim[pos] or any(counters.values()) or in_quote:
                 pos = 0
                 continue
             elif pos != delim_len:

But when I thought about it more, this is what json.loads already does internally via JSONDecoder.raw_decode. The only difference is that the stdlib raises when the unparsed trailing section is not just whitespace. So we can just catch that error, trim the JSON at the point of error, and try to parse it again. This is how I ended up with the current implementation.
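The stdlib behaviour being described can be seen directly (a minimal sketch; the blob contents are illustrative):

```python
import json

blob = '{"videoDetails": {"title": "ok"}} ;if (window.ytcsi) {...}'

# json.loads rejects the trailing non-JSON text...
try:
    json.loads(blob)
except json.JSONDecodeError as e:
    print('loads fails at char', e.pos)

# ...but raw_decode parses the leading object and reports where it stopped,
# which is exactly what a "lenient" decoder needs.
obj, end = json.JSONDecoder().raw_decode(blob)
print(obj)  # {'videoDetails': {'title': 'ok'}}
print(end)  # index just past the closing '}'
```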

Another solution could be to create a custom parser.

diff --git a/yt_dlp/extractor/common.py b/yt_dlp/extractor/common.py
index b24599d5f..d43280b07 100644
--- a/yt_dlp/extractor/common.py
+++ b/yt_dlp/extractor/common.py
@@ -35,6 +35,7 @@
     ExtractorError,
     GeoRestrictedError,
     GeoUtils,
+    LenientJSONDecoder,
     RegexNotFoundError,
     UnsupportedError,
     age_restricted,
@@ -1033,11 +1034,11 @@ def _download_json(
             expected_status=expected_status)
         return res if res is False else res[0]

-    def _parse_json(self, json_string, video_id, transform_source=None, fatal=True):
+    def _parse_json(self, json_string, video_id, transform_source=None, fatal=True, lenient=False):
         if transform_source:
             json_string = transform_source(json_string)
         try:
-            return json.loads(json_string, strict=False)
+            return json.loads(json_string, strict=False, cls=LenientJSONDecoder if lenient else None)
         except ValueError as ve:
             errmsg = '%s: Failed to parse JSON ' % video_id
             if fatal:
diff --git a/yt_dlp/extractor/youtube.py b/yt_dlp/extractor/youtube.py
index 245778dff..ee36c229f 100644
--- a/yt_dlp/extractor/youtube.py
+++ b/yt_dlp/extractor/youtube.py
@@ -397,8 +397,8 @@ def _check_login_required(self):
         if self._LOGIN_REQUIRED and not self._cookies_passed:
             self.raise_login_required('Login details are needed to download this content', method='cookies')

-    _YT_INITIAL_DATA_RE = r'(?:window\s*\[\s*["\']ytInitialData["\']\s*\]|ytInitialData)\s*=\s*({.+?})\s*;'
-    _YT_INITIAL_PLAYER_RESPONSE_RE = r'ytInitialPlayerResponse\s*=\s*({.+?})\s*;'
+    _YT_INITIAL_DATA_RE = r'(?:window\s*\[\s*["\']ytInitialData["\']\s*\]|ytInitialData)\s*=\s*({.+})\s*;'
+    _YT_INITIAL_PLAYER_RESPONSE_RE = r'ytInitialPlayerResponse\s*=\s*({.+})\s*;'
     _YT_INITIAL_BOUNDARY_RE = r'(?:var\s+meta|</script|\n)'

     def _get_default_ytcfg(self, client='web'):
@@ -2754,7 +2754,7 @@ def _extract_chapters(self, chapter_list, chapter_time, chapter_title, duration)
     def _extract_yt_initial_variable(self, webpage, regex, video_id, name):
         return self._parse_json(self._search_regex(
             (fr'{regex}\s*{self._YT_INITIAL_BOUNDARY_RE}',
-             regex), webpage, name, default='{}'), video_id, fatal=False)
+             regex), webpage, name, default='{}'), video_id, fatal=False, lenient=True)

     def _extract_comment(self, comment_renderer, parent=None):
         comment_id = comment_renderer.get('commentId')
diff --git a/yt_dlp/utils.py b/yt_dlp/utils.py
index b0300b724..ee858afaf 100644
--- a/yt_dlp/utils.py
+++ b/yt_dlp/utils.py
@@ -5381,6 +5381,13 @@ def __repr__(self):
         return f'{type(self).__name__}({", ".join(f"{k}={v}" for k, v in self)})'


+class LenientJSONDecoder(json.JSONDecoder):
+    """JSONDecoder that ignores excess text"""
+
+    def decode(self, s):
+        return self.raw_decode(s.lstrip())[0]
+
+
 # Deprecated
 has_certifi = bool(certifi)
 has_websockets = bool(websockets)

PS: Feel free to copy the code for any of these solutions (I honestly wouldn't recommend the regex though)
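As a standalone illustration of the `raw_decode` approach in the diff above, here is a minimal sketch (mirroring the `LenientJSONDecoder` from the patch, with hypothetical input strings):

```python
import json

class LenientJSONDecoder(json.JSONDecoder):
    """JSONDecoder that ignores excess text after the first JSON value."""

    def decode(self, s):
        # raw_decode() returns (obj, end_index) and, unlike decode(),
        # does not require the remainder of the string to be whitespace
        return self.raw_decode(s.lstrip())[0]

# Strict parsing chokes on trailing page content ...
try:
    json.loads('{"a": 1}</script><body>')
except ValueError as e:
    print('strict:', e)

# ... while the lenient decoder stops at the end of the JSON value
print(json.loads('{"a": 1}</script><body>', cls=LenientJSONDecoder))  # {'a': 1}
```

This is exactly the situation in the bug report: `ytInitialPlayerResponse` is embedded in a larger page, so a greedy regex over-captures and the decoder trims the excess.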

@dirkf commented on GitHub (May 31, 2022):

Yes, not a hack at all, or rather, an excellent hack, once you actually read the code properly. Finding the end of a JSON block with regex is clearly unviable in general, so it's much better to use the parser from the json module.

I'm told that the Go JSON parser has this lenience built in, which is why the ytarchive change mentioned above also did this:

... regex must be changed to greedy since we can handle over-capturing.

One comment:

        ...
        try:
            # should be outside the try block?
            if transform_source:
                json_string = transform_source(json_string)
        except ExtractorError as e:
        ...

Adding a decoder kw (default json.JSONDecoder) to _parse_json() might be good, as some APIs may send XML or some other return in certain cases (e.g. on error), and it would be easier to handle that with a decoder class than with transform_source. It could also replace transform_source, though fatal handling as above would be less straightforward. A function such as the one below could be applied to make a decoder class from a transform_source function:

def json_transformer(transform_source):
    class xf(json.JSONDecoder):
        # in CPython 2.7 decode() will call this raw_decode()
        # with secret kwargs: check other implementations
        def raw_decode(self, s, **kwargs):
            s = transform_source(s)
            return super(xf, self).raw_decode(s, **kwargs)

    return xf
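The wrapper above can be exercised with a toy transform (a sketch; the single-to-double-quote replacement is purely illustrative and would mangle quotes inside string values):

```python
import json

def json_transformer(transform_source):
    """Build a JSONDecoder subclass that applies transform_source before decoding."""
    class xf(json.JSONDecoder):
        def raw_decode(self, s, **kwargs):
            return super().raw_decode(transform_source(s), **kwargs)
    return xf

# Toy transform: turn single quotes into double quotes. Note that decode()
# checks the end index against the *untransformed* string, so a transform
# that changes the string's length would trip the "Extra data" check.
to_double_quotes = lambda s: s.replace("'", '"')

print(json.loads("{'a': 1}", cls=json_transformer(to_double_quotes)))  # {'a': 1}
```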

Or the transform_source could be tested and treated as a decoder class:

    def _parse_json(self, json_string, video_id, transform_source=None, fatal=True):
        # a plain function transform is applied to the string first
        if callable(transform_source) and not isinstance(transform_source, type):
            try:
                json_string = transform_source(json_string)
            except ExtractorError as e:
                if not fatal:
                    self.report_warning('{0}: Failed to transform JSON: {1}'.format(video_id, e))
                raise
        try:
            # allow duck typing, not just subclass of JSONDecoder
            if isinstance(transform_source, type):
                return json.loads(json_string, strict=False, cls=transform_source)
            return json.loads(json_string, strict=False)
        except ValueError as ve:
            ...
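Stripped of the extractor context, that dispatch idea can be sketched as a standalone function (`parse_json` here is an illustrative stand-in, not the actual extractor method):

```python
import json

def parse_json(json_string, transform_source=None):
    """Hypothetical helper accepting either a string transform or a decoder class."""
    if callable(transform_source) and not isinstance(transform_source, type):
        # plain function: transform the text, then decode normally
        json_string = transform_source(json_string)
        transform_source = None
    # a class (duck-typed decoder) is handed straight to json.loads via cls=
    return json.loads(json_string, strict=False, cls=transform_source)

print(parse_json(' {"a": 1} ', str.strip))       # {'a': 1}
print(parse_json('{"a": 1}', json.JSONDecoder))  # {'a': 1}
```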
@dirkf commented on GitHub (May 31, 2022):

Just did "--write-pages" and attach the resulting files.

Thanks. I checked the YT page that you dumped, and the same problem that I analysed above applies. So it should be fixed by the modified YT extractor.

@pukkandan commented on GitHub (Jun 6, 2022):

FYI, yt-dlp's implementation has been changed to use a custom decoder. github.com/yt-dlp/yt-dlp@b7c47b7438

@pukkandan commented on GitHub (Jun 6, 2022):

as some APIs may send XML or some other return in certain cases (error, eg) and it would be easier to handle that with a decoder class than transform_source

Why would you use _parse_json for XML? There is a different parser for it

@dirkf commented on GitHub (Jun 6, 2022):

The case I found was a JSON API at trt.com that returned JSON normally, except that when the API failed it returned XML instead: https://github.com/ytdl-org/youtube-dl/blob/66ee6aa2da2d71ababcbce6b7604c29dc83d2c82/youtube_dl/extractor/trt.py#L244
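A JSON-or-XML API like the one dirkf describes can be handled with a simple fallback; a hypothetical sketch (`parse_api_response` and the error format are illustrative, not youtube-dl code):

```python
import json
import xml.etree.ElementTree as ET

def parse_api_response(text):
    """Parse an API body that is JSON on success but XML on error (hypothetical)."""
    try:
        return json.loads(text)
    except ValueError:
        # fall back to treating the body as an XML error document
        root = ET.fromstring(text)
        raise RuntimeError('API error: %s' % ''.join(root.itertext()).strip())

print(parse_api_response('{"ok": true}'))  # {'ok': True}
```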