YouTube pagination limit and metadata churn? #18179

Closed
opened 2026-02-21 08:32:57 -05:00 by deekerman · 4 comments

Originally created by @amcgregor on GitHub (Oct 8, 2019).

Checklist

  • [x] I'm asking a question
  • [x] I've looked through the README and FAQ for similar questions
  • [x] I've searched the bugtracker for similar questions including closed ones
  • [x] I've searched the source code for possible clues (there aren't many exit-early conditions in those loops…)
  • [x] I've searched the internet (and Stack Overflow, Reddit, etc.) for similar questions

Question

Is it possible to limit the scope of the youtube backend's paged search for new videos?

The sheer number of paged requests is substantial compared to the number of new videos discovered on each run (1-4), which are always present on the first page. I ask not because this causes an actual problem (though I do have rate-limit concerns), but because more time is actually spent fetching pages than fetching video content.

If not, could this be added? --max-pages or similar? When archiving a still-living YouTube channel, I'd like to keep the amount of churn to a minimum. (Why pull in 14 pages, when 1 will do? ;)
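For what it's worth, a partial mitigation worth experimenting with (not a confirmed fix) is bounding the number of playlist *entries* considered, via the existing --playlist-end flag, rather than the page count. Whether this actually avoids fetching later pages depends on the extractor paginating lazily, so treat this sketch as an experiment; the LIMIT value here is an arbitrary example:

```bash
#!/bin/sh
# Experiment: cap how deep into the playlist youtube-dl looks.
# --playlist-end is a real youtube-dl option; whether it saves page
# fetches depends on lazy pagination in the extractor.
LIMIT=50                                  # arbitrary cutoff: newest 50 entries
CMD="youtube-dl --playlist-end $LIMIT"    # compose the bounded invocation
echo "$CMD"
```

If the first page always contains the new uploads, a small LIMIT combined with --download-archive should make each pass cheap regardless of channel size.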

Question

Is it normal to spend large amounts of time re-writing metadata, thumbnails, and subtitles on already-downloaded videos?

I've noticed that every discovered video seems to be rewritten on disk to re-apply metadata and subtitles, even when both are already present. Orders of magnitude more time is spent rewriting already-tagged media than on paged and media fetching combined. You can see this for yourself by running the example invocation below over any channel or playlist with more than one page.

I am re-testing with the --download-archive option, to see if this alters the rewriting behavior in any way. (Maybe actual tracking is needed, as it isn't detecting metadata presence?)
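To quantify the churn, generic POSIX tooling (nothing youtube-dl-specific) works: drop a marker file before a run, then list the media files whose mtime is newer than the marker afterwards. A sketch, using a scratch directory and a stand-in file for the demo:

```bash
#!/bin/sh
# Generic POSIX sketch (not a youtube-dl feature): count files a run rewrote.
dir=$(mktemp -d)                   # scratch dir for the demo
touch "$dir/.run_marker"           # drop the marker before the run
sleep 1                            # ensure a newer mtime on coarse filesystems
# ... the youtube-dl invocation would go here; a stand-in file instead:
touch "$dir/demo--abc123--title--1920x1080.mp4"
# afterwards: anything newer than the marker was (re)written by the run
rewritten=$(find "$dir" -type f -newer "$dir/.run_marker" -name '*.mp4')
echo "$rewritten"
```

Pointing the `find` at the real archive directory after a real run shows exactly which already-downloaded files got rewritten.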

Example Invocation

I'm using the following invocation for the purpose of local archiving:

youtube-dl --no-call-home --ignore-errors --restrict-filenames \
    --no-mark-watched --yes-playlist \
    --continue --no-overwrites \
    --write-description --write-info-json --write-thumbnail --write-sub \
    --add-metadata --embed-thumbnail --embed-subs \
    --merge-output-format mp4 --sub-format best --youtube-skip-dash-manifest \
    --format 137+140/bestvideo[ext=mp4]+bestaudio[ext=m4a] \
    -o "%(playlist)s/%(upload_date)s--%(id)s--%(title)s--%(resolution)s.%(ext)s" \
    "$@"

Example Log

[download] Downloading video 9 of 41
[youtube] PajD4X2wu50: Downloading webpage
WARNING: video doesn't have subtitles
[info] Video description is already present
[info] Video description metadata is already present
[youtube] PajD4X2wu50: Thumbnail is already present
[download] Uploads_from_acapellascience/20170623--PajD4X2wu50--LIVE_-_More_Than_Birds_ft._Singing_Chemist_Jason_Hawkins--1920x1080.mp4 has already been downloaded and merged
*** vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv ***
[ffmpeg] Adding metadata to 'Uploads_from_acapellascience/20170623--PajD4X2wu50--LIVE_-_More_Than_Birds_ft._Singing_Chemist_Jason_Hawkins--1920x1080.mp4'
[ffmpeg] There aren't any subtitles to embed
[atomicparsley] Adding thumbnail to "Uploads_from_acapellascience/20170623--PajD4X2wu50--LIVE_-_More_Than_Birds_ft._Singing_Chemist_Jason_Hawkins--1920x1080.mp4"
*** ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ***
[download] Downloading video 10 of 41
[youtube] f8FAJXPBdOg: Downloading webpage
[info] Video description is already present
[info] Writing video subtitles to: Uploads_from_acapellascience/20170609--f8FAJXPBdOg--The_Molecular_Shape_of_You_Ed_Sheeran_Parody_A_Capella_Science--1920x1080.en.vtt
[info] Video description metadata is already present
[youtube] f8FAJXPBdOg: Thumbnail is already present
[download] Uploads_from_acapellascience/20170609--f8FAJXPBdOg--The_Molecular_Shape_of_You_Ed_Sheeran_Parody_A_Capella_Science--1920x1080.mp4 has already been downloaded and merged
*** vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv ***
[ffmpeg] Adding metadata to 'Uploads_from_acapellascience/20170609--f8FAJXPBdOg--The_Molecular_Shape_of_You_Ed_Sheeran_Parody_A_Capella_Science--1920x1080.mp4'
[ffmpeg] Embedding subtitles in 'Uploads_from_acapellascience/20170609--f8FAJXPBdOg--The_Molecular_Shape_of_You_Ed_Sheeran_Parody_A_Capella_Science--1920x1080.mp4'
Deleting original file Uploads_from_acapellascience/20170609--f8FAJXPBdOg--The_Molecular_Shape_of_You_Ed_Sheeran_Parody_A_Capella_Science--1920x1080.en.vtt (pass -k to keep)
[atomicparsley] Adding thumbnail to "Uploads_from_acapellascience/20170609--f8FAJXPBdOg--The_Molecular_Shape_of_You_Ed_Sheeran_Parody_A_Capella_Science--1920x1080.mp4"
*** ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ***
deekerman 2026-02-21 08:32:57 -05:00

@amcgregor commented on GitHub (Oct 8, 2019):

Adding --download-archive _archive.ids to the arglist seems to have corrected my primary "churn" issue of rewriting to apply metadata. Edited to add: this has also changed the meaning of --max-downloads, which previously stopped after churning through the first N already-downloaded videos; now it means "download N more videos than you already have".
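Since the filenames produced by the template above embed the video ID as the second "--"-separated field, and youtube-dl's archive format really is one "youtube <id>" line per downloaded video, an archive can be seeded from files downloaded before --download-archive was in use. A sketch under that filename assumption (the `seed_archive` helper name is made up for illustration):

```bash
#!/bin/sh
# Sketch: seed a youtube-dl archive file from existing filenames.
# Assumes the output template %(upload_date)s--%(id)s--%(title)s--... ,
# so the video ID is the second "--"-separated field.
seed_archive() {
    for f in "$@"; do
        base=${f##*/}                                      # strip directory
        id=$(printf '%s\n' "$base" | awk -F'--' '{print $2}')
        [ -n "$id" ] && printf 'youtube %s\n' "$id"        # archive line format
    done
}
seed_archive "Uploads_from_acapellascience/20170623--PajD4X2wu50--LIVE--1920x1080.mp4"
```

Redirecting the output into _archive.ids (e.g. `seed_archive */*.mp4 > _archive.ids`) before the first archived run would let already-present videos skip the post-processing churn immediately.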


@amcgregor commented on GitHub (Oct 8, 2019):

So… no answer as to the first part, regarding downloading unnecessary playlist pages?

In my archival case, the first page is truly the only one that needs to be requested. 10-100 (averaging 24) extra pages, times 157 channels… a little shy of 4,000 extra HTTP requests, on each pass through, plus all the comparisons against the archive of IDs for videos guaranteed to be there.

Adds up in time, and request limits.


@amcgregor commented on GitHub (Oct 8, 2019):

Ah, #3794 from 2014, which does replicate the title of this request, has nothing to do with actual limitation on the number of pages being requested, and more to do with a bug regarding a seeming upper bound on the number of videos collected in total. (A "limitation" in "YouTube channel pagination", not "channel pagination limit". ;)


@amcgregor commented on GitHub (Oct 9, 2019):

  • [x] Confused. (But not enough to stop me from ripping out the YouTube back-end to automate properly.)
  • [x] Saddened. (Faith in humanity and whatnot.)
  • [x] Mildly curious what @dstftw thinks this is actually a duplicate of. (I've repeatedly searched and haven't found anything relevant.)
  • [ ] Has received any form of feedback whatsoever.