Remove duplicated lines from subtitles #25144

Open
opened 2026-02-21 12:23:06 -05:00 by deekerman · 9 comments
Owner

Originally created by @YakivGluck on GitHub (Apr 7, 2022).

Checklist

  • I'm reporting a feature request
  • I've verified that I'm running youtube-dl version 2021.12.17
  • I've searched the bugtracker for similar feature requests including closed ones

Description

Auto-generated subtitles downloaded from YouTube usually have duplicated lines. Removing such lines is more or less easy with additional scripts. But wouldn't it be great to have this feature integrated in youtube-dl with a separate option to activate it?

(an example of the mentioned scripts: https://gist.github.com/davidcortesortuno/64723e4262889f592def55c1927db651)


@dirkf commented on GitHub (Apr 9, 2022):

This is an interesting idea. We would have to formulate the rules for doing this and the linked Gist doesn't give me confidence that such rules are well understood.

From my limited acquaintance with the VTT format there seem to be two straightforward cases:

  • duplicate text: consecutive cues with identical text where the cue period is identical: delete the second (and any further such cues);
  • duplicate period: consecutive cues with identical text where the start of the second cue is equal to the end of the first cue, or possibly within the period of the first cue: extend the period of the first cue to the end of the second cue and delete the second (and similarly for any further such cues).

The first case is actually a degenerate special case of the second.
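
The two cases above can be sketched in Python as a single merge rule. This is only an illustration under stated assumptions, not an agreed specification: the `Cue` class and `merge_duplicate_cues` function are hypothetical names, cue times are taken as seconds, and "duplicate" means exact text equality between consecutive cues.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Cue:
    """Hypothetical minimal cue: start/end in seconds plus the cue text."""
    start: float
    end: float
    text: str


def merge_duplicate_cues(cues: Iterable[Cue]) -> List[Cue]:
    """Collapse consecutive cues with identical text whose periods touch or overlap."""
    merged: List[Cue] = []
    for cue in cues:
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and cue.text == prev.text
            and prev.start <= cue.start <= prev.end
        ):
            # Same text, and this cue starts within (or exactly at the end of)
            # the previous period: extend the previous cue and drop this one.
            # An identical period is the degenerate case where nothing changes.
            prev.end = max(prev.end, cue.end)
        else:
            # Copy the cue so the caller's objects are never mutated.
            merged.append(Cue(cue.start, cue.end, cue.text))
    return merged
```

Cues with the same text but a gap between their periods are deliberately left alone here, since the comment above only covers adjacent or overlapping periods.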

Is this an issue with other subtitle formats?


@GNtrazios commented on GitHub (May 5, 2022):

I am interested in solving this issue, but I need your help.
Could you please share the URLs of videos whose VTT files have duplicated lines?


@GNtrazios commented on GitHub (May 8, 2022):

Is anyone available for a few questions about some scripts?


@notBradPitt commented on GitHub (Jan 16, 2023):

> i am interested in solving this issue but i need your help. Could u please share the urls of the videos of which vtt file have duplicated lines?

Almost all of YouTube's auto-generated subtitles do this, likely because subtitles are displayed as the words are spoken and the previous line gets pushed up above the new one.
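
If the duplication really does come from the previous line being carried into the next cue, a filter along these lines might be enough. This is a rough sketch under that assumption only: `drop_rolled_lines` is a hypothetical helper, the input is a plain list of cue texts with timestamps already stripped, and any inline timing markup in real files would need removing first.

```python
from typing import Iterable, List


def drop_rolled_lines(cue_texts: Iterable[str]) -> List[str]:
    """Return caption lines in order, skipping lines repeated from the previous cue."""
    output: List[str] = []
    previous_lines: List[str] = []
    for text in cue_texts:
        # Normalise whitespace and ignore blank lines inside the cue.
        lines = [line.strip() for line in text.splitlines() if line.strip()]
        for line in lines:
            # A line already shown in the previous cue has merely been
            # "pushed up", so it is not emitted a second time.
            if line not in previous_lines:
                output.append(line)
        previous_lines = lines
    return output
```

Comparing only against the immediately preceding cue keeps lines that are genuinely spoken again later in the video.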


@sga-13 commented on GitHub (Sep 16, 2024):

Is calling an external program within scope?

On Linux (or any POSIX system):

($1 is the input file here)

sed '/^[[:space:]]*$/d' "$1" | sed 's/^[ \t]*//;s/[ \t]*$//' | awk '!($1 " " $2 in arr){print; arr[$1 " " $2] = 1}'

I ran this on the generated SRT file. The first sed deletes empty lines, the second strips leading and trailing whitespace from each line, and the awk stage drops any line whose first two whitespace-separated fields have already been seen.

I actually ran the following, which has some extra bits not related to this issue (mostly merging timestamps onto the same line as the words and deleting the end timestamps):
sed '/^[[:space:]]*$/d' "$1" | sed 's/^[ \t]*//;s/[ \t]*$//' | awk '!($1 " " $2 in arr){print; arr[$1 " " $2] = 1}' | awk '/^[[:alnum:][:space:]]*[[:alpha:]]/ {if (NR > 1) {print prev, $0;} else {print $0;}; prev = "";} {prev = $0;}' | cut -d' ' -f1,4- | sed -z 's/,[0-9]*//g' >| "$1".cleaned

It took only 15 ms ± 0.5 ms (averaged over 20 runs). For context, the original file was 110 KiB; after duplicate removal it became 65 KiB, and with my additions it became 25.2 KiB. So on systems with sed and awk, the overhead is almost negligible.

For a long-term solution, we can look at a Python implementation.
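
As a starting point, the first pipeline above (the two sed stages plus the duplicate-checking awk) could be approximated in Python roughly as follows. This is a sketch, not a drop-in replacement: `dedupe_lines` is a hypothetical name, and `str.strip()` and `str.split()` treat a slightly wider set of whitespace characters than the sed and awk expressions do.

```python
from typing import Iterable, Iterator


def dedupe_lines(lines: Iterable[str]) -> Iterator[str]:
    """Strip each line, skip blank ones, and drop repeats keyed on the first two fields."""
    seen = set()
    for raw in lines:
        line = raw.strip()
        if not line:
            # Mirrors the first sed stage: whitespace-only lines are deleted.
            continue
        fields = line.split()
        # Mirrors the awk key `$1 " " $2`: only the first two fields count,
        # and a missing second field is treated as empty.
        key = (fields[0], fields[1] if len(fields) > 1 else "")
        if key not in seen:
            seen.add(key)
            yield line
```

Usage would be something like `cleaned = list(dedupe_lines(open(path, encoding="utf-8")))`, where `path` is the subtitle file. Because the key is only the first two fields, two different lines that begin with the same two words count as duplicates, exactly as in the awk version; a real implementation would probably compare whole lines or whole cues.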


@dirkf commented on GitHub (Sep 16, 2024):

It should be easy enough to implement a solution in Python once we all agree on what the problem is. Unfortunately no-one was able to spend time on defining the problem beyond my original comment (https://github.com/ytdl-org/youtube-dl/issues/30833#issuecomment-1093903030) and one slightly generic response (https://github.com/ytdl-org/youtube-dl/issues/30833#issuecomment-1384441837) to clarification requests from @GNtrazios.

Probably sed is one of the least suitable languages for expressing a requirement specification. Do I understand correctly that the proposed script is dealing with the first case in my original comment, linked above, and also removing trailing spaces and blank lines?


@sga-13 commented on GitHub (Sep 17, 2024):

Yes. Removing trailing spaces is important because the duplicate lines sometimes have extra whitespace, and deleting empty lines speeds up the awk loop (fewer lines in total).

Yes, I do understand that depending on sed ruins portability, and this is one of those cases where I could also contribute in Python, but I commented mostly because I saw no activity on the issue. Coincidentally, I wrote the second bit just today, and this issue came up.

I am trying to implement efficient summarisation of YouTube videos, hopefully with sumy (https://github.com/miso-belica/sumy), and removing these lines makes the output much better.


@dirkf commented on GitHub (Sep 17, 2024):

It's not the portability, just that a sed script is pretty opaque to anyone other than its author.

Given a list of real-world cases that need to be fixed, we can add some heuristics to fix similar cases. In addition to the cases I previously identified, it's now suggested that white-space needs to be adjusted.

Verbatim problem subtitle text extracts that can be used as test-cases are needed.


@blshkv commented on GitHub (Mar 27, 2025):

https://github.com/anandiamy/Batch-Single-Clean-Youtube-Vtt
That software (also Python-based) seems to work fine; feel free to implement something similar.
