The code should be broken out in smaller pieces #135

Closed
opened 2026-02-20 22:26:13 -05:00 by deekerman · 2 comments
Owner

Originally created by @rbrito on GitHub (Oct 14, 2011).

Hi.

I was reading the code to implement the recent xvideos IE and it became obvious to me that the code is of a very heterogeneous quality: the techniques used for extraction is, sometimes, using pre-existing python modules, sometimes a module like yours 'trivial' xml parser, sometimes with hand-rolled regular expressions etc.

The coding style is also heterogeneous.

There are also some duplications of code: I can't see any reason why the _simplify_title (just to give one example) is not in the parent InfoExtractor class: most implementations of InfoExtractor would use that, anyway.

Similarly to the report_webpage and report_extraction. Factoring those out can lead to smaller code, fewer duplications, benefits when the basic implementation is improved etc.

Another thing, possibly for a longer term: perhaps the code could be, somehow, split with each information extractor in one file and, to guarantee convenience for updates/downloads, the "real" youtube-dl could be made with a simple concatenation of the files.

This would allow some modularization when we write the code, fewer paging up/down to see other parts of the code, etc. and, most important of all, being easier to be able to hold more of the program inside our heads.

Of course, it doesn't need to be that way, and that's just brainstorming, but the general principle of a "compilation step" could be taken.

Regards.

Originally created by @rbrito on GitHub (Oct 14, 2011). Hi. I was reading the code to implement the recent xvideos IE and it became obvious to me that the code is of a very heterogeneous quality: the techniques used for extraction is, sometimes, using pre-existing python modules, sometimes a module like yours 'trivial' xml parser, sometimes with hand-rolled regular expressions etc. The coding style is also heterogeneous. There are also some duplications of code: I can't see any reason why the `_simplify_title` (just to give one example) is not in the parent `InfoExtractor` class: most implementations of `InfoExtractor` would use that, anyway. Similarly to the `report_webpage` and `report_extraction`. Factoring those out can lead to smaller code, fewer duplications, benefits when the basic implementation is improved etc. Another thing, possibly for a longer term: perhaps the code could be, somehow, split with each information extractor in one file and, to guarantee convenience for updates/downloads, the "real" youtube-dl could be made with a simple concatenation of the files. This would allow some modularization when we write the code, fewer paging up/down to see other parts of the code, etc. and, most important of all, being easier to be able to hold more of the program inside our heads. Of course, it doesn't need to be that way, and that's just brainstorming, but the general principle of a "compilation step" could be taken. Regards.
Author
Owner

@phihag commented on GitHub (Oct 15, 2011):

My trivial HTML parser is still in its infancy stages. Eventually, it will replace all the regular expressions and lxml calls. Since lxml is not in the stdlib, we can't use it, and I haven't found an other good HTML parser.

My trivial JSON parser is just needed for Python 2.5. If we ever stop supporting that, we can switch to the standard JSON. JSON IEs are a good idea since they're using an API that's unlikely to break in the future, and extremely simple.

The report_ functions are not really necessary, we should indeed have a report(message) in the InfoExtractor.

Splitting the IEs into different files is a good idea, and I'll leave this bug open for that change.

_simplify_title is indeed only a placeholder. We can simply remove all references to it once we fix #164 andintroduce a new communication interface between IE and downloader, where the IE just gives one (slightly more complex) info_dict to the rest of the world, like this:

{"playlist_name": "Software languages", "videos": [
  {id:"python", "title": "Python", "versions": [
    {"url":"http://cdn.com/python/hq", "format": "hq", "ext":"ogg"},
    {"url":"http://cdn.com/python/lq", "format": "low-quality", "ext":"flv"}
  ]},
  {id:"python", "title": "C", "versions": [
    {"url":"http://cdn.com/c/lq", "format": "low-quality", "ext":"flv"}
  ]}
]}
@phihag commented on GitHub (Oct 15, 2011): My [trivial **HTML** parser](https://github.com/phihag/etreehtml-py) is still in its infancy stages. Eventually, it will replace all the regular expressions and lxml calls. Since lxml is not in the stdlib, we can't use it, and I haven't found an other good HTML parser. My [trivial **JSON** parser](https://github.com/phihag/trivialjson) is just needed for Python 2.5. If we ever stop supporting that, we can switch to the standard JSON. JSON IEs are a good idea since they're using an API that's unlikely to break in the future, and extremely simple. The report_ functions are not really necessary, we should indeed have a report(message) in the `InfoExtractor`. Splitting the IEs into different files is a good idea, and I'll leave this bug open for that change. `_simplify_title` is indeed only a placeholder. We can simply remove all references to it once we fix #164 andintroduce a new communication interface between IE and downloader, where the IE just gives one (slightly more complex) `info_dict` to the rest of the world, like this: ``` {"playlist_name": "Software languages", "videos": [ {id:"python", "title": "Python", "versions": [ {"url":"http://cdn.com/python/hq", "format": "hq", "ext":"ogg"}, {"url":"http://cdn.com/python/lq", "format": "low-quality", "ext":"flv"} ]}, {id:"python", "title": "C", "versions": [ {"url":"http://cdn.com/c/lq", "format": "low-quality", "ext":"flv"} ]} ]} ```
Author
Owner

@archaeologistdev commented on GitHub (May 23, 2014):

@phihag although the code base still leaves a lot to be desired, I think this issue can safely be closed now as most of the improvements seem to have happened ( info_dicts are now in place, for instance).

@archaeologistdev commented on GitHub (May 23, 2014): @phihag although the code base still leaves a lot to be desired, I think this issue **can safely be closed now** as most of the improvements seem to have happened ( `info_dicts` are now in place, for instance).
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
starred/youtube-dl-ytdl-org#135
No description provided.