Making an IE for english.cntv.cn #642

Closed
opened 2026-02-20 23:09:20 -05:00 by deekerman · 3 comments

Originally created by @yasoob on GitHub (Jun 8, 2013).

Hi, sorry for disturbing you again. I am making an IE for english.cntv.cn and have run into a small problem again 😅. My test code is:

import requests
import re
import sys
import json

def get_url(url):
    # Escape every dot so "." only matches a literal dot in the host name.
    _VALID_URL = r'(?:http://)?(?:www\.)?english\.cntv\.cn/program/([^/]+)/([^/]+)/([^/]+)\.shtml'
    mobj = re.match(_VALID_URL, url)
    if mobj is None:
        print('Invalid URL: %s' % url)
        return  # nothing to extract from a URL that does not match
    print("Opening main page")
    html = requests.get(url)
    # Pull the videoCenterId out of the page's fo.addVariable(...) call.
    video_id = re.search(r'fo\.addVariable\("videoCenterId","(.*)"\);fo\.addVariable\("channelId",channelId_code\)', html.text)
    editor = re.search(r'<b>Editor:</b>(.*)<b>Source:', html.text).group(1)
    editor = editor.strip('|').strip()
    print("Opening Info_page")
    info = json.loads(requests.get('http://vdn.apps.cntv.cn/api/getHttpVideoInfo.do?pid=' + video_id.group(1)).text)
    title = info['title']
    video = info['video']
    chapters = video['chapters2'] if 'chapters2' in video else video['chapters']
    # One URL per chapter; a page may have a single chapter or several.
    urls = [x['url'] for x in chapters]
    ext = "mp4"
    print({'url'     :  urls,
           'title'   :  title,
           'ext'     :  ext,
           'editor'  :  editor
    })

if __name__ == '__main__':
    url = sys.argv[-1]
    get_url(url)

Now the problem is that the urls list does not always contain a single URL. It depends on the type of page you open. For example, http://english.cntv.cn/program/china24/20130607/106071.shtml gives only one URL, but http://english.cntv.cn/program/newshour/20120307/118190.shtml gives 5 URLs. What should we do here? Any suggestions?


@jaimeMF commented on GitHub (Jun 27, 2013):

Try to pick the one with the best quality.


@yasoob commented on GitHub (Jun 29, 2013):

I think the videos are different parts. Just take a look at the URLs:

http://v.cctv.com/flash/mp4video19/TMS/2012/03/07/dd4c11e583c34d5d89a2b1fde0c4614c_h264818000nero_aac32-1.mp4
http://v.cctv.com/flash/mp4video19/TMS/2012/03/07/dd4c11e583c34d5d89a2b1fde0c4614c_h264818000nero_aac32-2.mp4
http://v.cctv.com/flash/mp4video19/TMS/2012/03/07/dd4c11e583c34d5d89a2b1fde0c4614c_h264818000nero_aac32-3.mp4
http://v.cctv.com/flash/mp4video19/TMS/2012/03/07/dd4c11e583c34d5d89a2b1fde0c4614c_h264818000nero_aac32-4.mp4
http://v.cctv.com/flash/mp4video19/TMS/2012/03/07/dd4c11e583c34d5d89a2b1fde0c4614c_h264818000nero_aac32-5.mp4

Only the last number is incrementing. What do you say? And when I open all 5 URLs, each one contains a different video.
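The "different parts" reading can be checked mechanically by splitting each URL into a shared base and a trailing part number. The following is only a sketch: the helper names `split_part` and `looks_like_parts` and the regex are assumptions made for illustration (they are not part of youtube-dl), the pattern assumes part URLs end in `-<number>.mp4` as the five URLs above do, and it is written for Python 3, unlike the test code in the first post.

```python
import re

# Hypothetical pattern: assumes part URLs end in "-<number>.mp4".
_PART_RE = re.compile(r'^(?P<base>.+)-(?P<part>\d+)\.mp4$')


def split_part(url):
    """Return (base, part_number), or None if the URL has no part suffix."""
    m = _PART_RE.match(url)
    if m is None:
        return None
    return m.group('base'), int(m.group('part'))


def looks_like_parts(urls):
    """True if all URLs share one base and are numbered 1..n in some order."""
    parsed = [split_part(u) for u in urls]
    if not parsed or any(p is None for p in parsed):
        return False
    bases = {base for base, _ in parsed}
    numbers = sorted(part for _, part in parsed)
    return len(bases) == 1 and numbers == list(range(1, len(urls) + 1))


base = ('http://v.cctv.com/flash/mp4video19/TMS/2012/03/07/'
        'dd4c11e583c34d5d89a2b1fde0c4614c_h264818000nero_aac32')
urls = ['%s-%d.mp4' % (base, n) for n in range(1, 6)]
print(looks_like_parts(urls))
```

If the check passes, the URLs are more likely consecutive segments of one programme than alternative qualities of the same video, although only watching them (as done here) really confirms it.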


@jaimeMF commented on GitHub (Jun 29, 2013):

Return 5 info_dicts, one for each URL; it seems the video is split into different parts. It may be better to return a playlist, I'm not sure.
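A minimal sketch of the "one info_dict per URL" idea, in Python 3. The function name `build_info_dicts`, the "part N of M" title format, and the sample data are assumptions for illustration. The dictionary keys simply mirror the ones printed by the test code in the first post and are not necessarily the exact set youtube-dl expects.

```python
def build_info_dicts(title, chapters, editor, ext='mp4'):
    """Return a list with one info dict per chapter URL."""
    total = len(chapters)
    results = []
    for index, chapter in enumerate(chapters, start=1):
        # Only label the parts when the video is actually split.
        if total == 1:
            part_title = title
        else:
            part_title = '%s (part %d of %d)' % (title, index, total)
        results.append({
            'url': chapter['url'],
            'title': part_title,
            'ext': ext,
            'editor': editor,
        })
    return results


# Made-up sample shaped like the 'chapters' list used in the test code.
sample_chapters = [
    {'url': 'http://example.com/video-1.mp4'},
    {'url': 'http://example.com/video-2.mp4'},
]
for entry in build_info_dicts('News Hour', sample_chapters, 'Jane Doe'):
    print(entry['title'], entry['url'])
```

A page with a single chapter then yields a one-element list with the plain title, so the single-URL and five-URL pages go through the same code path.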

Reference
starred/youtube-dl-ytdl-org#642