Making an IE for english.cntv.cn #642

Closed

opened 2026-02-20 23:09:20 -05:00 by deekerman · 3 comments

deekerman commented

2026-02-20 23:09:20 -05:00

Owner

Originally created by @yasoob on GitHub (Jun 8, 2013).

Hi, sorry to bother you again. I am writing an IE (InfoExtractor) for english.cntv.cn and have run into another small problem 😅. My test code is:

import requests
import re
import sys
import json

def get_url(url):
    _VALID_URL = r'(?:http://)?(?:www\.)?english\.cntv\.cn/program/([^/]+)/([^/]+)/([^/]+)\.shtml'
    mobj = re.match(_VALID_URL, url)
    if mobj is None:
        print('Invalid URL: %s' % url)
        return  # nothing to extract from a URL that does not match
    print("Opening main page")
    html = requests.get(url)
    # The page passes the video ID to the Flash player as "videoCenterId".
    video_id = re.search(r'fo\.addVariable\("videoCenterId","(.*)"\);fo\.addVariable\("channelId",channelId_code\)', html.text)
    editor = (re.search(r'<b>Editor:</b>(.*)<b>Source:', html.text)).group(1)
    editor = (editor.strip('|')).strip()
    print("Opening Info_page")
    info = json.loads(requests.get('http://vdn.apps.cntv.cn/api/getHttpVideoInfo.do?pid=' + video_id.group(1)).text)
    title = info['title']
    video = info['video']
    # Prefer 'chapters2' when the response has it, otherwise fall back to 'chapters'.
    chapters = video['chapters2'] if 'chapters2' in video else video['chapters']
    # One URL per chapter; this list may hold one entry or several.
    urls = [x['url'] for x in chapters]
    ext = "mp4"
    print({'url'     :  urls,
           'title'   :  title,
           'ext'     :  ext,
           'editor'  :  editor
    })

if __name__ == '__main__':
    url = sys.argv[-1]
    get_url(url)

The problem is that the urls list does not always contain a single URL; it depends on the type of page you open. For example, http://english.cntv.cn/program/china24/20130607/106071.shtml gives only one URL, but http://english.cntv.cn/program/newshour/20120307/118190.shtml gives 5 URLs. What should we do here? Any suggestions?
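
To make the single-versus-multiple case easier to reproduce without hitting the site, here is a minimal sketch of the chapter handling in isolation. The two sample payloads are made up for illustration (only the `video`, `chapters2`, `chapters` and `url` keys come from the test code above), and `extract_chapter_urls` is a hypothetical helper name.

```python
def extract_chapter_urls(info):
    """Return the chapter URLs from a decoded getHttpVideoInfo response."""
    video = info['video']
    # Prefer 'chapters2' when present, as the test code above does.
    chapters = video['chapters2'] if 'chapters2' in video else video['chapters']
    return [chapter['url'] for chapter in chapters]


# Made-up payloads: real responses are assumed to carry more fields than this.
sample_single = {
    'title': 'one part',
    'video': {'chapters': [{'url': 'http://example.com/a-1.mp4'}]},
}
sample_multi = {
    'title': 'three parts',
    'video': {
        'chapters2': [{'url': 'http://example.com/b-%d.mp4' % n} for n in (1, 2, 3)],
    },
}

print(len(extract_chapter_urls(sample_single)))
print(len(extract_chapter_urls(sample_multi)))
```

Whatever the page type, the helper always returns a list, so the caller can treat a one-element list and a five-element list the same way.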

deekerman closed this issue

2026-02-20 23:09:22 -05:00

deekerman commented

2026-02-20 23:09:24 -05:00

Author

Owner

@jaimeMF commented on GitHub (Jun 27, 2013):

Try to pick the one with the best quality.

deekerman commented

2026-02-20 23:09:25 -05:00

Author

Owner

@yasoob commented on GitHub (Jun 29, 2013):

I think the videos are different parts. Just take a look at the urls:

http://v.cctv.com/flash/mp4video19/TMS/2012/03/07/dd4c11e583c34d5d89a2b1fde0c4614c_h264818000nero_aac32-1.mp4
http://v.cctv.com/flash/mp4video19/TMS/2012/03/07/dd4c11e583c34d5d89a2b1fde0c4614c_h264818000nero_aac32-2.mp4
http://v.cctv.com/flash/mp4video19/TMS/2012/03/07/dd4c11e583c34d5d89a2b1fde0c4614c_h264818000nero_aac32-3.mp4
http://v.cctv.com/flash/mp4video19/TMS/2012/03/07/dd4c11e583c34d5d89a2b1fde0c4614c_h264818000nero_aac32-4.mp4
http://v.cctv.com/flash/mp4video19/TMS/2012/03/07/dd4c11e583c34d5d89a2b1fde0c4614c_h264818000nero_aac32-5.mp4

Only the last number increments. What do you think? Also, when I open all 5 URLs, each one contains a different video.
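
The "only the last number changes" observation can be checked mechanically. This is a rough sketch, assuming the parts always end in `-<number>.mp4` as in the five URLs above; `split_parts` is a hypothetical helper, not part of youtube-dl.

```python
import re

# "<common base>-<part number>.mp4"
_PART_RE = re.compile(r'^(?P<base>.+)-(?P<part>\d+)\.mp4$')


def split_parts(urls):
    """Return (base, part_numbers) if all URLs share one base, else None."""
    bases = set()
    parts = []
    for url in urls:
        match = _PART_RE.match(url)
        if match is None:
            return None  # URL does not follow the "-N.mp4" pattern
        bases.add(match.group('base'))
        parts.append(int(match.group('part')))
    if len(bases) != 1:
        return None  # empty input, or URLs that belong to different videos
    return bases.pop(), parts


urls = [
    'http://v.cctv.com/flash/mp4video19/TMS/2012/03/07/'
    'dd4c11e583c34d5d89a2b1fde0c4614c_h264818000nero_aac32-%d.mp4' % n
    for n in range(1, 6)
]
base, parts = split_parts(urls)
print(parts)  # [1, 2, 3, 4, 5]
```

A shared base with consecutive part numbers supports the idea that these are segments of one programme rather than alternative qualities.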

deekerman commented

2026-02-20 23:09:25 -05:00

Author

Owner

@jaimeMF commented on GitHub (Jun 29, 2013):

Return 5 info_dicts, one for each URL; it seems the video is split into different parts. It may be better to return a playlist, but I'm not sure.
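
A minimal sketch of the "one info dict per URL" idea, written as a standalone function rather than as real extractor code. The `url`, `title`, `ext` and `editor` keys mirror the dict printed by the test code above; the `id` key, the `_partN` suffix and the function name `build_part_infos` are assumptions made for illustration, and whether these are exactly the fields youtube-dl expects is not confirmed here.

```python
def build_part_infos(video_id, title, editor, urls, ext='mp4'):
    """Build one info dict per part URL (hypothetical helper)."""
    infos = []
    total = len(urls)
    for index, url in enumerate(urls, start=1):
        if total == 1:
            # Single-part video: keep the ID and title unchanged.
            part_id, part_title = video_id, title
        else:
            # Assumed naming scheme so that each part gets a distinct ID and title.
            part_id = '%s_part%d' % (video_id, index)
            part_title = '%s (part %d of %d)' % (title, index, total)
        infos.append({
            'id': part_id,
            'url': url,
            'title': part_title,
            'ext': ext,
            'editor': editor,
        })
    return infos


# Made-up values, only to show the shape of the result.
infos = build_part_infos(
    'abc123', 'News Hour', 'Some Editor',
    ['http://example.com/v-1.mp4', 'http://example.com/v-2.mp4'],
)
for info in infos:
    print(info['id'], info['title'])
```

With a single URL the function returns a one-element list, so both page types from the original question go through the same code path.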

deekerman referenced this issue

2026-02-20 23:21:30 -05:00

youtube start playing video from at specific time #1138

deekerman referenced this issue

2026-02-20 23:59:45 -05:00

Youtube video/mp3 cropping #1550

deekerman referenced this issue

2026-02-21 01:12:54 -05:00

time stamp in youtube URL #3256

deekerman referenced this issue

2026-02-21 01:20:33 -05:00

downloading snippets #3419