From: Rogério Brito Date: Mon, 30 Nov 2015 02:53:47 +0000 (-0200) Subject: Imported Upstream version 2015.11.27.1 X-Git-Url: https://git.rapsys.eu/youtubedl/commitdiff_plain/9ed7fe4fe4c445eb7d9f3197bb300d0db8f1807a Imported Upstream version 2015.11.27.1 --- diff --git a/README.md b/README.md index 38db97c..df419ab 100644 --- a/README.md +++ b/README.md @@ -319,7 +319,8 @@ which means you can modify it, redistribute it or use it however you like. --all-formats Download all available video formats --prefer-free-formats Prefer free video formats unless a specific one is requested - -F, --list-formats List all available formats + -F, --list-formats List all available formats of specified + videos --youtube-skip-dash-manifest Do not download the DASH manifests and related data on YouTube videos --merge-output-format FORMAT If a merge is required (e.g. @@ -329,8 +330,8 @@ which means you can modify it, redistribute it or use it however you like. ## Subtitle Options: --write-sub Write subtitle file - --write-auto-sub Write automatic subtitle file (YouTube - only) + --write-auto-sub Write automatically generated subtitle file + (YouTube only) --all-subs Download all the available subtitles of the video --list-subs List all available subtitles for the video @@ -534,6 +535,12 @@ Most people asking this question are not aware that youtube-dl now defaults to d Apparently YouTube requires you to pass a CAPTCHA test if you download too much. We're [considering to provide a way to let you solve the CAPTCHA](https://github.com/rg3/youtube-dl/issues/154), but at the moment, your best course of action is pointing a webbrowser to the youtube URL, solving the CAPTCHA, and restart youtube-dl. +### Do I need any other programs? + +youtube-dl works fine on its own on most sites. However, if you want to convert video/audio, you'll need [avconv](https://libav.org/) or [ffmpeg](https://www.ffmpeg.org/). 
On some sites - most notably YouTube - videos can be retrieved in a higher quality format without sound. youtube-dl will detect whether avconv/ffmpeg is present and automatically pick the best option. + +Videos or video formats streamed via RTMP protocol can only be downloaded when [rtmpdump](https://rtmpdump.mplayerhq.hu/) is installed. Downloading MMS and RTSP videos requires either [mplayer](http://mplayerhq.hu/) or [mpv](https://mpv.io/) to be installed. + ### I have downloaded a video but how can I play it? Once the video is fully downloaded, use any video player, such as [vlc](http://www.videolan.org) or [mplayer](http://www.mplayerhq.hu/). diff --git a/README.txt b/README.txt index fc369d2..54d4c33 100644 --- a/README.txt +++ b/README.txt @@ -350,7 +350,8 @@ Video Format Options: --all-formats Download all available video formats --prefer-free-formats Prefer free video formats unless a specific one is requested - -F, --list-formats List all available formats + -F, --list-formats List all available formats of specified + videos --youtube-skip-dash-manifest Do not download the DASH manifests and related data on YouTube videos --merge-output-format FORMAT If a merge is required (e.g. @@ -362,8 +363,8 @@ Video Format Options: Subtitle Options: --write-sub Write subtitle file - --write-auto-sub Write automatic subtitle file (YouTube - only) + --write-auto-sub Write automatically generated subtitle file + (YouTube only) --all-subs Download all the available subtitles of the video --list-subs List all available subtitles for the video @@ -697,6 +698,18 @@ CAPTCHA, but at the moment, your best course of action is pointing a webbrowser to the youtube URL, solving the CAPTCHA, and restart youtube-dl. +Do I need any other programs? + +youtube-dl works fine on its own on most sites. However, if you want to +convert video/audio, you'll need avconv or ffmpeg. On some sites - most +notably YouTube - videos can be retrieved in a higher quality format +without sound. 
youtube-dl will detect whether avconv/ffmpeg is present +and automatically pick the best option. + +Videos or video formats streamed via RTMP protocol can only be +downloaded when rtmpdump is installed. Downloading MMS and RTSP videos +requires either mplayer or mpv to be installed. + I have downloaded a video but how can I play it? Once the video is fully downloaded, use any video player, such as vlc or diff --git a/docs/supportedsites.md b/docs/supportedsites.md index a9820c1..1df4086 100644 --- a/docs/supportedsites.md +++ b/docs/supportedsites.md @@ -67,7 +67,8 @@ - **Bpb**: Bundeszentrale für politische Bildung - **BR**: Bayerischer Rundfunk Mediathek - **Break** - - **Brightcove** + - **brightcove:legacy** + - **brightcove:new** - **bt:article**: Bergens Tidende Articles - **bt:vestlendingen**: Bergens Tidende - Vestlendingen - **BuzzFeed** @@ -128,6 +129,7 @@ - **Discovery** - **Dotsub** - **DouyuTV**: 斗鱼 + - **DPlay** - **dramafever** - **dramafever:series** - **DRBonanza** @@ -200,7 +202,6 @@ - **GodTube** - **GoldenMoustache** - **Golem** - - **GorillaVid**: GorillaVid.in, daclips.in, movpod.in, fastvideo.in, realvid.net and filehoot.com - **Goshgay** - **Groupon** - **Hark** @@ -367,6 +368,7 @@ - **nowness:playlist** - **nowness:series** - **NowTV** + - **NowTVList** - **nowvideo**: NowVideo - **npo**: npo.nl and ntr.nl - **npo.nl:live** @@ -426,7 +428,6 @@ - **qqmusic:playlist**: QQ音乐 - 歌单 - **qqmusic:singer**: QQ音乐 - 歌手 - **qqmusic:toplist**: QQ音乐 - 排行榜 - - **Quickscope**: Quick Scope - **QuickVid** - **R7** - **radio.de** @@ -493,6 +494,7 @@ - **soompi:show** - **soundcloud** - **soundcloud:playlist** + - **soundcloud:search**: Soundcloud search - **soundcloud:set** - **soundcloud:user** - **soundgasm** @@ -671,6 +673,7 @@ - **WSJ**: Wall Street Journal - **XBef** - **XboxClips** + - **XFileShare**: XFileShare based sites: GorillaVid.in, daclips.in, movpod.in, fastvideo.in, realvid.net, filehoot.com and vidto.me - **XHamster** - **XHamsterEmbed** - 
**XMinus** @@ -705,6 +708,7 @@ - **youtube:show**: YouTube.com (multi-season) shows - **youtube:subscriptions**: YouTube.com subscriptions feed, "ytsubs" keyword (requires authentication) - **youtube:user**: YouTube.com user videos (URL or "ytuser" keyword) + - **youtube:user:playlists**: YouTube.com user playlists - **youtube:watchlater**: Youtube watch later list, ":ytwatchlater" for short (requires authentication) - **Zapiks** - **ZDF** diff --git a/test/test_utils.py b/test/test_utils.py index 01829f7..501355c 100644 --- a/test/test_utils.py +++ b/test/test_utils.py @@ -21,6 +21,7 @@ from youtube_dl.utils import ( clean_html, DateRange, detect_exe_version, + determine_ext, encodeFilename, escape_rfc3986, escape_url, @@ -210,8 +211,8 @@ class TestUtil(unittest.TestCase): self.assertEqual(unescapeHTML('%20;'), '%20;') self.assertEqual(unescapeHTML('&#x2F;'), '/') self.assertEqual(unescapeHTML('&#47;'), '/') - self.assertEqual( - unescapeHTML('&eacute;'), 'é') + self.assertEqual(unescapeHTML('&eacute;'), 'é') + self.assertEqual(unescapeHTML('&#2013266066;'), '&#2013266066;') def test_daterange(self): _20century = DateRange("19000101", "20000101") @@ -238,6 +239,13 @@ class TestUtil(unittest.TestCase): self.assertEqual(unified_strdate('25-09-2014'), '20140925') self.assertEqual(unified_strdate('UNKNOWN DATE FORMAT'), None) + def test_determine_ext(self): + self.assertEqual(determine_ext('http://example.com/foo/bar.mp4/?download'), 'mp4') + self.assertEqual(determine_ext('http://example.com/foo/bar/?download', None), None) + self.assertEqual(determine_ext('http://example.com/foo/bar.nonext/?download', None), None) + self.assertEqual(determine_ext('http://example.com/foo/bar/mp4?download', None), None) + self.assertEqual(determine_ext('http://example.com/foo/bar.m3u8//?download'), 'm3u8') + def test_find_xpath_attr(self): testxml = ''' diff --git a/youtube-dl b/youtube-dl index 3403fe9..ee414e4 100755 Binary files a/youtube-dl and b/youtube-dl differ diff --git a/youtube-dl.1 b/youtube-dl.1 index 78efa3d..d5967c8 
100644 --- a/youtube-dl.1 +++ b/youtube-dl.1 @@ -636,7 +636,7 @@ Prefer free video formats unless a specific one is requested .RE .TP .B \-F, \-\-list\-formats -List all available formats +List all available formats of specified videos .RS .RE .TP @@ -660,7 +660,7 @@ Write subtitle file .RE .TP .B \-\-write\-auto\-sub -Write automatic subtitle file (YouTube only) +Write automatically generated subtitle file (YouTube only) .RS .RE .TP @@ -1132,6 +1132,20 @@ We\[aq]re considering to provide a way to let you solve the CAPTCHA (https://github.com/rg3/youtube-dl/issues/154), but at the moment, your best course of action is pointing a webbrowser to the youtube URL, solving the CAPTCHA, and restart youtube\-dl. +.SS Do I need any other programs? +.PP +youtube\-dl works fine on its own on most sites. +However, if you want to convert video/audio, you\[aq]ll need +avconv (https://libav.org/) or ffmpeg (https://www.ffmpeg.org/). +On some sites \- most notably YouTube \- videos can be retrieved in a +higher quality format without sound. +youtube\-dl will detect whether avconv/ffmpeg is present and +automatically pick the best option. +.PP +Videos or video formats streamed via RTMP protocol can only be +downloaded when rtmpdump (https://rtmpdump.mplayerhq.hu/) is installed. +Downloading MMS and RTSP videos requires either +mplayer (http://mplayerhq.hu/) or mpv (https://mpv.io/) to be installed. .SS I have downloaded a video but how can I play it? 
.PP Once the video is fully downloaded, use any video player, such as diff --git a/youtube-dl.fish b/youtube-dl.fish index 7ab5255..c70743c 100644 --- a/youtube-dl.fish +++ b/youtube-dl.fish @@ -107,12 +107,12 @@ complete --command youtube-dl --long-option sleep-interval --description 'Number complete --command youtube-dl --long-option format --short-option f --description 'Video format code, see the "FORMAT SELECTION" for all the info' complete --command youtube-dl --long-option all-formats --description 'Download all available video formats' complete --command youtube-dl --long-option prefer-free-formats --description 'Prefer free video formats unless a specific one is requested' -complete --command youtube-dl --long-option list-formats --short-option F --description 'List all available formats' +complete --command youtube-dl --long-option list-formats --short-option F --description 'List all available formats of specified videos' complete --command youtube-dl --long-option youtube-include-dash-manifest complete --command youtube-dl --long-option youtube-skip-dash-manifest --description 'Do not download the DASH manifests and related data on YouTube videos' complete --command youtube-dl --long-option merge-output-format --description 'If a merge is required (e.g. bestvideo+bestaudio), output to given container format. One of mkv, mp4, ogg, webm, flv. 
Ignored if no merge is required' complete --command youtube-dl --long-option write-sub --description 'Write subtitle file' -complete --command youtube-dl --long-option write-auto-sub --description 'Write automatic subtitle file (YouTube only)' +complete --command youtube-dl --long-option write-auto-sub --description 'Write automatically generated subtitle file (YouTube only)' complete --command youtube-dl --long-option all-subs --description 'Download all the available subtitles of the video' complete --command youtube-dl --long-option list-subs --description 'List all available subtitles for the video' complete --command youtube-dl --long-option sub-format --description 'Subtitle format, accepts formats preference, for example: "srt" or "ass/srt/best"' diff --git a/youtube_dl/YoutubeDL.py b/youtube_dl/YoutubeDL.py index 1783ce0..9a8c7da 100755 --- a/youtube_dl/YoutubeDL.py +++ b/youtube_dl/YoutubeDL.py @@ -28,6 +28,7 @@ if os.name == 'nt': import ctypes from .compat import ( + compat_basestring, compat_cookiejar, compat_expanduser, compat_get_terminal_size, @@ -63,6 +64,7 @@ from .utils import ( SameFileError, sanitize_filename, sanitize_path, + sanitized_Request, std_headers, subtitles_filename, UnavailableVideoError, @@ -156,7 +158,7 @@ class YoutubeDL(object): writethumbnail: Write the thumbnail image to a file write_all_thumbnails: Write all thumbnail formats to files writesubtitles: Write the video subtitles to a file - writeautomaticsub: Write the automatic subtitles to a file + writeautomaticsub: Write the automatically generated subtitles to a file allsubtitles: Downloads all the subtitles of the video (requires writesubtitles or writeautomaticsub) listsubtitles: Lists all available subtitles for the video @@ -833,6 +835,7 @@ class YoutubeDL(object): extra_info=extra) playlist_results.append(entry_result) ie_result['entries'] = playlist_results + self.to_screen('[download] Finished downloading playlist: %s' % playlist) return ie_result elif result_type == 
'compat_list': self.report_warning( @@ -937,7 +940,7 @@ class YoutubeDL(object): filter_parts.append(string) def _remove_unused_ops(tokens): - # Remove operators that we don't use and join them with the sourrounding strings + # Remove operators that we don't use and join them with the surrounding strings # for example: 'mp4' '-' 'baseline' '-' '16x9' is converted to 'mp4-baseline-16x9' ALLOWED_OPS = ('/', '+', ',', '(', ')') last_string, last_start, last_end, last_line = None, None, None, None @@ -1186,7 +1189,7 @@ class YoutubeDL(object): return res def _calc_cookies(self, info_dict): - pr = compat_urllib_request.Request(info_dict['url']) + pr = sanitized_Request(info_dict['url']) self.cookiejar.add_cookie_header(pr) return pr.get_header('Cookie') @@ -1870,6 +1873,8 @@ class YoutubeDL(object): def urlopen(self, req): """ Start an HTTP download """ + if isinstance(req, compat_basestring): + req = sanitized_Request(req) return self._opener.open(req, timeout=self._socket_timeout) def print_debug_header(self): diff --git a/youtube_dl/__init__.py b/youtube_dl/__init__.py index 5e2ed4d..9f131f5 100644 --- a/youtube_dl/__init__.py +++ b/youtube_dl/__init__.py @@ -377,7 +377,7 @@ def _real_main(argv=None): with YoutubeDL(ydl_opts) as ydl: # Update version if opts.update_self: - update_self(ydl.to_screen, opts.verbose) + update_self(ydl.to_screen, opts.verbose, ydl._opener) # Remove cache dir if opts.rm_cachedir: diff --git a/youtube_dl/downloader/common.py b/youtube_dl/downloader/common.py index 29a4500..b8bf8da 100644 --- a/youtube_dl/downloader/common.py +++ b/youtube_dl/downloader/common.py @@ -42,7 +42,7 @@ class FileDownloader(object): min_filesize: Skip files smaller than this size max_filesize: Skip files larger than this size xattr_set_filesize: Set ytdl.filesize user xattribute with expected size. - (experimenatal) + (experimental) external_downloader_args: A list of additional command-line arguments for the external downloader. 
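Note on the `urlopen` hunk in youtube_dl/YoutubeDL.py above: the added `isinstance` check lets callers pass either a plain URL string or an already built request object. The sketch below illustrates that pattern only. It is not the project's code: `make_request` and `normalize` are hypothetical names, the body of `sanitized_Request` is not shown in this part of the diff, and the scheme-relative URL handling is an assumption chosen purely for illustration.

```python
import urllib.request


def make_request(url, *args, **kwargs):
    """Hypothetical stand-in for sanitized_Request (its body is not shown in this diff)."""
    # Assumption for illustration only: give scheme-relative URLs a scheme.
    if url.startswith('//'):
        url = 'http:' + url
    return urllib.request.Request(url, *args, **kwargs)


def normalize(req):
    """Mirror the isinstance check added to YoutubeDL.urlopen."""
    # On Python 3 the compat string type is plain str.
    if isinstance(req, str):
        req = make_request(req)
    return req


# A bare string is wrapped; an existing Request passes through untouched.
wrapped = normalize('//example.com/video')
existing = urllib.request.Request('http://example.com/other')
print(wrapped.full_url)
print(normalize(existing) is existing)
```

Routing every string through one constructor keeps URL clean-up in a single place, which appears to be the motivation for replacing the direct `compat_urllib_request.Request(...)` calls throughout this commit.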
diff --git a/youtube_dl/downloader/dash.py b/youtube_dl/downloader/dash.py index 8b6fa27..535f2a7 100644 --- a/youtube_dl/downloader/dash.py +++ b/youtube_dl/downloader/dash.py @@ -3,7 +3,7 @@ from __future__ import unicode_literals import re from .common import FileDownloader -from ..compat import compat_urllib_request +from ..utils import sanitized_Request class DashSegmentsFD(FileDownloader): @@ -22,7 +22,7 @@ class DashSegmentsFD(FileDownloader): def append_url_to_file(outf, target_url, target_name, remaining_bytes=None): self.to_screen('[DashSegments] %s: Downloading %s' % (info_dict['id'], target_name)) - req = compat_urllib_request.Request(target_url) + req = sanitized_Request(target_url) if remaining_bytes is not None: req.add_header('Range', 'bytes=0-%d' % (remaining_bytes - 1)) diff --git a/youtube_dl/downloader/hls.py b/youtube_dl/downloader/hls.py index 9a83a73..92765a3 100644 --- a/youtube_dl/downloader/hls.py +++ b/youtube_dl/downloader/hls.py @@ -35,7 +35,7 @@ class HlsFD(FileDownloader): # [http @ 00000000003d2fa0] No trailing CRLF found in HTTP header. 
args += [ '-headers', - ''.join('%s: %s\r\n' % (key, val) for key, val in info_dict['http_headers'].items())] + ''.join('%s: %s\r\n' % (key, val) for key, val in info_dict['http_headers'].items() if key.lower() != 'accept-encoding')] args += ['-i', url, '-f', 'mp4', '-c', 'copy', '-bsf:a', 'aac_adtstoasc'] diff --git a/youtube_dl/downloader/http.py b/youtube_dl/downloader/http.py index a29f5cf..56840e0 100644 --- a/youtube_dl/downloader/http.py +++ b/youtube_dl/downloader/http.py @@ -7,14 +7,12 @@ import time import re from .common import FileDownloader -from ..compat import ( - compat_urllib_request, - compat_urllib_error, -) +from ..compat import compat_urllib_error from ..utils import ( ContentTooShortError, encodeFilename, sanitize_open, + sanitized_Request, ) @@ -29,8 +27,8 @@ class HttpFD(FileDownloader): add_headers = info_dict.get('http_headers') if add_headers: headers.update(add_headers) - basic_request = compat_urllib_request.Request(url, None, headers) - request = compat_urllib_request.Request(url, None, headers) + basic_request = sanitized_Request(url, None, headers) + request = sanitized_Request(url, None, headers) is_test = self.params.get('test', False) diff --git a/youtube_dl/downloader/rtmp.py b/youtube_dl/downloader/rtmp.py index f1d219b..14d56db 100644 --- a/youtube_dl/downloader/rtmp.py +++ b/youtube_dl/downloader/rtmp.py @@ -117,7 +117,7 @@ class RtmpFD(FileDownloader): return False # Download using rtmpdump. rtmpdump returns exit code 2 when - # the connection was interrumpted and resuming appears to be + # the connection was interrupted and resuming appears to be # possible. This is part of rtmpdump's normal usage, AFAIK. 
basic_args = [ 'rtmpdump', '--verbose', '-r', url, diff --git a/youtube_dl/extractor/__init__.py b/youtube_dl/extractor/__init__.py index 0a90da7..947b836 100644 --- a/youtube_dl/extractor/__init__.py +++ b/youtube_dl/extractor/__init__.py @@ -60,7 +60,10 @@ from .bloomberg import BloombergIE from .bpb import BpbIE from .br import BRIE from .breakcom import BreakIE -from .brightcove import BrightcoveIE +from .brightcove import ( + BrightcoveLegacyIE, + BrightcoveNewIE, +) from .buzzfeed import BuzzFeedIE from .byutv import BYUtvIE from .c56 import C56IE @@ -129,6 +132,7 @@ from .dfb import DFBIE from .dhm import DHMIE from .dotsub import DotsubIE from .douyutv import DouyuTVIE +from .dplay import DPlayIE from .dramafever import ( DramaFeverIE, DramaFeverSeriesIE, @@ -221,7 +225,6 @@ from .goldenmoustache import GoldenMoustacheIE from .golem import GolemIE from .googleplus import GooglePlusIE from .googlesearch import GoogleSearchIE -from .gorillavid import GorillaVidIE from .goshgay import GoshgayIE from .groupon import GrouponIE from .hark import HarkIE @@ -418,7 +421,10 @@ from .nowness import ( NownessPlaylistIE, NownessSeriesIE, ) -from .nowtv import NowTVIE +from .nowtv import ( + NowTVIE, + NowTVListIE, +) from .nowvideo import NowVideoIE from .npo import ( NPOIE, @@ -456,10 +462,7 @@ from .orf import ( from .parliamentliveuk import ParliamentLiveUKIE from .patreon import PatreonIE from .pbs import PBSIE -from .periscope import ( - PeriscopeIE, - QuickscopeIE, -) +from .periscope import PeriscopeIE from .philharmoniedeparis import PhilharmonieDeParisIE from .phoenix import PhoenixIE from .photobucket import PhotobucketIE @@ -573,7 +576,8 @@ from .soundcloud import ( SoundcloudIE, SoundcloudSetIE, SoundcloudUserIE, - SoundcloudPlaylistIE + SoundcloudPlaylistIE, + SoundcloudSearchIE ) from .soundgasm import ( SoundgasmIE, @@ -786,6 +790,7 @@ from .wrzuta import WrzutaIE from .wsj import WSJIE from .xbef import XBefIE from .xboxclips import XboxClipsIE +from 
.xfileshare import XFileShareIE from .xhamster import ( XHamsterIE, XHamsterEmbedIE, @@ -829,6 +834,7 @@ from .youtube import ( YoutubeTruncatedIDIE, YoutubeTruncatedURLIE, YoutubeUserIE, + YoutubeUserPlaylistsIE, YoutubeWatchLaterIE, ) from .zapiks import ZapiksIE diff --git a/youtube_dl/extractor/aljazeera.py b/youtube_dl/extractor/aljazeera.py index 184a14a..5b2c0dc 100644 --- a/youtube_dl/extractor/aljazeera.py +++ b/youtube_dl/extractor/aljazeera.py @@ -15,7 +15,7 @@ class AlJazeeraIE(InfoExtractor): 'description': 'As a birth attendant advocating for family planning, Remy is on the frontline of Tondo\'s battle with overcrowding.', 'uploader': 'Al Jazeera English', }, - 'add_ie': ['Brightcove'], + 'add_ie': ['BrightcoveLegacy'], 'skip': 'Not accessible from Travis CI server', } @@ -32,5 +32,5 @@ class AlJazeeraIE(InfoExtractor): 'playerKey=AQ~~%2CAAAAmtVJIFk~%2CTVGOQ5ZTwJbeMWnq5d_H4MOM57xfzApc' '&%40videoPlayer={0}'.format(brightcove_id) ), - 'ie_key': 'Brightcove', + 'ie_key': 'BrightcoveLegacy', } diff --git a/youtube_dl/extractor/atresplayer.py b/youtube_dl/extractor/atresplayer.py index 29f8795..50e47ba 100644 --- a/youtube_dl/extractor/atresplayer.py +++ b/youtube_dl/extractor/atresplayer.py @@ -7,11 +7,11 @@ from .common import InfoExtractor from ..compat import ( compat_str, compat_urllib_parse, - compat_urllib_request, ) from ..utils import ( int_or_none, float_or_none, + sanitized_Request, xpath_text, ExtractorError, ) @@ -63,7 +63,7 @@ class AtresPlayerIE(InfoExtractor): 'j_password': password, } - request = compat_urllib_request.Request( + request = sanitized_Request( self._LOGIN_URL, compat_urllib_parse.urlencode(login_form).encode('utf-8')) request.add_header('Content-Type', 'application/x-www-form-urlencoded') response = self._download_webpage( @@ -94,7 +94,7 @@ class AtresPlayerIE(InfoExtractor): formats = [] for fmt in ['windows', 'android_tablet']: - request = compat_urllib_request.Request( + request = sanitized_Request( 
self._URL_VIDEO_TEMPLATE.format(fmt, episode_id, timestamp_shifted, token)) request.add_header('User-Agent', self._USER_AGENT) diff --git a/youtube_dl/extractor/bambuser.py b/youtube_dl/extractor/bambuser.py index 8dff1d6..da986e0 100644 --- a/youtube_dl/extractor/bambuser.py +++ b/youtube_dl/extractor/bambuser.py @@ -6,13 +6,13 @@ import itertools from .common import InfoExtractor from ..compat import ( compat_urllib_parse, - compat_urllib_request, compat_str, ) from ..utils import ( ExtractorError, int_or_none, float_or_none, + sanitized_Request, ) @@ -57,7 +57,7 @@ class BambuserIE(InfoExtractor): 'pass': password, } - request = compat_urllib_request.Request( + request = sanitized_Request( self._LOGIN_URL, compat_urllib_parse.urlencode(login_form).encode('utf-8')) request.add_header('Referer', self._LOGIN_URL) response = self._download_webpage( @@ -126,7 +126,7 @@ class BambuserChannelIE(InfoExtractor): '&sort=created&access_mode=0%2C1%2C2&limit={count}' '&method=broadcast&format=json&vid_older_than={last}' ).format(user=user, count=self._STEP, last=last_id) - req = compat_urllib_request.Request(req_url) + req = sanitized_Request(req_url) # Without setting this header, we wouldn't get any result req.add_header('Referer', 'http://bambuser.com/channel/%s' % user) data = self._download_json( diff --git a/youtube_dl/extractor/bbc.py b/youtube_dl/extractor/bbc.py index a55a6db..33b296e 100644 --- a/youtube_dl/extractor/bbc.py +++ b/youtube_dl/extractor/bbc.py @@ -27,7 +27,7 @@ class BBCCoUkIE(InfoExtractor): _MEDIASELECTOR_URLS = [ # Provides HQ HLS streams with even better quality that pc mediaset but fails # with geolocation in some cases when it's even not geo restricted at all (e.g. - # http://www.bbc.co.uk/programmes/b06bp7lf) + # http://www.bbc.co.uk/programmes/b06bp7lf). Also may fail with selectionunavailable. 
'http://open.live.bbc.co.uk/mediaselector/5/select/version/2.0/mediaset/iptv-all/vpid/%s', 'http://open.live.bbc.co.uk/mediaselector/5/select/version/2.0/mediaset/pc/vpid/%s', ] @@ -334,7 +334,7 @@ class BBCCoUkIE(InfoExtractor): return self._download_media_selector_url( mediaselector_url % programme_id, programme_id) except BBCCoUkIE.MediaSelectionError as e: - if e.id in ('notukerror', 'geolocation'): + if e.id in ('notukerror', 'geolocation', 'selectionunavailable'): last_exception = e continue self._raise_extractor_error(e) @@ -345,7 +345,7 @@ class BBCCoUkIE(InfoExtractor): media_selection = self._download_xml( url, programme_id, 'Downloading media selection XML') except ExtractorError as ee: - if isinstance(ee.cause, compat_HTTPError) and ee.cause.code == 403: + if isinstance(ee.cause, compat_HTTPError) and ee.cause.code in (403, 404): media_selection = compat_etree_fromstring(ee.cause.read().decode('utf-8')) else: raise diff --git a/youtube_dl/extractor/bliptv.py b/youtube_dl/extractor/bliptv.py index c329628..35375f7 100644 --- a/youtube_dl/extractor/bliptv.py +++ b/youtube_dl/extractor/bliptv.py @@ -4,14 +4,12 @@ import re from .common import InfoExtractor -from ..compat import ( - compat_urllib_request, - compat_urlparse, -) +from ..compat import compat_urlparse from ..utils import ( clean_html, int_or_none, parse_iso8601, + sanitized_Request, unescapeHTML, xpath_text, xpath_with_ns, @@ -219,7 +217,7 @@ class BlipTVIE(InfoExtractor): for lang, url in subtitles_urls.items(): # For some weird reason, blip.tv serves a video instead of subtitles # when we request with a common UA - req = compat_urllib_request.Request(url) + req = sanitized_Request(url) req.add_header('User-Agent', 'youtube-dl') subtitles[lang] = [{ # The extension is 'srt' but it's actually an 'ass' file diff --git a/youtube_dl/extractor/bloomberg.py b/youtube_dl/extractor/bloomberg.py index 0dca29b..11ace91 100644 --- a/youtube_dl/extractor/bloomberg.py +++ 
b/youtube_dl/extractor/bloomberg.py @@ -6,9 +6,9 @@ from .common import InfoExtractor class BloombergIE(InfoExtractor): - _VALID_URL = r'https?://www\.bloomberg\.com/news/videos/[^/]+/(?P[^/?#]+)' + _VALID_URL = r'https?://www\.bloomberg\.com/news/[^/]+/[^/]+/(?P[^/?#]+)' - _TEST = { + _TESTS = [{ 'url': 'http://www.bloomberg.com/news/videos/b/aaeae121-5949-481e-a1ce-4562db6f5df2', # The md5 checksum changes 'info_dict': { @@ -17,7 +17,10 @@ class BloombergIE(InfoExtractor): 'title': 'Shah\'s Presentation on Foreign-Exchange Strategies', 'description': 'md5:a8ba0302912d03d246979735c17d2761', }, - } + }, { + 'url': 'http://www.bloomberg.com/news/articles/2015-11-12/five-strange-things-that-have-been-happening-in-financial-markets', + 'only_matching': True, + }] def _real_extract(self, url): name = self._match_id(url) diff --git a/youtube_dl/extractor/brightcove.py b/youtube_dl/extractor/brightcove.py index 1686cdd..f5ebae1 100644 --- a/youtube_dl/extractor/brightcove.py +++ b/youtube_dl/extractor/brightcove.py @@ -11,7 +11,6 @@ from ..compat import ( compat_str, compat_urllib_parse, compat_urllib_parse_urlparse, - compat_urllib_request, compat_urlparse, compat_xml_parse_error, ) @@ -20,12 +19,18 @@ from ..utils import ( ExtractorError, find_xpath_attr, fix_xml_ampersands, + float_or_none, + js_to_json, + int_or_none, + parse_iso8601, + sanitized_Request, unescapeHTML, unsmuggle_url, ) -class BrightcoveIE(InfoExtractor): +class BrightcoveLegacyIE(InfoExtractor): + IE_NAME = 'brightcove:legacy' _VALID_URL = r'(?:https?://.*brightcove\.com/(services|viewer).*?\?|brightcove:)(?P.*)' _FEDERATED_URL_TEMPLATE = 'http://c.brightcove.com/services/viewer/htmlFederated?%s' @@ -245,7 +250,7 @@ class BrightcoveIE(InfoExtractor): def _get_video_info(self, video_id, query_str, query, referer=None): request_url = self._FEDERATED_URL_TEMPLATE % query_str - req = compat_urllib_request.Request(request_url) + req = sanitized_Request(request_url) linkBase = query.get('linkBaseURL') if 
linkBase is not None: referer = linkBase[0] @@ -346,3 +351,172 @@ class BrightcoveIE(InfoExtractor): if 'url' not in info and not info.get('formats'): raise ExtractorError('Unable to extract video url for %s' % info['id']) return info + + +class BrightcoveNewIE(InfoExtractor): + IE_NAME = 'brightcove:new' + _VALID_URL = r'https?://players\.brightcove\.net/(?P\d+)/(?P[^/]+)_(?P[^/]+)/index\.html\?.*videoId=(?P\d+)' + _TESTS = [{ + 'url': 'http://players.brightcove.net/929656772001/e41d32dc-ec74-459e-a845-6c69f7b724ea_default/index.html?videoId=4463358922001', + 'md5': 'c8100925723840d4b0d243f7025703be', + 'info_dict': { + 'id': '4463358922001', + 'ext': 'mp4', + 'title': 'Meet the man behind Popcorn Time', + 'description': 'md5:eac376a4fe366edc70279bfb681aea16', + 'duration': 165.768, + 'timestamp': 1441391203, + 'upload_date': '20150904', + 'uploader_id': '929656772001', + 'formats': 'mincount:22', + }, + }, { + # with rtmp streams + 'url': 'http://players.brightcove.net/4036320279001/5d112ed9-283f-485f-a7f9-33f42e8bc042_default/index.html?videoId=4279049078001', + 'info_dict': { + 'id': '4279049078001', + 'ext': 'mp4', + 'title': 'Titansgrave: Chapter 0', + 'description': 'Titansgrave: Chapter 0', + 'duration': 1242.058, + 'timestamp': 1433556729, + 'upload_date': '20150606', + 'uploader_id': '4036320279001', + 'formats': 'mincount:41', + }, + 'params': { + 'skip_download': True, + } + }] + + @staticmethod + def _extract_urls(webpage): + # Reference: + # 1. http://docs.brightcove.com/en/video-cloud/brightcove-player/guides/publish-video.html#setvideoiniframe + # 2. http://docs.brightcove.com/en/video-cloud/brightcove-player/guides/publish-video.html#setvideousingjavascript) + # 3. 
http://docs.brightcove.com/en/video-cloud/brightcove-player/guides/embed-in-page.html + + entries = [] + + # Look for iframe embeds [1] + for _, url in re.findall( + r']+src=(["\'])((?:https?:)//players\.brightcove\.net/\d+/[^/]+/index\.html.+?)\1', webpage): + entries.append(url) + + # Look for embed_in_page embeds [2] + for video_id, account_id, player_id, embed in re.findall( + # According to examples from [3] it's unclear whether video id + # may be optional and what to do when it is + r'''(?sx) + ]+ + data-video-id=["\'](\d+)["\'][^>]*>.*? + .*? + ]+ + src=["\'](?:https?:)?//players\.brightcove\.net/ + (\d+)/([\da-f-]+)_([^/]+)/index\.min\.js + ''', webpage): + entries.append( + 'http://players.brightcove.net/%s/%s_%s/index.html?videoId=%s' + % (account_id, player_id, embed, video_id)) + + return entries + + def _real_extract(self, url): + account_id, player_id, embed, video_id = re.match(self._VALID_URL, url).groups() + + webpage = self._download_webpage( + 'http://players.brightcove.net/%s/%s_%s/index.min.js' + % (account_id, player_id, embed), video_id) + + policy_key = None + + catalog = self._search_regex( + r'catalog\(({.+?})\);', webpage, 'catalog', default=None) + if catalog: + catalog = self._parse_json( + js_to_json(catalog), video_id, fatal=False) + if catalog: + policy_key = catalog.get('policyKey') + + if not policy_key: + policy_key = self._search_regex( + r'policyKey\s*:\s*(["\'])(?P.+?)\1', + webpage, 'policy key', group='pk') + + req = sanitized_Request( + 'https://edge.api.brightcove.com/playback/v1/accounts/%s/videos/%s' + % (account_id, video_id), + headers={'Accept': 'application/json;pk=%s' % policy_key}) + json_data = self._download_json(req, video_id) + + title = json_data['name'] + + formats = [] + for source in json_data.get('sources', []): + source_type = source.get('type') + src = source.get('src') + if source_type == 'application/x-mpegURL': + if not src: + continue + m3u8_formats = self._extract_m3u8_formats( + src, video_id, 
'mp4', entry_protocol='m3u8_native', + m3u8_id='hls', fatal=False) + if m3u8_formats: + formats.extend(m3u8_formats) + else: + streaming_src = source.get('streaming_src') + stream_name, app_name = source.get('stream_name'), source.get('app_name') + if not src and not streaming_src and (not stream_name or not app_name): + continue + tbr = float_or_none(source.get('avg_bitrate'), 1000) + height = int_or_none(source.get('height')) + f = { + 'tbr': tbr, + 'width': int_or_none(source.get('width')), + 'height': height, + 'filesize': int_or_none(source.get('size')), + 'container': source.get('container'), + 'vcodec': source.get('codec'), + 'ext': source.get('container').lower(), + } + + def build_format_id(kind): + format_id = kind + if tbr: + format_id += '-%dk' % int(tbr) + if height: + format_id += '-%dp' % height + return format_id + + if src or streaming_src: + f.update({ + 'url': src or streaming_src, + 'format_id': build_format_id('http' if src else 'http-streaming'), + 'preference': 2 if src else 1, + }) + else: + f.update({ + 'url': app_name, + 'play_path': stream_name, + 'format_id': build_format_id('rtmp'), + }) + formats.append(f) + self._sort_formats(formats) + + description = json_data.get('description') + thumbnail = json_data.get('thumbnail') + timestamp = parse_iso8601(json_data.get('published_at')) + duration = float_or_none(json_data.get('duration'), 1000) + tags = json_data.get('tags', []) + + return { + 'id': video_id, + 'title': title, + 'description': description, + 'thumbnail': thumbnail, + 'duration': duration, + 'timestamp': timestamp, + 'uploader_id': account_id, + 'formats': formats, + 'tags': tags, + } diff --git a/youtube_dl/extractor/cbs.py b/youtube_dl/extractor/cbs.py index 75fffb1..40d07ab 100644 --- a/youtube_dl/extractor/cbs.py +++ b/youtube_dl/extractor/cbs.py @@ -1,6 +1,10 @@ from __future__ import unicode_literals from .common import InfoExtractor +from ..utils import ( + sanitized_Request, + smuggle_url, +) class 
CBSIE(InfoExtractor): @@ -46,13 +50,19 @@ class CBSIE(InfoExtractor): def _real_extract(self, url): display_id = self._match_id(url) - webpage = self._download_webpage(url, display_id) + request = sanitized_Request(url) + # Android UA is served with higher quality (720p) streams (see + # https://github.com/rg3/youtube-dl/issues/7490) + request.add_header('User-Agent', 'Mozilla/5.0 (Linux; Android 4.4; Nexus 5)') + webpage = self._download_webpage(request, display_id) real_id = self._search_regex( [r"video\.settings\.pid\s*=\s*'([^']+)';", r"cbsplayer\.pid\s*=\s*'([^']+)';"], webpage, 'real video ID') return { '_type': 'url_transparent', 'ie_key': 'ThePlatform', - 'url': 'theplatform:%s' % real_id, + 'url': smuggle_url( + 'http://link.theplatform.com/s/dJ5BDC/%s?mbr=true&manifest=m3u' % real_id, + {'force_smil_url': True}), 'display_id': display_id, } diff --git a/youtube_dl/extractor/cbsnews.py b/youtube_dl/extractor/cbsnews.py index 52e61d8..f9a64a0 100644 --- a/youtube_dl/extractor/cbsnews.py +++ b/youtube_dl/extractor/cbsnews.py @@ -67,9 +67,12 @@ class CBSNewsIE(InfoExtractor): 'format_id': format_id, } if uri.startswith('rtmp'): + play_path = re.sub( + r'{slistFilePath}', '', + uri.split('<break>')[-1].split('{break}')[-1]) fmt.update({ 'app': 'ondemand?auth=cbs', - 'play_path': 'mp4:' + uri.split('<break>')[-1], + 'play_path': 'mp4:' + play_path, 'player_url': 'http://www.cbsnews.com/[[IMPORT]]/vidtech.cbsinteractive.com/player/3_3_0/CBSI_PLAYER_HD.swf', 'page_url': 'http://www.cbsnews.com', 'ext': 'flv', diff --git a/youtube_dl/extractor/ceskatelevize.py b/youtube_dl/extractor/ceskatelevize.py index e857e66..6f7b2a7 100644 --- a/youtube_dl/extractor/ceskatelevize.py +++ b/youtube_dl/extractor/ceskatelevize.py @@ -5,7 +5,6 @@ import re from .common import InfoExtractor from ..compat import ( - compat_urllib_request, compat_urllib_parse, compat_urllib_parse_unquote, compat_urllib_parse_urlparse, @@ -13,6 +12,7 @@ from ..compat import ( from ..utils import ( ExtractorError,
float_or_none, + sanitized_Request, ) @@ -100,7 +100,7 @@ class CeskaTelevizeIE(InfoExtractor): 'requestSource': 'iVysilani', } - req = compat_urllib_request.Request( + req = sanitized_Request( 'http://www.ceskatelevize.cz/ivysilani/ajax/get-client-playlist', data=compat_urllib_parse.urlencode(data)) @@ -115,7 +115,7 @@ class CeskaTelevizeIE(InfoExtractor): if playlist_url == 'error_region': raise ExtractorError(NOT_AVAILABLE_STRING, expected=True) - req = compat_urllib_request.Request(compat_urllib_parse_unquote(playlist_url)) + req = sanitized_Request(compat_urllib_parse_unquote(playlist_url)) req.add_header('Referer', url) playlist_title = self._og_search_title(webpage) diff --git a/youtube_dl/extractor/collegerama.py b/youtube_dl/extractor/collegerama.py index fedd484..40667a0 100644 --- a/youtube_dl/extractor/collegerama.py +++ b/youtube_dl/extractor/collegerama.py @@ -3,10 +3,10 @@ from __future__ import unicode_literals import json from .common import InfoExtractor -from ..compat import compat_urllib_request from ..utils import ( float_or_none, int_or_none, + sanitized_Request, ) @@ -52,7 +52,7 @@ class CollegeRamaIE(InfoExtractor): } } - request = compat_urllib_request.Request( + request = sanitized_Request( 'http://collegerama.tudelft.nl/Mediasite/PlayerService/PlayerService.svc/json/GetPlayerOptions', json.dumps(player_options_request)) request.add_header('Content-Type', 'application/json') diff --git a/youtube_dl/extractor/common.py b/youtube_dl/extractor/common.py index 5e263f8..eb9bfa3 100644 --- a/youtube_dl/extractor/common.py +++ b/youtube_dl/extractor/common.py @@ -19,7 +19,6 @@ from ..compat import ( compat_urllib_error, compat_urllib_parse, compat_urllib_parse_urlparse, - compat_urllib_request, compat_urlparse, compat_str, compat_etree_fromstring, @@ -37,6 +36,7 @@ from ..utils import ( int_or_none, RegexNotFoundError, sanitize_filename, + sanitized_Request, unescapeHTML, unified_strdate, url_basename, @@ -891,6 +891,11 @@ class 
InfoExtractor(object): if not media_nodes: manifest_version = '2.0' media_nodes = manifest.findall('{http://ns.adobe.com/f4m/2.0}media') + base_url = xpath_text( + manifest, ['{http://ns.adobe.com/f4m/1.0}baseURL', '{http://ns.adobe.com/f4m/2.0}baseURL'], + 'base URL', default=None) + if base_url: + base_url = base_url.strip() for i, media_el in enumerate(media_nodes): if manifest_version == '2.0': media_url = media_el.attrib.get('href') or media_el.attrib.get('url') @@ -898,7 +903,7 @@ class InfoExtractor(object): continue manifest_url = ( media_url if media_url.startswith('http://') or media_url.startswith('https://') - else ('/'.join(manifest_url.split('/')[:-1]) + '/' + media_url)) + else ((base_url or '/'.join(manifest_url.split('/')[:-1])) + '/' + media_url)) # If media_url is itself a f4m manifest do the recursive extraction # since bitrates in parent manifest (this one) and media_url manifest # may differ leading to inability to resolve the format by requested @@ -1280,7 +1285,7 @@ class InfoExtractor(object): def _get_cookies(self, url): """ Return a compat_cookies.SimpleCookie with the cookies for the url """ - req = compat_urllib_request.Request(url) + req = sanitized_Request(url) self._downloader.cookiejar.add_cookie_header(req) return compat_cookies.SimpleCookie(req.get_header('Cookie')) diff --git a/youtube_dl/extractor/crunchyroll.py b/youtube_dl/extractor/crunchyroll.py index 6e5999c..00d943f 100644 --- a/youtube_dl/extractor/crunchyroll.py +++ b/youtube_dl/extractor/crunchyroll.py @@ -23,6 +23,7 @@ from ..utils import ( int_or_none, lowercase_escape, remove_end, + sanitized_Request, unified_strdate, urlencode_postdata, xpath_text, @@ -46,7 +47,7 @@ class CrunchyrollBaseIE(InfoExtractor): 'name': username, 'password': password, }) - login_request = compat_urllib_request.Request(login_url, data) + login_request = sanitized_Request(login_url, data) login_request.add_header('Content-Type', 'application/x-www-form-urlencoded') 
self._download_webpage(login_request, None, False, 'Wrong login info') @@ -55,7 +56,7 @@ class CrunchyrollBaseIE(InfoExtractor): def _download_webpage(self, url_or_request, video_id, note=None, errnote=None, fatal=True, tries=1, timeout=5, encoding=None): request = (url_or_request if isinstance(url_or_request, compat_urllib_request.Request) - else compat_urllib_request.Request(url_or_request)) + else sanitized_Request(url_or_request)) # Accept-Language must be set explicitly to accept any language to avoid issues # similar to https://github.com/rg3/youtube-dl/issues/6797. # Along with IP address Crunchyroll uses Accept-Language to guess whether georestriction @@ -307,7 +308,7 @@ Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text 'video_uploader', fatal=False) playerdata_url = compat_urllib_parse_unquote(self._html_search_regex(r'"config_url":"([^"]+)', webpage, 'playerdata_url')) - playerdata_req = compat_urllib_request.Request(playerdata_url) + playerdata_req = sanitized_Request(playerdata_url) playerdata_req.data = compat_urllib_parse.urlencode({'current_page': webpage_url}) playerdata_req.add_header('Content-Type', 'application/x-www-form-urlencoded') playerdata = self._download_webpage(playerdata_req, video_id, note='Downloading media info') @@ -319,7 +320,7 @@ Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text for fmt in re.findall(r'showmedia\.([0-9]{3,4})p', webpage): stream_quality, stream_format = self._FORMAT_IDS[fmt] video_format = fmt + 'p' - streamdata_req = compat_urllib_request.Request( + streamdata_req = sanitized_Request( 'http://www.crunchyroll.com/xml/?req=RpcApiVideoPlayer_GetStandardConfig&media_id=%s&video_format=%s&video_quality=%s' % (stream_id, stream_format, stream_quality), compat_urllib_parse.urlencode({'current_page': url}).encode('utf-8')) diff --git a/youtube_dl/extractor/dailymotion.py b/youtube_dl/extractor/dailymotion.py index bc78239..ab7f3ae 100644 --- 
a/youtube_dl/extractor/dailymotion.py +++ b/youtube_dl/extractor/dailymotion.py @@ -7,15 +7,13 @@ import itertools from .common import InfoExtractor -from ..compat import ( - compat_str, - compat_urllib_request, -) +from ..compat import compat_str from ..utils import ( ExtractorError, determine_ext, int_or_none, parse_iso8601, + sanitized_Request, str_to_int, unescapeHTML, ) @@ -25,7 +23,7 @@ class DailymotionBaseInfoExtractor(InfoExtractor): @staticmethod def _build_request(url): """Build a request with the family filter disabled""" - request = compat_urllib_request.Request(url) + request = sanitized_Request(url) request.add_header('Cookie', 'family_filter=off; ff=off') return request diff --git a/youtube_dl/extractor/dcn.py b/youtube_dl/extractor/dcn.py index 6f2fea5..9737cff 100644 --- a/youtube_dl/extractor/dcn.py +++ b/youtube_dl/extractor/dcn.py @@ -2,13 +2,11 @@ from __future__ import unicode_literals from .common import InfoExtractor -from ..compat import ( - compat_urllib_parse, - compat_urllib_request, -) +from ..compat import compat_urllib_parse from ..utils import ( int_or_none, parse_iso8601, + sanitized_Request, ) @@ -36,7 +34,7 @@ class DCNIE(InfoExtractor): def _real_extract(self, url): video_id = self._match_id(url) - request = compat_urllib_request.Request( + request = sanitized_Request( 'http://admin.mangomolo.com/analytics/index.php/plus/video?id=%s' % video_id, headers={'Origin': 'http://www.dcndigital.ae'}) diff --git a/youtube_dl/extractor/dplay.py b/youtube_dl/extractor/dplay.py new file mode 100644 index 0000000..6cda56a --- /dev/null +++ b/youtube_dl/extractor/dplay.py @@ -0,0 +1,51 @@ +# encoding: utf-8 +from __future__ import unicode_literals + +import time + +from .common import InfoExtractor +from ..utils import int_or_none + + +class DPlayIE(InfoExtractor): + _VALID_URL = r'http://www\.dplay\.se/[^/]+/(?P<id>[^/?#]+)' + + _TEST = { + 'url':
'http://www.dplay.se/nugammalt-77-handelser-som-format-sverige/season-1-svensken-lar-sig-njuta-av-livet/', + 'info_dict': { + 'id': '3172', + 'ext': 'mp4', + 'display_id': 'season-1-svensken-lar-sig-njuta-av-livet', + 'title': 'Svensken lär sig njuta av livet', + 'duration': 2650, + }, + } + + def _real_extract(self, url): + display_id = self._match_id(url) + webpage = self._download_webpage(url, display_id) + video_id = self._search_regex( + r'data-video-id="(\d+)"', webpage, 'video id') + + info = self._download_json( + 'http://www.dplay.se/api/v2/ajax/videos?video_id=' + video_id, + video_id)['data'][0] + + self._set_cookie( + 'secure.dplay.se', 'dsc-geo', + '{"countryCode":"NL","expiry":%d}' % ((time.time() + 20 * 60) * 1000)) + # TODO: consider adding support for 'stream_type=hds', it seems to + # require setting some cookies + manifest_url = self._download_json( + 'https://secure.dplay.se/secure/api/v2/user/authorization/stream/%s?stream_type=hls' % video_id, + video_id, 'Getting manifest url for hls stream')['hls'] + formats = self._extract_m3u8_formats( + manifest_url, video_id, ext='mp4', entry_protocol='m3u8_native') + + return { + 'id': video_id, + 'display_id': display_id, + 'title': info['title'], + 'formats': formats, + 'duration': int_or_none(info.get('video_metadata_length'), scale=1000), + } diff --git a/youtube_dl/extractor/dramafever.py b/youtube_dl/extractor/dramafever.py index 38e6597..d836c1a 100644 --- a/youtube_dl/extractor/dramafever.py +++ b/youtube_dl/extractor/dramafever.py @@ -7,7 +7,6 @@ from .common import InfoExtractor from ..compat import ( compat_HTTPError, compat_urllib_parse, - compat_urllib_request, compat_urlparse, ) from ..utils import ( @@ -16,6 +15,7 @@ from ..utils import ( determine_ext, int_or_none, parse_iso8601, + sanitized_Request, ) @@ -51,7 +51,7 @@ class DramaFeverBaseIE(InfoExtractor): 'password': password, } - request = compat_urllib_request.Request( + request = sanitized_Request( self._LOGIN_URL, 
compat_urllib_parse.urlencode(login_form).encode('utf-8')) response = self._download_webpage( request, None, 'Logging in as %s' % username) diff --git a/youtube_dl/extractor/dumpert.py b/youtube_dl/extractor/dumpert.py index 1f00386..e5aadcd 100644 --- a/youtube_dl/extractor/dumpert.py +++ b/youtube_dl/extractor/dumpert.py @@ -2,14 +2,17 @@ from __future__ import unicode_literals import base64 +import re from .common import InfoExtractor -from ..compat import compat_urllib_request -from ..utils import qualities +from ..utils import ( + qualities, + sanitized_Request, +) class DumpertIE(InfoExtractor): - _VALID_URL = r'https?://(?:www\.)?dumpert\.nl/(?:mediabase|embed)/(?P<id>[0-9]+/[0-9a-zA-Z]+)' + _VALID_URL = r'(?P<protocol>https?)://(?:www\.)?dumpert\.nl/(?:mediabase|embed)/(?P<id>[0-9]+/[0-9a-zA-Z]+)' _TESTS = [{ 'url': 'http://www.dumpert.nl/mediabase/6646981/951bc60f/', 'md5': '1b9318d7d5054e7dcb9dc7654f21d643', @@ -26,10 +29,12 @@ class DumpertIE(InfoExtractor): }] def _real_extract(self, url): - video_id = self._match_id(url) + mobj = re.match(self._VALID_URL, url) + video_id = mobj.group('id') + protocol = mobj.group('protocol') - url = 'https://www.dumpert.nl/mediabase/' + video_id - req = compat_urllib_request.Request(url) + url = '%s://www.dumpert.nl/mediabase/%s' % (protocol, video_id) + req = sanitized_Request(url) req.add_header('Cookie', 'nsfw=1; cpc=10') webpage = self._download_webpage(req, video_id) diff --git a/youtube_dl/extractor/eitb.py b/youtube_dl/extractor/eitb.py index 357a219..c83845f 100644 --- a/youtube_dl/extractor/eitb.py +++ b/youtube_dl/extractor/eitb.py @@ -2,11 +2,11 @@ from __future__ import unicode_literals from .common import InfoExtractor -from ..compat import compat_urllib_request from ..utils import ( float_or_none, int_or_none, parse_iso8601, + sanitized_Request, ) @@ -57,7 +57,7 @@ class EitbIE(InfoExtractor): hls_url = media.get('HLS_SURL') if hls_url: - request = compat_urllib_request.Request( + request = sanitized_Request(
'http://mam.eitb.eus/mam/REST/ServiceMultiweb/DomainRestrictedSecurity/TokenAuth/', headers={'Referer': url}) token_data = self._download_json( diff --git a/youtube_dl/extractor/escapist.py b/youtube_dl/extractor/escapist.py index c85b4c4..a3d7bbb 100644 --- a/youtube_dl/extractor/escapist.py +++ b/youtube_dl/extractor/escapist.py @@ -3,13 +3,12 @@ from __future__ import unicode_literals import json from .common import InfoExtractor -from ..compat import compat_urllib_request - from ..utils import ( determine_ext, clean_html, int_or_none, float_or_none, + sanitized_Request, ) @@ -75,7 +74,7 @@ class EscapistIE(InfoExtractor): video_id = ims_video['videoID'] key = ims_video['hash'] - config_req = compat_urllib_request.Request( + config_req = sanitized_Request( 'http://www.escapistmagazine.com/videos/' 'vidconfig.php?videoID=%s&hash=%s' % (video_id, key)) config_req.add_header('Referer', url) diff --git a/youtube_dl/extractor/everyonesmixtape.py b/youtube_dl/extractor/everyonesmixtape.py index d872d82..493d38a 100644 --- a/youtube_dl/extractor/everyonesmixtape.py +++ b/youtube_dl/extractor/everyonesmixtape.py @@ -3,11 +3,9 @@ from __future__ import unicode_literals import re from .common import InfoExtractor -from ..compat import ( - compat_urllib_request, -) from ..utils import ( ExtractorError, + sanitized_Request, ) @@ -42,7 +40,7 @@ class EveryonesMixtapeIE(InfoExtractor): playlist_id = mobj.group('id') pllist_url = 'http://everyonesmixtape.com/mixtape.php?a=getMixes&u=-1&linked=%s&explore=' % playlist_id - pllist_req = compat_urllib_request.Request(pllist_url) + pllist_req = sanitized_Request(pllist_url) pllist_req.add_header('X-Requested-With', 'XMLHttpRequest') playlist_list = self._download_json( @@ -55,7 +53,7 @@ class EveryonesMixtapeIE(InfoExtractor): raise ExtractorError('Playlist id not found') pl_url = 'http://everyonesmixtape.com/mixtape.php?a=getMix&id=%s&userId=null&code=' % playlist_no - pl_req = compat_urllib_request.Request(pl_url) + pl_req = 
sanitized_Request(pl_url) pl_req.add_header('X-Requested-With', 'XMLHttpRequest') playlist = self._download_json( pl_req, playlist_id, note='Downloading playlist info') diff --git a/youtube_dl/extractor/extremetube.py b/youtube_dl/extractor/extremetube.py index c5677c8..3403581 100644 --- a/youtube_dl/extractor/extremetube.py +++ b/youtube_dl/extractor/extremetube.py @@ -3,9 +3,9 @@ from __future__ import unicode_literals import re from .common import InfoExtractor -from ..compat import compat_urllib_request from ..utils import ( int_or_none, + sanitized_Request, str_to_int, ) @@ -37,7 +37,7 @@ class ExtremeTubeIE(InfoExtractor): def _real_extract(self, url): video_id = self._match_id(url) - req = compat_urllib_request.Request(url) + req = sanitized_Request(url) req.add_header('Cookie', 'age_verified=1') webpage = self._download_webpage(req, video_id) diff --git a/youtube_dl/extractor/facebook.py b/youtube_dl/extractor/facebook.py index f53c516..fd85441 100644 --- a/youtube_dl/extractor/facebook.py +++ b/youtube_dl/extractor/facebook.py @@ -10,11 +10,11 @@ from ..compat import ( compat_str, compat_urllib_error, compat_urllib_parse_unquote, - compat_urllib_request, ) from ..utils import ( ExtractorError, limit_length, + sanitized_Request, urlencode_postdata, get_element_by_id, clean_html, @@ -73,7 +73,7 @@ class FacebookIE(InfoExtractor): if useremail is None: return - login_page_req = compat_urllib_request.Request(self._LOGIN_URL) + login_page_req = sanitized_Request(self._LOGIN_URL) login_page_req.add_header('Cookie', 'locale=en_US') login_page = self._download_webpage(login_page_req, None, note='Downloading login page', @@ -94,7 +94,7 @@ class FacebookIE(InfoExtractor): 'timezone': '-60', 'trynum': '1', } - request = compat_urllib_request.Request(self._LOGIN_URL, urlencode_postdata(login_form)) + request = sanitized_Request(self._LOGIN_URL, urlencode_postdata(login_form)) request.add_header('Content-Type', 'application/x-www-form-urlencoded') try: login_results = 
self._download_webpage(request, None, @@ -109,7 +109,7 @@ class FacebookIE(InfoExtractor): r'name="h"\s+(?:\w+="[^"]+"\s+)*?value="([^"]+)"', login_results, 'h'), 'name_action_selected': 'dont_save', } - check_req = compat_urllib_request.Request(self._CHECKPOINT_URL, urlencode_postdata(check_form)) + check_req = sanitized_Request(self._CHECKPOINT_URL, urlencode_postdata(check_form)) check_req.add_header('Content-Type', 'application/x-www-form-urlencoded') check_response = self._download_webpage(check_req, None, note='Confirming login') diff --git a/youtube_dl/extractor/fc2.py b/youtube_dl/extractor/fc2.py index a406945..92e8c57 100644 --- a/youtube_dl/extractor/fc2.py +++ b/youtube_dl/extractor/fc2.py @@ -12,6 +12,7 @@ from ..compat import ( from ..utils import ( encode_dict, ExtractorError, + sanitized_Request, ) @@ -57,7 +58,7 @@ class FC2IE(InfoExtractor): } login_data = compat_urllib_parse.urlencode(encode_dict(login_form_strs)).encode('utf-8') - request = compat_urllib_request.Request( + request = sanitized_Request( 'https://secure.id.fc2.com/index.php?mode=login&switch_language=en', login_data) login_results = self._download_webpage(request, None, note='Logging in', errnote='Unable to log in') @@ -66,7 +67,7 @@ class FC2IE(InfoExtractor): return False # this is also needed - login_redir = compat_urllib_request.Request('http://id.fc2.com/?mode=redirect&login=done') + login_redir = sanitized_Request('http://id.fc2.com/?mode=redirect&login=done') self._download_webpage( login_redir, None, note='Login redirect', errnote='Login redirect failed') diff --git a/youtube_dl/extractor/flickr.py b/youtube_dl/extractor/flickr.py index 2fe76d6..91cd46e 100644 --- a/youtube_dl/extractor/flickr.py +++ b/youtube_dl/extractor/flickr.py @@ -3,10 +3,10 @@ from __future__ import unicode_literals import re from .common import InfoExtractor -from ..compat import compat_urllib_request from ..utils import ( ExtractorError, find_xpath_attr, + sanitized_Request, ) @@ -30,7 +30,7 @@ 
class FlickrIE(InfoExtractor): video_id = mobj.group('id') video_uploader_id = mobj.group('uploader_id') webpage_url = 'http://www.flickr.com/photos/' + video_uploader_id + '/' + video_id - req = compat_urllib_request.Request(webpage_url) + req = sanitized_Request(webpage_url) req.add_header( 'User-Agent', # it needs a more recent version diff --git a/youtube_dl/extractor/fourtube.py b/youtube_dl/extractor/fourtube.py index fb6d108..fc4a5a0 100644 --- a/youtube_dl/extractor/fourtube.py +++ b/youtube_dl/extractor/fourtube.py @@ -3,12 +3,10 @@ from __future__ import unicode_literals import re from .common import InfoExtractor -from ..compat import ( - compat_urllib_request, -) from ..utils import ( parse_duration, parse_iso8601, + sanitized_Request, str_to_int, ) @@ -93,7 +91,7 @@ class FourTubeIE(InfoExtractor): b'Content-Type': b'application/x-www-form-urlencoded', b'Origin': b'http://www.4tube.com', } - token_req = compat_urllib_request.Request(token_url, b'{}', headers) + token_req = sanitized_Request(token_url, b'{}', headers) tokens = self._download_json(token_req, video_id) formats = [{ 'url': tokens[format]['token'], diff --git a/youtube_dl/extractor/funnyordie.py b/youtube_dl/extractor/funnyordie.py index f5f1368..7f21d74 100644 --- a/youtube_dl/extractor/funnyordie.py +++ b/youtube_dl/extractor/funnyordie.py @@ -45,11 +45,20 @@ class FunnyOrDieIE(InfoExtractor): links.sort(key=lambda link: 1 if link[1] == 'mp4' else 0) - bitrates = self._html_search_regex(r']+src=(["\'])(?P.+?/master\.m3u8)\1', + webpage, 'm3u8 url', default=None, group='url') formats = [] + + m3u8_formats = self._extract_m3u8_formats( + m3u8_url, video_id, 'mp4', 'm3u8_native', m3u8_id='hls', fatal=False) + if m3u8_formats: + formats.extend(m3u8_formats) + + bitrates = [int(bitrate) for bitrate in re.findall(r'[,/]v(\d+)[,/]', m3u8_url)] + bitrates.sort() + for bitrate in bitrates: for link in links: formats.append({ diff --git a/youtube_dl/extractor/gdcvault.py 
b/youtube_dl/extractor/gdcvault.py index a6834db..3befd3e 100644 --- a/youtube_dl/extractor/gdcvault.py +++ b/youtube_dl/extractor/gdcvault.py @@ -3,13 +3,11 @@ from __future__ import unicode_literals import re from .common import InfoExtractor -from ..compat import ( - compat_urllib_parse, - compat_urllib_request, -) +from ..compat import compat_urllib_parse from ..utils import ( remove_end, HEADRequest, + sanitized_Request, ) @@ -125,7 +123,7 @@ class GDCVaultIE(InfoExtractor): 'password': password, } - request = compat_urllib_request.Request(login_url, compat_urllib_parse.urlencode(login_form)) + request = sanitized_Request(login_url, compat_urllib_parse.urlencode(login_form)) request.add_header('Content-Type', 'application/x-www-form-urlencoded') self._download_webpage(request, display_id, 'Logging in') start_page = self._download_webpage(webpage_url, display_id, 'Getting authenticated video page') diff --git a/youtube_dl/extractor/generic.py b/youtube_dl/extractor/generic.py index d0b486d..5075d13 100644 --- a/youtube_dl/extractor/generic.py +++ b/youtube_dl/extractor/generic.py @@ -11,7 +11,6 @@ from .youtube import YoutubeIE from ..compat import ( compat_etree_fromstring, compat_urllib_parse_unquote, - compat_urllib_request, compat_urlparse, compat_xml_parse_error, ) @@ -22,6 +21,7 @@ from ..utils import ( HEADRequest, is_html, orderedSet, + sanitized_Request, smuggle_url, unescapeHTML, unified_strdate, @@ -30,7 +30,10 @@ from ..utils import ( url_basename, xpath_text, ) -from .brightcove import BrightcoveIE +from .brightcove import ( + BrightcoveLegacyIE, + BrightcoveNewIE, +) from .nbc import NBCSportsVPlayerIE from .ooyala import OoyalaIE from .rutv import RUTVIE @@ -275,7 +278,7 @@ class GenericIE(InfoExtractor): # it also tests brightcove videos that need to set the 'Referer' in the # http requests { - 'add_ie': ['Brightcove'], + 'add_ie': ['BrightcoveLegacy'], 'url': 
'http://www.bfmtv.com/video/bfmbusiness/cours-bourse/cours-bourse-l-analyse-technique-154522/', 'info_dict': { 'id': '2765128793001', @@ -299,7 +302,7 @@ class GenericIE(InfoExtractor): 'uploader': 'thestar.com', 'description': 'Mississauga resident David Farmer is still out of power as a result of the ice storm a month ago. To keep the house warm, Farmer cuts wood from his property for a wood burning stove downstairs.', }, - 'add_ie': ['Brightcove'], + 'add_ie': ['BrightcoveLegacy'], }, { 'url': 'http://www.championat.com/video/football/v/87/87499.html', @@ -314,7 +317,7 @@ class GenericIE(InfoExtractor): }, { # https://github.com/rg3/youtube-dl/issues/3541 - 'add_ie': ['Brightcove'], + 'add_ie': ['BrightcoveLegacy'], 'url': 'http://www.kijk.nl/sbs6/leermijvrouwenkennen/videos/jqMiXKAYan2S/aflevering-1', 'info_dict': { 'id': '3866516442001', @@ -820,6 +823,19 @@ class GenericIE(InfoExtractor): 'title': 'Os Guinness // Is It Fools Talk? // Unbelievable? Conference 2014', }, }, + # Kaltura embed protected with referrer + { + 'url': 'http://www.disney.nl/disney-channel/filmpjes/achter-de-schermen#/videoId/violetta-achter-de-schermen-ruggero', + 'info_dict': { + 'id': '1_g4fbemnq', + 'ext': 'mp4', + 'title': 'Violetta - Achter De Schermen - Ruggero', + 'description': 'Achter de schermen met Ruggero', + 'timestamp': 1435133761, + 'upload_date': '20150624', + 'uploader_id': 'echojecka', + }, + }, # Eagle.Platform embed (generic URL) { 'url': 'http://lenta.ru/news/2015/03/06/navalny/', @@ -1031,6 +1047,31 @@ class GenericIE(InfoExtractor): 'ext': 'mp4', 'title': 'cinemasnob', }, + }, + # BrightcoveInPageEmbed embed + { + 'url': 'http://www.geekandsundry.com/tabletop-bonus-wils-final-thoughts-on-dread/', + 'info_dict': { + 'id': '4238694884001', + 'ext': 'flv', + 'title': 'Tabletop: Dread, Last Thoughts', + 'description': 'Tabletop: Dread, Last Thoughts', + 'duration': 51690, + }, + }, + # JWPlayer with M3U8 + { + 'url': 
'http://ren.tv/novosti/2015-09-25/sluchaynyy-prohozhiy-poymal-avtougonshchika-v-murmanske-video', + 'info_dict': { + 'id': 'playlist', + 'ext': 'mp4', + 'title': 'Случайный прохожий поймал автоугонщика в Мурманске. ВИДЕО | РЕН ТВ', + 'uploader': 'ren.tv', + }, + 'params': { + # m3u8 downloads + 'skip_download': True, + } } ] @@ -1174,7 +1215,7 @@ class GenericIE(InfoExtractor): full_response = None if head_response is False: - request = compat_urllib_request.Request(url) + request = sanitized_Request(url) request.add_header('Accept-Encoding', '*') full_response = self._request_webpage(request, video_id) head_response = full_response @@ -1203,7 +1244,7 @@ class GenericIE(InfoExtractor): '%s on generic information extractor.' % ('Forcing' if force else 'Falling back')) if not full_response: - request = compat_urllib_request.Request(url) + request = sanitized_Request(url) # Some webservers may serve compressed content of rather big size (e.g. gzipped flac) # making it impossible to download only chunk of the file (yet we need only 512kB to # test whether it's HTML or not). 
According to youtube-dl default Accept-Encoding @@ -1290,14 +1331,14 @@ class GenericIE(InfoExtractor): return self.playlist_result( urlrs, playlist_id=video_id, playlist_title=video_title) - # Look for BrightCove: - bc_urls = BrightcoveIE._extract_brightcove_urls(webpage) + # Look for Brightcove Legacy Studio embeds + bc_urls = BrightcoveLegacyIE._extract_brightcove_urls(webpage) if bc_urls: self.to_screen('Brightcove video detected.') entries = [{ '_type': 'url', 'url': smuggle_url(bc_url, {'Referer': url}), - 'ie_key': 'Brightcove' + 'ie_key': 'BrightcoveLegacy' } for bc_url in bc_urls] return { @@ -1307,6 +1348,11 @@ class GenericIE(InfoExtractor): 'entries': entries, } + # Look for Brightcove New Studio embeds + bc_urls = BrightcoveNewIE._extract_urls(webpage) + if bc_urls: + return _playlist_from_matches(bc_urls, ie='BrightcoveNew') + # Look for embedded rtl.nl player matches = re.findall( r'<iframe[^>]+?src="((?:https?:)?//(?:www\.)?rtl\.nl/system/videoplayer/[^"]+(?:video_)?embed[^"]+)"', @@ -1675,7 +1721,9 @@ class GenericIE(InfoExtractor): mobj = (re.search(r"(?s)kWidget\.(?:thumb)?[Ee]mbed\(\{.*?'wid'\s*:\s*'_?(?P<partner_id>[^']+)',.*?'entry_?[Ii]d'\s*:\s*'(?P<id>[^']+)',", webpage) or re.search(r'(?s)(?P<q1>["\'])(?:https?:)?//cdnapi(?:sec)?\.kaltura\.com/.*?(?:p|partner_id)/(?P<partner_id>\d+).*?(?P=q1).*?entry_?[Ii]d\s*:\s*(?P<q2>["\'])(?P<id>.+?)(?P=q2)', webpage)) if mobj is not None: - return self.url_result('kaltura:%(partner_id)s:%(id)s' % mobj.groupdict(), 'Kaltura') + return self.url_result(smuggle_url( + 'kaltura:%(partner_id)s:%(id)s' % mobj.groupdict(), + {'source_url': url}), 'Kaltura') # Look for Eagle.Platform embeds mobj = re.search( @@ -1720,7 +1768,7 @@ class GenericIE(InfoExtractor): # Look for UDN embeds mobj = re.search( - r'<iframe[^>]+src="(?P<url>%s)"' % UDNEmbedIE._VALID_URL, webpage) + r'<iframe[^>]+src="(?P<url>%s)"' % UDNEmbedIE._PROTOCOL_RELATIVE_VALID_URL, webpage) if mobj is not None: return self.url_result( compat_urlparse.urljoin(url, mobj.group('url')), 'UDNEmbed') @@ -1840,6 +1888,7 @@ class
GenericIE(InfoExtractor): entries = [] for video_url in found: + video_url = video_url.replace('\\/', '/') video_url = compat_urlparse.urljoin(url, video_url) video_id = compat_urllib_parse_unquote(os.path.basename(video_url)) @@ -1851,25 +1900,24 @@ class GenericIE(InfoExtractor): # here's a fun little line of code for you: video_id = os.path.splitext(video_id)[0] + entry_info_dict = { + 'id': video_id, + 'uploader': video_uploader, + 'title': video_title, + 'age_limit': age_limit, + } + ext = determine_ext(video_url) if ext == 'smil': - entries.append({ - 'id': video_id, - 'formats': self._extract_smil_formats(video_url, video_id), - 'uploader': video_uploader, - 'title': video_title, - 'age_limit': age_limit, - }) + entry_info_dict['formats'] = self._extract_smil_formats(video_url, video_id) elif ext == 'xspf': return self.playlist_result(self._extract_xspf_playlist(video_url, video_id), video_id) + elif ext == 'm3u8': + entry_info_dict['formats'] = self._extract_m3u8_formats(video_url, video_id, ext='mp4') else: - entries.append({ - 'id': video_id, - 'url': video_url, - 'uploader': video_uploader, - 'title': video_title, - 'age_limit': age_limit, - }) + entry_info_dict['url'] = video_url + + entries.append(entry_info_dict) if len(entries) == 1: return entries[0] diff --git a/youtube_dl/extractor/hearthisat.py b/youtube_dl/extractor/hearthisat.py index a19b31a..7d86986 100644 --- a/youtube_dl/extractor/hearthisat.py +++ b/youtube_dl/extractor/hearthisat.py @@ -4,12 +4,10 @@ from __future__ import unicode_literals import re from .common import InfoExtractor -from ..compat import ( - compat_urllib_request, - compat_urlparse, -) +from ..compat import compat_urlparse from ..utils import ( HEADRequest, + sanitized_Request, str_to_int, urlencode_postdata, urlhandle_detect_ext, @@ -47,7 +45,7 @@ class HearThisAtIE(InfoExtractor): r'intTrackId\s*=\s*(\d+)', webpage, 'track ID') payload = urlencode_postdata({'tracks[]': track_id}) - req = 
compat_urllib_request.Request(self._PLAYLIST_URL, payload) + req = sanitized_Request(self._PLAYLIST_URL, payload) req.add_header('Content-type', 'application/x-www-form-urlencoded') track = self._download_json(req, track_id, 'Downloading playlist')[0] diff --git a/youtube_dl/extractor/hotnewhiphop.py b/youtube_dl/extractor/hotnewhiphop.py index 651784b..31e2199 100644 --- a/youtube_dl/extractor/hotnewhiphop.py +++ b/youtube_dl/extractor/hotnewhiphop.py @@ -3,13 +3,11 @@ from __future__ import unicode_literals import base64 from .common import InfoExtractor -from ..compat import ( - compat_urllib_parse, - compat_urllib_request, -) +from ..compat import compat_urllib_parse from ..utils import ( ExtractorError, HEADRequest, + sanitized_Request, ) @@ -41,7 +39,7 @@ class HotNewHipHopIE(InfoExtractor): ('mediaType', 's'), ('mediaId', video_id), ]) - r = compat_urllib_request.Request( + r = sanitized_Request( 'http://www.hotnewhiphop.com/ajax/media/getActions/', data=reqdata) r.add_header('Content-Type', 'application/x-www-form-urlencoded') mkd = self._download_json( diff --git a/youtube_dl/extractor/hypem.py b/youtube_dl/extractor/hypem.py index aa0724a..cca3dd4 100644 --- a/youtube_dl/extractor/hypem.py +++ b/youtube_dl/extractor/hypem.py @@ -4,12 +4,10 @@ import json import time from .common import InfoExtractor -from ..compat import ( - compat_urllib_parse, - compat_urllib_request, -) +from ..compat import compat_urllib_parse from ..utils import ( ExtractorError, + sanitized_Request, ) @@ -32,7 +30,7 @@ class HypemIE(InfoExtractor): data = {'ax': 1, 'ts': time.time()} data_encoded = compat_urllib_parse.urlencode(data) complete_url = url + "?" 
+ data_encoded - request = compat_urllib_request.Request(complete_url) + request = sanitized_Request(complete_url) response, urlh = self._download_webpage_handle( request, track_id, 'Downloading webpage with the url') cookie = urlh.headers.get('Set-Cookie', '') @@ -52,7 +50,7 @@ class HypemIE(InfoExtractor): title = track['song'] serve_url = "http://hypem.com/serve/source/%s/%s" % (track_id, key) - request = compat_urllib_request.Request( + request = sanitized_Request( serve_url, '', {'Content-Type': 'application/json'}) request.add_header('cookie', cookie) song_data = self._download_json(request, track_id, 'Downloading metadata') diff --git a/youtube_dl/extractor/instagram.py b/youtube_dl/extractor/instagram.py index 3d78f78..c158f20 100644 --- a/youtube_dl/extractor/instagram.py +++ b/youtube_dl/extractor/instagram.py @@ -10,8 +10,8 @@ from ..utils import ( class InstagramIE(InfoExtractor): - _VALID_URL = r'https://instagram\.com/p/(?P<id>[\da-zA-Z]+)' - _TEST = { + _VALID_URL = r'https?://(?:www\.)?instagram\.com/p/(?P<id>[^/?#&]+)' + _TESTS = [{ 'url': 'https://instagram.com/p/aye83DjauH/?foo=bar#abc', 'md5': '0d2da106a9d2631273e192b372806516', 'info_dict': { @@ -21,7 +21,10 @@ class InstagramIE(InfoExtractor): 'title': 'Video by naomipq', 'description': 'md5:1f17f0ab29bd6fe2bfad705f58de3cb8', } - } + }, { + 'url': 'https://instagram.com/p/-Cmh1cukG2/', + 'only_matching': True, + }] def _real_extract(self, url): video_id = self._match_id(url) diff --git a/youtube_dl/extractor/iprima.py b/youtube_dl/extractor/iprima.py index 821c8ec..36baf32 100644 --- a/youtube_dl/extractor/iprima.py +++ b/youtube_dl/extractor/iprima.py @@ -6,12 +6,10 @@ from random import random from math import floor from .common import InfoExtractor -from ..compat import ( - compat_urllib_request, -) from ..utils import ( ExtractorError, remove_end, + sanitized_Request, ) @@ -61,7 +59,7 @@ class IPrimaIE(InfoExtractor): (floor(random() * 1073741824), floor(random() * 1073741824)) ) - req = 
compat_urllib_request.Request(player_url) + req = sanitized_Request(player_url) req.add_header('Referer', url) playerpage = self._download_webpage(req, video_id) diff --git a/youtube_dl/extractor/ivi.py b/youtube_dl/extractor/ivi.py index e825944..029878d 100644 --- a/youtube_dl/extractor/ivi.py +++ b/youtube_dl/extractor/ivi.py @@ -5,11 +5,9 @@ import re import json from .common import InfoExtractor -from ..compat import ( - compat_urllib_request, -) from ..utils import ( ExtractorError, + sanitized_Request, ) @@ -78,7 +76,7 @@ class IviIE(InfoExtractor): ] } - request = compat_urllib_request.Request(api_url, json.dumps(data)) + request = sanitized_Request(api_url, json.dumps(data)) video_json_page = self._download_webpage( request, video_id, 'Downloading video JSON') diff --git a/youtube_dl/extractor/kaltura.py b/youtube_dl/extractor/kaltura.py index 0dcd6cd..583b1a5 100644 --- a/youtube_dl/extractor/kaltura.py +++ b/youtube_dl/extractor/kaltura.py @@ -2,12 +2,18 @@ from __future__ import unicode_literals import re +import base64 from .common import InfoExtractor -from ..compat import compat_urllib_parse +from ..compat import ( + compat_urllib_parse, + compat_urlparse, +) from ..utils import ( + clean_html, ExtractorError, int_or_none, + unsmuggle_url, ) @@ -121,31 +127,47 @@ class KalturaIE(InfoExtractor): video_id, actions, note='Downloading video info JSON') def _real_extract(self, url): + url, smuggled_data = unsmuggle_url(url, {}) + mobj = re.match(self._VALID_URL, url) partner_id = mobj.group('partner_id_s') or mobj.group('partner_id') or mobj.group('partner_id_html5') entry_id = mobj.group('id_s') or mobj.group('id') or mobj.group('id_html5') info, source_data = self._get_video_info(entry_id, partner_id) - formats = [{ - 'format_id': '%(fileExt)s-%(bitrate)s' % f, - 'ext': f['fileExt'], - 'tbr': f['bitrate'], - 'fps': f.get('frameRate'), - 'filesize_approx': int_or_none(f.get('size'), invscale=1024), - 'container': f.get('containerFormat'), - 'vcodec': 
f.get('videoCodecId'), - 'height': f.get('height'), - 'width': f.get('width'), - 'url': '%s/flavorId/%s' % (info['dataUrl'], f['id']), - } for f in source_data['flavorAssets']] + source_url = smuggled_data.get('source_url') + if source_url: + referrer = base64.b64encode( + '://'.join(compat_urlparse.urlparse(source_url)[:2]) + .encode('utf-8')).decode('utf-8') + else: + referrer = None + + formats = [] + for f in source_data['flavorAssets']: + video_url = '%s/flavorId/%s' % (info['dataUrl'], f['id']) + if referrer: + video_url += '?referrer=%s' % referrer + formats.append({ + 'format_id': '%(fileExt)s-%(bitrate)s' % f, + 'ext': f.get('fileExt'), + 'tbr': int_or_none(f['bitrate']), + 'fps': int_or_none(f.get('frameRate')), + 'filesize_approx': int_or_none(f.get('size'), invscale=1024), + 'container': f.get('containerFormat'), + 'vcodec': f.get('videoCodecId'), + 'height': int_or_none(f.get('height')), + 'width': int_or_none(f.get('width')), + 'url': video_url, + }) + self._check_formats(formats, entry_id) self._sort_formats(formats) return { 'id': entry_id, 'title': info['name'], 'formats': formats, - 'description': info.get('description'), + 'description': clean_html(info.get('description')), 'thumbnail': info.get('thumbnailUrl'), 'duration': info.get('duration'), 'timestamp': info.get('createdAt'), diff --git a/youtube_dl/extractor/keezmovies.py b/youtube_dl/extractor/keezmovies.py index 82eddec..d79261b 100644 --- a/youtube_dl/extractor/keezmovies.py +++ b/youtube_dl/extractor/keezmovies.py @@ -4,10 +4,8 @@ import os import re from .common import InfoExtractor -from ..compat import ( - compat_urllib_parse_urlparse, - compat_urllib_request, -) +from ..compat import compat_urllib_parse_urlparse +from ..utils import sanitized_Request class KeezMoviesIE(InfoExtractor): @@ -26,7 +24,7 @@ class KeezMoviesIE(InfoExtractor): def _real_extract(self, url): video_id = self._match_id(url) - req = compat_urllib_request.Request(url) + req = sanitized_Request(url) 
req.add_header('Cookie', 'age_verified=1') webpage = self._download_webpage(req, video_id) diff --git a/youtube_dl/extractor/letv.py b/youtube_dl/extractor/letv.py index effd9eb..be64800 100644 --- a/youtube_dl/extractor/letv.py +++ b/youtube_dl/extractor/letv.py @@ -8,13 +8,13 @@ import time from .common import InfoExtractor from ..compat import ( compat_urllib_parse, - compat_urllib_request, compat_ord, ) from ..utils import ( determine_ext, ExtractorError, parse_iso8601, + sanitized_Request, int_or_none, encode_data_uri, ) @@ -114,7 +114,7 @@ class LetvIE(InfoExtractor): 'tkey': self.calc_time_key(int(time.time())), 'domain': 'www.letv.com' } - play_json_req = compat_urllib_request.Request( + play_json_req = sanitized_Request( 'http://api.letv.com/mms/out/video/playJson?' + compat_urllib_parse.urlencode(params) ) cn_verification_proxy = self._downloader.params.get('cn_verification_proxy') diff --git a/youtube_dl/extractor/lynda.py b/youtube_dl/extractor/lynda.py index 9a207b2..d4e1ae9 100644 --- a/youtube_dl/extractor/lynda.py +++ b/youtube_dl/extractor/lynda.py @@ -7,12 +7,12 @@ from .common import InfoExtractor from ..compat import ( compat_str, compat_urllib_parse, - compat_urllib_request, ) from ..utils import ( ExtractorError, clean_html, int_or_none, + sanitized_Request, ) @@ -25,7 +25,7 @@ class LyndaBaseIE(InfoExtractor): self._login() def _login(self): - (username, password) = self._get_login_info() + username, password = self._get_login_info() if username is None: return @@ -35,7 +35,7 @@ class LyndaBaseIE(InfoExtractor): 'remember': 'false', 'stayPut': 'false' } - request = compat_urllib_request.Request( + request = sanitized_Request( self._LOGIN_URL, compat_urllib_parse.urlencode(login_form).encode('utf-8')) login_page = self._download_webpage( request, None, 'Logging in as %s' % username) @@ -64,7 +64,7 @@ class LyndaBaseIE(InfoExtractor): 'remember': 'false', 'stayPut': 'false', } - request = compat_urllib_request.Request( + request = 
sanitized_Request( self._LOGIN_URL, compat_urllib_parse.urlencode(confirm_form).encode('utf-8')) login_page = self._download_webpage( request, None, @@ -83,6 +83,10 @@ class LyndaBaseIE(InfoExtractor): raise ExtractorError('Unable to log in') def _logout(self): + username, _ = self._get_login_info() + if username is None: + return + self._download_webpage( 'http://www.lynda.com/ajax/logout.aspx', None, 'Logging out', 'Unable to log out', fatal=False) diff --git a/youtube_dl/extractor/metacafe.py b/youtube_dl/extractor/metacafe.py index 6e2e73a..3c786a3 100644 --- a/youtube_dl/extractor/metacafe.py +++ b/youtube_dl/extractor/metacafe.py @@ -7,12 +7,12 @@ from ..compat import ( compat_parse_qs, compat_urllib_parse, compat_urllib_parse_unquote, - compat_urllib_request, ) from ..utils import ( determine_ext, ExtractorError, int_or_none, + sanitized_Request, ) @@ -117,7 +117,7 @@ class MetacafeIE(InfoExtractor): 'filters': '0', 'submit': "Continue - I'm over 18", } - request = compat_urllib_request.Request(self._FILTER_POST, compat_urllib_parse.urlencode(disclaimer_form)) + request = sanitized_Request(self._FILTER_POST, compat_urllib_parse.urlencode(disclaimer_form)) request.add_header('Content-Type', 'application/x-www-form-urlencoded') self.report_age_confirmation() self._download_webpage(request, None, False, 'Unable to confirm age') @@ -142,7 +142,7 @@ class MetacafeIE(InfoExtractor): return self.url_result('theplatform:%s' % ext_id, 'ThePlatform') # Retrieve video webpage to extract further information - req = compat_urllib_request.Request('http://www.metacafe.com/watch/%s/' % video_id) + req = sanitized_Request('http://www.metacafe.com/watch/%s/' % video_id) # AnyClip videos require the flashversion cookie so that we get the link # to the mp4 file diff --git a/youtube_dl/extractor/minhateca.py b/youtube_dl/extractor/minhateca.py index 14934b7..e46b23a 100644 --- a/youtube_dl/extractor/minhateca.py +++ b/youtube_dl/extractor/minhateca.py @@ -2,14 +2,12 @@ from 
__future__ import unicode_literals from .common import InfoExtractor -from ..compat import ( - compat_urllib_parse, - compat_urllib_request, -) +from ..compat import compat_urllib_parse from ..utils import ( int_or_none, parse_duration, parse_filesize, + sanitized_Request, ) @@ -39,7 +37,7 @@ class MinhatecaIE(InfoExtractor): ('fileId', video_id), ('__RequestVerificationToken', token), ] - req = compat_urllib_request.Request( + req = sanitized_Request( 'http://minhateca.com.br/action/License/Download', data=compat_urllib_parse.urlencode(token_data)) req.add_header('Content-Type', 'application/x-www-form-urlencoded') diff --git a/youtube_dl/extractor/miomio.py b/youtube_dl/extractor/miomio.py index ce391c7..170ebd9 100644 --- a/youtube_dl/extractor/miomio.py +++ b/youtube_dl/extractor/miomio.py @@ -4,11 +4,11 @@ from __future__ import unicode_literals import random from .common import InfoExtractor -from ..compat import compat_urllib_request from ..utils import ( xpath_text, int_or_none, ExtractorError, + sanitized_Request, ) @@ -63,7 +63,7 @@ class MioMioIE(InfoExtractor): 'http://www.miomio.tv/mioplayer/mioplayerconfigfiles/xml.php?id=%s&r=%s' % (id, random.randint(100, 999)), video_id) - vid_config_request = compat_urllib_request.Request( + vid_config_request = sanitized_Request( 'http://www.miomio.tv/mioplayer/mioplayerconfigfiles/sina.php?{0}'.format(xml_config), headers=http_headers) diff --git a/youtube_dl/extractor/moevideo.py b/youtube_dl/extractor/moevideo.py index 5a66302..d930b96 100644 --- a/youtube_dl/extractor/moevideo.py +++ b/youtube_dl/extractor/moevideo.py @@ -5,13 +5,11 @@ import json import re from .common import InfoExtractor -from ..compat import ( - compat_urllib_parse, - compat_urllib_request, -) +from ..compat import compat_urllib_parse from ..utils import ( ExtractorError, int_or_none, + sanitized_Request, ) @@ -80,7 +78,7 @@ class MoeVideoIE(InfoExtractor): ] r_json = json.dumps(r) post = compat_urllib_parse.urlencode({'r': r_json}) - req 
= compat_urllib_request.Request(self._API_URL, post) + req = sanitized_Request(self._API_URL, post) req.add_header('Content-type', 'application/x-www-form-urlencoded') response = self._download_json(req, video_id) diff --git a/youtube_dl/extractor/mofosex.py b/youtube_dl/extractor/mofosex.py index 9bf99a5..f8226cb 100644 --- a/youtube_dl/extractor/mofosex.py +++ b/youtube_dl/extractor/mofosex.py @@ -7,8 +7,8 @@ from .common import InfoExtractor from ..compat import ( compat_urllib_parse_unquote, compat_urllib_parse_urlparse, - compat_urllib_request, ) +from ..utils import sanitized_Request class MofosexIE(InfoExtractor): @@ -29,7 +29,7 @@ class MofosexIE(InfoExtractor): video_id = mobj.group('id') url = 'http://www.' + mobj.group('url') - req = compat_urllib_request.Request(url) + req = sanitized_Request(url) req.add_header('Cookie', 'age_verified=1') webpage = self._download_webpage(req, video_id) diff --git a/youtube_dl/extractor/moniker.py b/youtube_dl/extractor/moniker.py index 7c0c4e5..f6bf94f 100644 --- a/youtube_dl/extractor/moniker.py +++ b/youtube_dl/extractor/moniker.py @@ -5,13 +5,11 @@ import os.path import re from .common import InfoExtractor -from ..compat import ( - compat_urllib_parse, - compat_urllib_request, -) +from ..compat import compat_urllib_parse from ..utils import ( ExtractorError, remove_start, + sanitized_Request, ) @@ -81,7 +79,7 @@ class MonikerIE(InfoExtractor): orig_webpage, 'builtin URL', default=None, group='url') if builtin_url: - req = compat_urllib_request.Request(builtin_url) + req = sanitized_Request(builtin_url) req.add_header('Referer', url) webpage = self._download_webpage(req, video_id, 'Downloading builtin page') title = self._og_search_title(orig_webpage).strip() @@ -94,7 +92,7 @@ class MonikerIE(InfoExtractor): headers = { b'Content-Type': b'application/x-www-form-urlencoded', } - req = compat_urllib_request.Request(url, post, headers) + req = sanitized_Request(url, post, headers) webpage = self._download_webpage( req, 
video_id, note='Downloading video page ...') diff --git a/youtube_dl/extractor/mooshare.py b/youtube_dl/extractor/mooshare.py index 7603af5..7cc7f05 100644 --- a/youtube_dl/extractor/mooshare.py +++ b/youtube_dl/extractor/mooshare.py @@ -3,12 +3,10 @@ from __future__ import unicode_literals import re from .common import InfoExtractor -from ..compat import ( - compat_urllib_request, - compat_urllib_parse, -) +from ..compat import compat_urllib_parse from ..utils import ( ExtractorError, + sanitized_Request, ) @@ -59,7 +57,7 @@ class MooshareIE(InfoExtractor): 'hash': hash_key, } - request = compat_urllib_request.Request( + request = sanitized_Request( 'http://mooshare.biz/%s' % video_id, compat_urllib_parse.urlencode(download_form)) request.add_header('Content-Type', 'application/x-www-form-urlencoded') diff --git a/youtube_dl/extractor/movieclips.py b/youtube_dl/extractor/movieclips.py index b8c43a1..1564cb7 100644 --- a/youtube_dl/extractor/movieclips.py +++ b/youtube_dl/extractor/movieclips.py @@ -2,9 +2,7 @@ from __future__ import unicode_literals from .common import InfoExtractor -from ..compat import ( - compat_urllib_request, -) +from ..utils import sanitized_Request class MovieClipsIE(InfoExtractor): @@ -25,7 +23,7 @@ class MovieClipsIE(InfoExtractor): def _real_extract(self, url): display_id = self._match_id(url) - req = compat_urllib_request.Request(url) + req = sanitized_Request(url) # it doesn't work if it thinks the browser it's too old req.add_header('User-Agent', 'Mozilla/5.0 (X11; Linux x86_64; rv:10.0) Gecko/20150101 Firefox/43.0 (Chrome)') webpage = self._download_webpage(req, display_id) diff --git a/youtube_dl/extractor/mtv.py b/youtube_dl/extractor/mtv.py index 302c9bf..d887583 100644 --- a/youtube_dl/extractor/mtv.py +++ b/youtube_dl/extractor/mtv.py @@ -5,7 +5,6 @@ import re from .common import InfoExtractor from ..compat import ( compat_urllib_parse, - compat_urllib_request, compat_str, ) from ..utils import ( @@ -13,6 +12,7 @@ from ..utils 
import ( find_xpath_attr, fix_xml_ampersands, HEADRequest, + sanitized_Request, unescapeHTML, url_basename, RegexNotFoundError, @@ -53,7 +53,7 @@ class MTVServicesInfoExtractor(InfoExtractor): def _extract_mobile_video_formats(self, mtvn_id): webpage_url = self._MOBILE_TEMPLATE % mtvn_id - req = compat_urllib_request.Request(webpage_url) + req = sanitized_Request(webpage_url) # Otherwise we get a webpage that would execute some javascript req.add_header('User-Agent', 'curl/7') webpage = self._download_webpage(req, mtvn_id, diff --git a/youtube_dl/extractor/myvideo.py b/youtube_dl/extractor/myvideo.py index c96f472..36ab388 100644 --- a/youtube_dl/extractor/myvideo.py +++ b/youtube_dl/extractor/myvideo.py @@ -11,10 +11,10 @@ from ..compat import ( compat_ord, compat_urllib_parse, compat_urllib_parse_unquote, - compat_urllib_request, ) from ..utils import ( ExtractorError, + sanitized_Request, ) @@ -83,7 +83,7 @@ class MyVideoIE(InfoExtractor): mobj = re.search(r'data-video-service="/service/data/video/%s/config' % video_id, webpage) if mobj is not None: - request = compat_urllib_request.Request('http://www.myvideo.de/service/data/video/%s/config' % video_id, '') + request = sanitized_Request('http://www.myvideo.de/service/data/video/%s/config' % video_id, '') response = self._download_webpage(request, video_id, 'Downloading video info') info = json.loads(base64.b64decode(response).decode('utf-8')) diff --git a/youtube_dl/extractor/neteasemusic.py b/youtube_dl/extractor/neteasemusic.py index a8e0a64..15eca82 100644 --- a/youtube_dl/extractor/neteasemusic.py +++ b/youtube_dl/extractor/neteasemusic.py @@ -8,11 +8,11 @@ import re from .common import InfoExtractor from ..compat import ( - compat_urllib_request, compat_urllib_parse, compat_str, compat_itertools_count, ) +from ..utils import sanitized_Request class NetEaseMusicBaseIE(InfoExtractor): @@ -40,7 +40,7 @@ class NetEaseMusicBaseIE(InfoExtractor): if not details: continue formats.append({ - 'url': 
'http://m1.music.126.net/%s/%s.%s' % + 'url': 'http://m5.music.126.net/%s/%s.%s' % (cls._encrypt(details['dfsId']), details['dfsId'], details['extension']), 'ext': details.get('extension'), @@ -56,7 +56,7 @@ class NetEaseMusicBaseIE(InfoExtractor): return int(round(ms / 1000.0)) def query_api(self, endpoint, video_id, note): - req = compat_urllib_request.Request('%s%s' % (self._API_BASE, endpoint)) + req = sanitized_Request('%s%s' % (self._API_BASE, endpoint)) req.add_header('Referer', self._API_BASE) return self._download_json(req, video_id, note) diff --git a/youtube_dl/extractor/nfb.py b/youtube_dl/extractor/nfb.py index ea07725..5bd15f7 100644 --- a/youtube_dl/extractor/nfb.py +++ b/youtube_dl/extractor/nfb.py @@ -1,10 +1,8 @@ from __future__ import unicode_literals from .common import InfoExtractor -from ..compat import ( - compat_urllib_request, - compat_urllib_parse, -) +from ..compat import compat_urllib_parse +from ..utils import sanitized_Request class NFBIE(InfoExtractor): @@ -40,8 +38,9 @@ class NFBIE(InfoExtractor): uploader = self._html_search_regex(r'([^<]+)', page, 'director name', fatal=False) - request = compat_urllib_request.Request('https://www.nfb.ca/film/%s/player_config' % video_id, - compat_urllib_parse.urlencode({'getConfig': 'true'}).encode('ascii')) + request = sanitized_Request( + 'https://www.nfb.ca/film/%s/player_config' % video_id, + compat_urllib_parse.urlencode({'getConfig': 'true'}).encode('ascii')) request.add_header('Content-Type', 'application/x-www-form-urlencoded') request.add_header('X-NFB-Referer', 'http://www.nfb.ca/medias/flash/NFBVideoPlayer.swf') diff --git a/youtube_dl/extractor/niconico.py b/youtube_dl/extractor/niconico.py index bda1cff..586e52a 100644 --- a/youtube_dl/extractor/niconico.py +++ b/youtube_dl/extractor/niconico.py @@ -8,7 +8,6 @@ import datetime from .common import InfoExtractor from ..compat import ( compat_urllib_parse, - compat_urllib_request, compat_urlparse, ) from ..utils import ( @@ -17,6 +16,7 
@@ from ..utils import ( int_or_none, parse_duration, parse_iso8601, + sanitized_Request, xpath_text, determine_ext, ) @@ -102,7 +102,7 @@ class NiconicoIE(InfoExtractor): 'password': password, } login_data = compat_urllib_parse.urlencode(encode_dict(login_form_strs)).encode('utf-8') - request = compat_urllib_request.Request( + request = sanitized_Request( 'https://secure.nicovideo.jp/secure/login', login_data) login_results = self._download_webpage( request, None, note='Logging in', errnote='Unable to log in') @@ -145,7 +145,7 @@ class NiconicoIE(InfoExtractor): 'k': thumb_play_key, 'v': video_id }) - flv_info_request = compat_urllib_request.Request( + flv_info_request = sanitized_Request( 'http://ext.nicovideo.jp/thumb_watch', flv_info_data, {'Content-Type': 'application/x-www-form-urlencoded'}) flv_info_webpage = self._download_webpage( diff --git a/youtube_dl/extractor/noco.py b/youtube_dl/extractor/noco.py index a53e27b..76bd21e 100644 --- a/youtube_dl/extractor/noco.py +++ b/youtube_dl/extractor/noco.py @@ -9,7 +9,6 @@ from .common import InfoExtractor from ..compat import ( compat_str, compat_urllib_parse, - compat_urllib_request, ) from ..utils import ( clean_html, @@ -17,6 +16,7 @@ from ..utils import ( int_or_none, float_or_none, parse_iso8601, + sanitized_Request, ) @@ -74,7 +74,7 @@ class NocoIE(InfoExtractor): 'username': username, 'password': password, } - request = compat_urllib_request.Request(self._LOGIN_URL, compat_urllib_parse.urlencode(login_form)) + request = sanitized_Request(self._LOGIN_URL, compat_urllib_parse.urlencode(login_form)) request.add_header('Content-Type', 'application/x-www-form-urlencoded; charset=UTF-8') login = self._download_json(request, None, 'Logging in as %s' % username) diff --git a/youtube_dl/extractor/nosvideo.py b/youtube_dl/extractor/nosvideo.py index f5ef856..eab816e 100644 --- a/youtube_dl/extractor/nosvideo.py +++ b/youtube_dl/extractor/nosvideo.py @@ -4,11 +4,9 @@ from __future__ import unicode_literals import re 
from .common import InfoExtractor -from ..compat import ( - compat_urllib_request, -) from ..utils import ( ExtractorError, + sanitized_Request, urlencode_postdata, xpath_text, xpath_with_ns, @@ -41,7 +39,7 @@ class NosVideoIE(InfoExtractor): 'op': 'download1', 'method_free': 'Continue to Video', } - req = compat_urllib_request.Request(url, urlencode_postdata(fields)) + req = sanitized_Request(url, urlencode_postdata(fields)) req.add_header('Content-type', 'application/x-www-form-urlencoded') webpage = self._download_webpage(req, video_id, 'Downloading download page') diff --git a/youtube_dl/extractor/novamov.py b/youtube_dl/extractor/novamov.py index 04d7798..6163e88 100644 --- a/youtube_dl/extractor/novamov.py +++ b/youtube_dl/extractor/novamov.py @@ -3,11 +3,13 @@ from __future__ import unicode_literals import re from .common import InfoExtractor -from ..compat import ( - compat_urlparse, -) +from ..compat import compat_urlparse from ..utils import ( ExtractorError, + NO_DEFAULT, + encode_dict, + sanitized_Request, + urlencode_postdata, ) @@ -38,19 +40,40 @@ class NovaMovIE(InfoExtractor): } def _real_extract(self, url): - mobj = re.match(self._VALID_URL, url) - video_id = mobj.group('id') + video_id = self._match_id(url) - page = self._download_webpage( - 'http://%s/video/%s' % (self._HOST, video_id), video_id, 'Downloading video page') + url = 'http://%s/video/%s' % (self._HOST, video_id) - if re.search(self._FILE_DELETED_REGEX, page) is not None: - raise ExtractorError('Video %s does not exist' % video_id, expected=True) + webpage = self._download_webpage( + url, video_id, 'Downloading video page') - filekey = self._search_regex(self._FILEKEY_REGEX, page, 'filekey') + if re.search(self._FILE_DELETED_REGEX, webpage) is not None: + raise ExtractorError('Video %s does not exist' % video_id, expected=True) - title = self._html_search_regex(self._TITLE_REGEX, page, 'title', fatal=False) - description = self._html_search_regex(self._DESCRIPTION_REGEX, page, 
'description', default='', fatal=False) + def extract_filekey(default=NO_DEFAULT): + return self._search_regex( + self._FILEKEY_REGEX, webpage, 'filekey', default=default) + + filekey = extract_filekey(default=None) + + if not filekey: + fields = self._hidden_inputs(webpage) + post_url = self._search_regex( + r'<form[^>]+action=(["\'])(?P<url>.+?)\1', webpage, + 'post url', default=url, group='url') + if not post_url.startswith('http'): + post_url = compat_urlparse.urljoin(url, post_url) + request = sanitized_Request( + post_url, urlencode_postdata(encode_dict(fields))) + request.add_header('Content-Type', 'application/x-www-form-urlencoded') + request.add_header('Referer', post_url) + webpage = self._download_webpage( + request, video_id, 'Downloading continue to the video page') + + filekey = extract_filekey() + + title = self._html_search_regex(self._TITLE_REGEX, webpage, 'title', fatal=False) + description = self._html_search_regex(self._DESCRIPTION_REGEX, webpage, 'description', default='', fatal=False) api_response = self._download_webpage( 'http://%s/api/player.api.php?key=%s&file=%s' % (self._HOST, filekey, video_id), video_id, diff --git a/youtube_dl/extractor/nowness.py b/youtube_dl/extractor/nowness.py index b97f62f..d480fb5 100644 --- a/youtube_dl/extractor/nowness.py +++ b/youtube_dl/extractor/nowness.py @@ -1,12 +1,12 @@ # encoding: utf-8 from __future__ import unicode_literals -from .brightcove import BrightcoveIE +from .brightcove import BrightcoveLegacyIE from .common import InfoExtractor -from ..utils import ExtractorError -from ..compat import ( - compat_str, - compat_urllib_request, +from ..compat import compat_str +from ..utils import ( + ExtractorError, + sanitized_Request, ) @@ -22,10 +22,10 @@ class NownessBaseIE(InfoExtractor): 'http://www.nowness.com/iframe?id=%s' % video_id, video_id, note='Downloading player JavaScript', errnote='Unable to download player JavaScript') - bc_url = BrightcoveIE._extract_brightcove_url(player_code) + bc_url = 
BrightcoveLegacyIE._extract_brightcove_url(player_code) if bc_url is None: raise ExtractorError('Could not find player definition') - return self.url_result(bc_url, 'Brightcove') + return self.url_result(bc_url, 'BrightcoveLegacy') elif source == 'vimeo': return self.url_result('http://vimeo.com/%s' % video_id, 'Vimeo') elif source == 'youtube': @@ -37,7 +37,7 @@ class NownessBaseIE(InfoExtractor): def _api_request(self, url, request_path): display_id = self._match_id(url) - request = compat_urllib_request.Request( + request = sanitized_Request( 'http://api.nowness.com/api/' + request_path % display_id, headers={ 'X-Nowness-Language': 'zh-cn' if 'cn.nowness.com' in url else 'en-us', diff --git a/youtube_dl/extractor/nowtv.py b/youtube_dl/extractor/nowtv.py index b0bdffc..67e34b2 100644 --- a/youtube_dl/extractor/nowtv.py +++ b/youtube_dl/extractor/nowtv.py @@ -1,6 +1,8 @@ # coding: utf-8 from __future__ import unicode_literals +import re + from .common import InfoExtractor from ..compat import compat_str from ..utils import ( @@ -13,8 +15,63 @@ from ..utils import ( ) -class NowTVIE(InfoExtractor): - _VALID_URL = r'https?://(?:www\.)?nowtv\.(?:de|at|ch)/(?:rtl|rtl2|rtlnitro|superrtl|ntv|vox)/(?P.+?)/(?:player|preview)' +class NowTVBaseIE(InfoExtractor): + _VIDEO_FIELDS = ( + 'id', 'title', 'free', 'geoblocked', 'articleLong', 'articleShort', + 'broadcastStartDate', 'seoUrl', 'duration', 'files', + 'format.defaultImage169Format', 'format.defaultImage169Logo') + + def _extract_video(self, info, display_id=None): + video_id = compat_str(info['id']) + + files = info['files'] + if not files: + if info.get('geoblocked', False): + raise ExtractorError( + 'Video %s is not available from your location due to geo restriction' % video_id, + expected=True) + if not info.get('free', True): + raise ExtractorError( + 'Video %s is not available for free' % video_id, expected=True) + + formats = [] + for item in files['items']: + if determine_ext(item['path']) != 'f4v': + continue 
+ app, play_path = remove_start(item['path'], '/').split('/', 1) + formats.append({ + 'url': 'rtmpe://fms.rtl.de', + 'app': app, + 'play_path': 'mp4:%s' % play_path, + 'ext': 'flv', + 'page_url': 'http://rtlnow.rtl.de', + 'player_url': 'http://cdn.static-fra.de/now/vodplayer.swf', + 'tbr': int_or_none(item.get('bitrate')), + }) + self._sort_formats(formats) + + title = info['title'] + description = info.get('articleLong') or info.get('articleShort') + timestamp = parse_iso8601(info.get('broadcastStartDate'), ' ') + duration = parse_duration(info.get('duration')) + + f = info.get('format', {}) + thumbnail = f.get('defaultImage169Format') or f.get('defaultImage169Logo') + + return { + 'id': video_id, + 'display_id': display_id or info.get('seoUrl'), + 'title': title, + 'description': description, + 'thumbnail': thumbnail, + 'timestamp': timestamp, + 'duration': duration, + 'formats': formats, + } + + +class NowTVIE(NowTVBaseIE): + _VALID_URL = r'https?://(?:www\.)?nowtv\.(?:de|at|ch)/(?:rtl|rtl2|rtlnitro|superrtl|ntv|vox)/(?P<show_id>[^/]+)/(?:list/[^/]+/)?(?P<id>[^/]+)/(?:player|preview)' _TESTS = [{ # rtl @@ -23,7 +80,7 @@ class NowTVIE(InfoExtractor): 'id': '203519', 'display_id': 'bauer-sucht-frau/die-neuen-bauern-und-eine-hochzeit', 'ext': 'flv', - 'title': 'Die neuen Bauern und eine Hochzeit', + 'title': 'Inka Bause stellt die neuen Bauern vor', 'description': 'md5:e234e1ed6d63cf06be5c070442612e7e', 'thumbnail': 're:^https?://.*\.jpg$', 'timestamp': 1432580700, @@ -136,58 +193,65 @@ class NowTVIE(InfoExtractor): }] def _real_extract(self, url): - display_id = self._match_id(url) - display_id_split = display_id.split('/') - if len(display_id) > 2: - display_id = '/'.join((display_id_split[0], display_id_split[-1])) + mobj = re.match(self._VALID_URL, url) + display_id = '%s/%s' % (mobj.group('show_id'), mobj.group('id')) info = self._download_json( - 
'https://api.nowtv.de/v3/movies/%s?fields=id,title,free,geoblocked,articleLong,articleShort,broadcastStartDate,seoUrl,duration,format,files' % display_id, - display_id) + 'https://api.nowtv.de/v3/movies/%s?fields=%s' + % (display_id, ','.join(self._VIDEO_FIELDS)), display_id) - video_id = compat_str(info['id']) + return self._extract_video(info, display_id) - files = info['files'] - if not files: - if info.get('geoblocked', False): - raise ExtractorError( - 'Video %s is not available from your location due to geo restriction' % video_id, - expected=True) - if not info.get('free', True): - raise ExtractorError( - 'Video %s is not available for free' % video_id, expected=True) - formats = [] - for item in files['items']: - if determine_ext(item['path']) != 'f4v': - continue - app, play_path = remove_start(item['path'], '/').split('/', 1) - formats.append({ - 'url': 'rtmpe://fms.rtl.de', - 'app': app, - 'play_path': 'mp4:%s' % play_path, - 'ext': 'flv', - 'page_url': 'http://rtlnow.rtl.de', - 'player_url': 'http://cdn.static-fra.de/now/vodplayer.swf', - 'tbr': int_or_none(item.get('bitrate')), - }) - self._sort_formats(formats) +class NowTVListIE(NowTVBaseIE): + _VALID_URL = r'https?://(?:www\.)?nowtv\.(?:de|at|ch)/(?:rtl|rtl2|rtlnitro|superrtl|ntv|vox)/(?P<show_id>[^/]+)/list/(?P<id>[^?/#&]+)$' - title = info['title'] - description = info.get('articleLong') or info.get('articleShort') - timestamp = parse_iso8601(info.get('broadcastStartDate'), ' ') - duration = parse_duration(info.get('duration')) + _SHOW_FIELDS = ('title', ) + _SEASON_FIELDS = ('id', 'headline', 'seoheadline', ) - f = info.get('format', {}) - thumbnail = f.get('defaultImage169Format') or f.get('defaultImage169Logo') + _TESTS = [{ + 'url': 'http://www.nowtv.at/rtl/stern-tv/list/aktuell', + 'info_dict': { + 'id': '17006', + 'title': 'stern TV - Aktuell', + }, + 'playlist_count': 1, + }, { + 'url': 'http://www.nowtv.at/rtl/das-supertalent/list/free-staffel-8', + 'info_dict': { + 'id': '20716', + 'title': 'Das 
Supertalent - FREE Staffel 8', + }, + 'playlist_count': 14, + }] - return { - 'id': video_id, - 'display_id': display_id, - 'title': title, - 'description': description, - 'thumbnail': thumbnail, - 'timestamp': timestamp, - 'duration': duration, - 'formats': formats, - } + def _real_extract(self, url): + mobj = re.match(self._VALID_URL, url) + show_id = mobj.group('show_id') + season_id = mobj.group('id') + + fields = [] + fields.extend(self._SHOW_FIELDS) + fields.extend('formatTabs.%s' % field for field in self._SEASON_FIELDS) + fields.extend( + 'formatTabs.formatTabPages.container.movies.%s' % field + for field in self._VIDEO_FIELDS) + + list_info = self._download_json( + 'https://api.nowtv.de/v3/formats/seo?fields=%s&name=%s.php' + % (','.join(fields), show_id), + season_id) + + season = next( + season for season in list_info['formatTabs']['items'] + if season.get('seoheadline') == season_id) + + title = '%s - %s' % (list_info['title'], season['headline']) + + entries = [] + for container in season['formatTabPages']['items']: + for info in ((container.get('container') or {}).get('movies') or {}).get('items') or []: + entries.append(self._extract_video(info)) + + return self.playlist_result( + entries, compat_str(season.get('id') or season_id), title) diff --git a/youtube_dl/extractor/nowvideo.py b/youtube_dl/extractor/nowvideo.py index 17baa96..57ee3d3 100644 --- a/youtube_dl/extractor/nowvideo.py +++ b/youtube_dl/extractor/nowvideo.py @@ -7,9 +7,9 @@ class NowVideoIE(NovaMovIE): IE_NAME = 'nowvideo' IE_DESC = 'NowVideo' - _VALID_URL = NovaMovIE._VALID_URL_TEMPLATE % {'host': 'nowvideo\.(?:ch|ec|sx|eu|at|ag|co|li)'} + _VALID_URL = NovaMovIE._VALID_URL_TEMPLATE % {'host': 'nowvideo\.(?:to|ch|ec|sx|eu|at|ag|co|li)'} - _HOST = 'www.nowvideo.ch' + _HOST = 'www.nowvideo.to' _FILE_DELETED_REGEX = r'>This file no longer exists on our servers.<' _FILEKEY_REGEX = r'var fkzd="([^"]+)";' diff --git a/youtube_dl/extractor/nuvid.py b/youtube_dl/extractor/nuvid.py index 
57928f2..9fa7cef 100644 --- a/youtube_dl/extractor/nuvid.py +++ b/youtube_dl/extractor/nuvid.py @@ -3,11 +3,9 @@ from __future__ import unicode_literals import re from .common import InfoExtractor -from ..compat import ( - compat_urllib_request, -) from ..utils import ( parse_duration, + sanitized_Request, unified_strdate, ) @@ -33,7 +31,7 @@ class NuvidIE(InfoExtractor): formats = [] for dwnld_speed, format_id in [(0, '3gp'), (5, 'mp4')]: - request = compat_urllib_request.Request( + request = sanitized_Request( 'http://m.nuvid.com/play/%s' % video_id) request.add_header('Cookie', 'skip_download_page=1; dwnld_speed=%d; adv_show=1' % dwnld_speed) webpage = self._download_webpage( diff --git a/youtube_dl/extractor/patreon.py b/youtube_dl/extractor/patreon.py index 6cdc263..ec8876c 100644 --- a/youtube_dl/extractor/patreon.py +++ b/youtube_dl/extractor/patreon.py @@ -2,9 +2,7 @@ from __future__ import unicode_literals from .common import InfoExtractor -from ..utils import ( - js_to_json, -) +from ..utils import js_to_json class PatreonIE(InfoExtractor): @@ -65,7 +63,7 @@ class PatreonIE(InfoExtractor): 'password': password, } - request = compat_urllib_request.Request( + request = sanitized_Request( 'https://www.patreon.com/processLogin', compat_urllib_parse.urlencode(login_form).encode('utf-8') ) diff --git a/youtube_dl/extractor/pbs.py b/youtube_dl/extractor/pbs.py index 8fb9b18..b787e2a 100644 --- a/youtube_dl/extractor/pbs.py +++ b/youtube_dl/extractor/pbs.py @@ -22,7 +22,7 @@ class PBSIE(InfoExtractor): # Article with embedded player (or direct video) (?:www\.)?pbs\.org/(?:[^/]+/){2,5}(?P[^/]+?)(?:\.html)?/?(?:$|[?\#]) | # Player - video\.pbs\.org/(?:widget/)?partnerplayer/(?P[^/]+)/ + (?:video|player)\.pbs\.org/(?:widget/)?partnerplayer/(?P[^/]+)/ ) ''' @@ -170,6 +170,10 @@ class PBSIE(InfoExtractor): 'params': { 'skip_download': True, # requires ffmpeg }, + }, + { + 'url': 
'http://player.pbs.org/widget/partnerplayer/2365297708/?start=0&end=0&chapterbar=false&endscreen=false&topbar=true', + 'only_matching': True, } ] _ERRORS = { @@ -259,7 +263,7 @@ class PBSIE(InfoExtractor): return self.playlist_result(entries, display_id) info = self._download_json( - 'http://video.pbs.org/videoInfo/%s?format=json&type=partner' % video_id, + 'http://player.pbs.org/videoInfo/%s?format=json&type=partner' % video_id, display_id) formats = [] diff --git a/youtube_dl/extractor/periscope.py b/youtube_dl/extractor/periscope.py index 887c802..63cc764 100644 --- a/youtube_dl/extractor/periscope.py +++ b/youtube_dl/extractor/periscope.py @@ -2,16 +2,12 @@ from __future__ import unicode_literals from .common import InfoExtractor -from ..compat import ( - compat_urllib_parse, - compat_urllib_request, -) from ..utils import parse_iso8601 class PeriscopeIE(InfoExtractor): IE_DESC = 'Periscope' - _VALID_URL = r'https?://(?:www\.)?periscope\.tv/w/(?P<id>[^/?#]+)' + _VALID_URL = r'https?://(?:www\.)?periscope\.tv/[^/]+/(?P<id>[^/?#]+)' # Alive example URLs can be found here http://onperiscope.com/ _TESTS = [{ 'url': 'https://www.periscope.tv/w/aJUQnjY3MjA3ODF8NTYxMDIyMDl2zCg2pECBgwTqRpQuQD352EMPTKQjT4uqlM3cgWFA-g==', @@ -29,6 +25,9 @@ class PeriscopeIE(InfoExtractor): }, { 'url': 'https://www.periscope.tv/w/1ZkKzPbMVggJv', 'only_matching': True, + }, { + 'url': 'https://www.periscope.tv/bastaakanoggano/1OdKrlkZZjOJX', + 'only_matching': True, }] def _call_api(self, method, value): @@ -81,24 +80,3 @@ class PeriscopeIE(InfoExtractor): 'thumbnails': thumbnails, 'formats': formats, } - - -class QuickscopeIE(InfoExtractor): - IE_DESC = 'Quick Scope' - _VALID_URL = r'https?://watchonperiscope\.com/broadcast/(?P<id>\d+)' - _TEST = { - 'url': 'https://watchonperiscope.com/broadcast/56180087', - 'only_matching': True, - } - - def _real_extract(self, url): - broadcast_id = self._match_id(url) - request = compat_urllib_request.Request( - 'https://watchonperiscope.com/api/accessChannel',
compat_urllib_parse.urlencode({ - 'broadcast_id': broadcast_id, - 'entry_ticket': '', - 'from_push': 'false', - 'uses_sessions': 'true', - }).encode('utf-8')) - return self.url_result( - self._download_json(request, broadcast_id)['share_url'], 'Periscope') diff --git a/youtube_dl/extractor/played.py b/youtube_dl/extractor/played.py index 8a1c296..2856af9 100644 --- a/youtube_dl/extractor/played.py +++ b/youtube_dl/extractor/played.py @@ -5,12 +5,10 @@ import re import os.path from .common import InfoExtractor -from ..compat import ( - compat_urllib_parse, - compat_urllib_request, -) +from ..compat import compat_urllib_parse from ..utils import ( ExtractorError, + sanitized_Request, ) @@ -46,7 +44,7 @@ class PlayedIE(InfoExtractor): headers = { b'Content-Type': b'application/x-www-form-urlencoded', } - req = compat_urllib_request.Request(url, post, headers) + req = sanitized_Request(url, post, headers) webpage = self._download_webpage( req, video_id, note='Downloading video page ...') diff --git a/youtube_dl/extractor/pluralsight.py b/youtube_dl/extractor/pluralsight.py index fd32836..aa7dbcb 100644 --- a/youtube_dl/extractor/pluralsight.py +++ b/youtube_dl/extractor/pluralsight.py @@ -1,29 +1,35 @@ from __future__ import unicode_literals -import re import json +import random +import collections from .common import InfoExtractor from ..compat import ( compat_str, compat_urllib_parse, - compat_urllib_request, compat_urlparse, ) from ..utils import ( ExtractorError, int_or_none, parse_duration, + sanitized_Request, ) -class PluralsightIE(InfoExtractor): +class PluralsightBaseIE(InfoExtractor): + _API_BASE = 'http://app.pluralsight.com' + + +class PluralsightIE(PluralsightBaseIE): IE_NAME = 'pluralsight' - _VALID_URL = r'https?://(?:www\.)?pluralsight\.com/training/player\?author=(?P<author>[^&]+)&name=(?P<name>[^&]+)(?:&mode=live)?&clip=(?P<clip>\d+)&course=(?P<course>[^&]+)' - _LOGIN_URL = 'https://www.pluralsight.com/id/' + _VALID_URL =
r'https?://(?:(?:www|app)\.)?pluralsight\.com/training/player\?' + _LOGIN_URL = 'https://app.pluralsight.com/id/' + _NETRC_MACHINE = 'pluralsight' - _TEST = { + _TESTS = [{ 'url': 'http://www.pluralsight.com/training/player?author=mike-mckeown&name=hosting-sql-server-windows-azure-iaas-m7-mgmt&mode=live&clip=3&course=hosting-sql-server-windows-azure-iaas', 'md5': '4d458cf5cf4c593788672419a8dd4cf8', 'info_dict': { @@ -33,7 +39,14 @@ class PluralsightIE(InfoExtractor): 'duration': 338, }, 'skip': 'Requires pluralsight account credentials', - } + }, { + 'url': 'https://app.pluralsight.com/training/player?course=angularjs-get-started&author=scott-allen&name=angularjs-get-started-m1-introduction&clip=0&mode=live', + 'only_matching': True, + }, { + # available without pluralsight account + 'url': 'http://app.pluralsight.com/training/player?author=scott-allen&name=angularjs-get-started-m1-introduction&mode=live&clip=0&course=angularjs-get-started', + 'only_matching': True, + }] def _real_initialize(self): self._login() @@ -41,7 +54,7 @@ class PluralsightIE(InfoExtractor): def _login(self): (username, password) = self._get_login_info() if username is None: - self.raise_login_required('Pluralsight account is required') + return login_page = self._download_webpage( self._LOGIN_URL, None, 'Downloading login page') @@ -60,7 +73,7 @@ class PluralsightIE(InfoExtractor): if not post_url.startswith('http'): post_url = compat_urlparse.urljoin(self._LOGIN_URL, post_url) - request = compat_urllib_request.Request( + request = sanitized_Request( post_url, compat_urllib_parse.urlencode(login_form).encode('utf-8')) request.add_header('Content-Type', 'application/x-www-form-urlencoded') @@ -73,30 +86,47 @@ class PluralsightIE(InfoExtractor): if error: raise ExtractorError('Unable to login: %s' % error, expected=True) + if all(p not in response for p in ('__INITIAL_STATE__', '"currentUser"')): + raise ExtractorError('Unable to log in') + def _real_extract(self, url): - mobj = 
re.match(self._VALID_URL, url) - author = mobj.group('author') - name = mobj.group('name') - clip_id = mobj.group('clip') - course = mobj.group('course') + qs = compat_urlparse.parse_qs(compat_urlparse.urlparse(url).query) + + author = qs.get('author', [None])[0] + name = qs.get('name', [None])[0] + clip_id = qs.get('clip', [None])[0] + course = qs.get('course', [None])[0] + + if any(not f for f in (author, name, clip_id, course,)): + raise ExtractorError('Invalid URL', expected=True) display_id = '%s-%s' % (name, clip_id) webpage = self._download_webpage(url, display_id) - collection = self._parse_json( - self._search_regex( - r'moduleCollection\s*:\s*new\s+ModuleCollection\((\[.+?\])\s*,\s*\$rootScope\)', - webpage, 'modules'), - display_id) + modules = self._search_regex( + r'moduleCollection\s*:\s*new\s+ModuleCollection\((\[.+?\])\s*,\s*\$rootScope\)', + webpage, 'modules', default=None) + + if modules: + collection = self._parse_json(modules, display_id) + else: + # Webpage may be served in different layout (see + # https://github.com/rg3/youtube-dl/issues/7607) + collection = self._parse_json( + self._search_regex( + r'var\s+initialState\s*=\s*({.+?});\n', webpage, 'initial state'), + display_id)['course']['modules'] module, clip = None, None for module_ in collection: - if module_.get('moduleName') == name: + if name in (module_.get('moduleName'), module_.get('name')): module = module_ for clip_ in module_.get('clips', []): clip_index = clip_.get('clipIndex') + if clip_index is None: + clip_index = clip_.get('index') if clip_index is None: continue if compat_str(clip_index) == clip_id: @@ -112,13 +142,33 @@ class PluralsightIE(InfoExtractor): 'high': {'width': 1024, 'height': 768}, } + AllowedQuality = collections.namedtuple('AllowedQuality', ['ext', 'qualities']) + ALLOWED_QUALITIES = ( - ('webm', ('high',)), - ('mp4', ('low', 'medium', 'high',)), + AllowedQuality('webm', ('high',)), + AllowedQuality('mp4', ('low', 'medium', 'high',)), ) + # In order to 
minimize the number of calls to ViewClip API and reduce + # the probability of being throttled or banned by Pluralsight we will request + # only single format until formats listing was explicitly requested. + if self._downloader.params.get('listformats', False): + allowed_qualities = ALLOWED_QUALITIES + else: + def guess_allowed_qualities(): + req_format = self._downloader.params.get('format') or 'best' + req_format_split = req_format.split('-') + if len(req_format_split) > 1: + req_ext, req_quality = req_format_split + for allowed_quality in ALLOWED_QUALITIES: + if req_ext == allowed_quality.ext and req_quality in allowed_quality.qualities: + return (AllowedQuality(req_ext, (req_quality, )), ) + req_ext = 'webm' if self._downloader.params.get('prefer_free_formats') else 'mp4' + return (AllowedQuality(req_ext, ('high', )), ) + allowed_qualities = guess_allowed_qualities() + formats = [] - for ext, qualities in ALLOWED_QUALITIES: + for ext, qualities in allowed_qualities: for quality in qualities: f = QUALITIES[quality].copy() clip_post = { @@ -131,13 +181,24 @@ class PluralsightIE(InfoExtractor): 'mt': ext, 'q': '%dx%d' % (f['width'], f['height']), } - request = compat_urllib_request.Request( - 'http://www.pluralsight.com/training/Player/ViewClip', + request = sanitized_Request( + '%s/training/Player/ViewClip' % self._API_BASE, json.dumps(clip_post).encode('utf-8')) request.add_header('Content-Type', 'application/json;charset=utf-8') format_id = '%s-%s' % (ext, quality) clip_url = self._download_webpage( request, display_id, 'Downloading %s URL' % format_id, fatal=False) + + # Pluralsight tracks multiple sequential calls to ViewClip API and start + # to return 429 HTTP errors after some time (see + # https://github.com/rg3/youtube-dl/pull/6989). Moreover it may even lead + # to account ban (see https://github.com/rg3/youtube-dl/issues/6842). 
+ # To somewhat reduce the probability of these consequences + # we will sleep random amount of time before each call to ViewClip. + self._sleep( + random.randint(2, 5), display_id, + '%(video_id)s: Waiting for %(timeout)s seconds to avoid throttling') + if not clip_url: continue f.update({ @@ -163,10 +224,10 @@ class PluralsightIE(InfoExtractor): } -class PluralsightCourseIE(InfoExtractor): +class PluralsightCourseIE(PluralsightBaseIE): IE_NAME = 'pluralsight:course' - _VALID_URL = r'https?://(?:www\.)?pluralsight\.com/courses/(?P<id>[^/]+)' - _TEST = { + _VALID_URL = r'https?://(?:(?:www|app)\.)?pluralsight\.com/(?:library/)?courses/(?P<id>[^/]+)' + _TESTS = [{ # Free course from Pluralsight Starter Subscription for Microsoft TechNet # https://offers.pluralsight.com/technet?loc=zTS3z&prod=zOTprodz&tech=zOttechz&prog=zOTprogz&type=zSOz&media=zOTmediaz&country=zUSz 'url': 'http://www.pluralsight.com/courses/hosting-sql-server-windows-azure-iaas', @@ -176,7 +237,14 @@ class PluralsightCourseIE(InfoExtractor): 'description': 'md5:61b37e60f21c4b2f91dc621a977d0986', }, 'playlist_count': 31, - } + }, { + # available without pluralsight account + 'url': 'https://www.pluralsight.com/courses/angularjs-get-started', + 'only_matching': True, + }, { + 'url': 'https://app.pluralsight.com/library/courses/understanding-microsoft-azure-amazon-aws/table-of-contents', + 'only_matching': True, + }] def _real_extract(self, url): course_id = self._match_id(url) @@ -184,14 +252,14 @@ class PluralsightCourseIE(InfoExtractor): # TODO: PSM cookie course = self._download_json( - 'http://www.pluralsight.com/data/course/%s' % course_id, + '%s/data/course/%s' % (self._API_BASE, course_id), course_id, 'Downloading course JSON') title = course['title'] description = course.get('description') or course.get('shortDescription') course_data = self._download_json( - 'http://www.pluralsight.com/data/course/content/%s' % course_id, + '%s/data/course/content/%s' % (self._API_BASE, course_id), course_id,
'Downloading course data JSON') entries = [] @@ -201,7 +269,7 @@ class PluralsightCourseIE(InfoExtractor): if not player_parameters: continue entries.append(self.url_result( - 'http://www.pluralsight.com/training/player?%s' % player_parameters, + '%s/training/player?%s' % (self._API_BASE, player_parameters), 'Pluralsight')) return self.playlist_result(entries, course_id, title, description) diff --git a/youtube_dl/extractor/pornhd.py b/youtube_dl/extractor/pornhd.py index dbb2c3b..57c78ba 100644 --- a/youtube_dl/extractor/pornhd.py +++ b/youtube_dl/extractor/pornhd.py @@ -36,7 +36,8 @@ class PornHdIE(InfoExtractor): webpage = self._download_webpage(url, display_id or video_id) title = self._html_search_regex( - r'(.+) porn HD.+?', webpage, 'title') + [r']+class=["\']video-name["\'][^>]*>([^<]+)', + r'(.+?) - .*?[Pp]ornHD.*?'], webpage, 'title') description = self._html_search_regex( r'
([^<]+)
', webpage, 'description', fatal=False) view_count = int_or_none(self._html_search_regex( diff --git a/youtube_dl/extractor/pornhub.py b/youtube_dl/extractor/pornhub.py index a656ad8..965940a 100644 --- a/youtube_dl/extractor/pornhub.py +++ b/youtube_dl/extractor/pornhub.py @@ -8,10 +8,10 @@ from ..compat import ( compat_urllib_parse_unquote, compat_urllib_parse_unquote_plus, compat_urllib_parse_urlparse, - compat_urllib_request, ) from ..utils import ( ExtractorError, + sanitized_Request, str_to_int, ) from ..aes import ( @@ -53,7 +53,7 @@ class PornHubIE(InfoExtractor): def _real_extract(self, url): video_id = self._match_id(url) - req = compat_urllib_request.Request( + req = sanitized_Request( 'http://www.pornhub.com/view_video.php?viewkey=%s' % video_id) req.add_header('Cookie', 'age_verified=1') webpage = self._download_webpage(req, video_id) diff --git a/youtube_dl/extractor/pornotube.py b/youtube_dl/extractor/pornotube.py index 34735c5..5398e70 100644 --- a/youtube_dl/extractor/pornotube.py +++ b/youtube_dl/extractor/pornotube.py @@ -3,11 +3,9 @@ from __future__ import unicode_literals import json from .common import InfoExtractor -from ..compat import ( - compat_urllib_request, -) from ..utils import ( int_or_none, + sanitized_Request, ) @@ -46,7 +44,7 @@ class PornotubeIE(InfoExtractor): 'authenticationSpaceKey': originAuthenticationSpaceKey, 'credentials': 'Clip Application', } - token_req = compat_urllib_request.Request( + token_req = sanitized_Request( 'https://api.aebn.net/auth/v1/token/primal', data=json.dumps(token_req_data).encode('utf-8')) token_req.add_header('Content-Type', 'application/json') @@ -56,7 +54,7 @@ class PornotubeIE(InfoExtractor): token = token_answer['tokenKey'] # Get video URL - delivery_req = compat_urllib_request.Request( + delivery_req = sanitized_Request( 'https://api.aebn.net/delivery/v1/clips/%s/MP4' % video_id) delivery_req.add_header('Authorization', token) delivery_info = self._download_json( @@ -64,7 +62,7 @@ class 
PornotubeIE(InfoExtractor): video_url = delivery_info['mediaUrl'] # Get additional info (title etc.) - info_req = compat_urllib_request.Request( + info_req = sanitized_Request( 'https://api.aebn.net/content/v1/clips/%s?expand=' 'title,description,primaryImageNumber,startSecond,endSecond,' 'movie.title,movie.MovieId,movie.boxCoverFront,movie.stars,' diff --git a/youtube_dl/extractor/primesharetv.py b/youtube_dl/extractor/primesharetv.py index 304359d..85aae95 100644 --- a/youtube_dl/extractor/primesharetv.py +++ b/youtube_dl/extractor/primesharetv.py @@ -1,11 +1,11 @@ from __future__ import unicode_literals from .common import InfoExtractor -from ..compat import ( - compat_urllib_parse, - compat_urllib_request, +from ..compat import compat_urllib_parse +from ..utils import ( + ExtractorError, + sanitized_Request, ) -from ..utils import ExtractorError class PrimeShareTVIE(InfoExtractor): @@ -41,7 +41,7 @@ class PrimeShareTVIE(InfoExtractor): webpage, 'wait time', default=7)) + 1 self._sleep(wait_time, video_id) - req = compat_urllib_request.Request( + req = sanitized_Request( url, compat_urllib_parse.urlencode(fields), headers) video_page = self._download_webpage( req, video_id, 'Downloading video page') diff --git a/youtube_dl/extractor/promptfile.py b/youtube_dl/extractor/promptfile.py index 8190ed6..d535728 100644 --- a/youtube_dl/extractor/promptfile.py +++ b/youtube_dl/extractor/promptfile.py @@ -4,13 +4,11 @@ from __future__ import unicode_literals import re from .common import InfoExtractor -from ..compat import ( - compat_urllib_parse, - compat_urllib_request, -) +from ..compat import compat_urllib_parse from ..utils import ( determine_ext, ExtractorError, + sanitized_Request, ) @@ -37,7 +35,7 @@ class PromptFileIE(InfoExtractor): fields = self._hidden_inputs(webpage) post = compat_urllib_parse.urlencode(fields) - req = compat_urllib_request.Request(url, post) + req = sanitized_Request(url, post) req.add_header('Content-type', 
'application/x-www-form-urlencoded') webpage = self._download_webpage( req, video_id, 'Downloading video page') diff --git a/youtube_dl/extractor/qqmusic.py b/youtube_dl/extractor/qqmusic.py index c98539f..1ba3bbd 100644 --- a/youtube_dl/extractor/qqmusic.py +++ b/youtube_dl/extractor/qqmusic.py @@ -7,11 +7,11 @@ import re from .common import InfoExtractor from ..utils import ( + sanitized_Request, strip_jsonp, unescapeHTML, clean_html, ) -from ..compat import compat_urllib_request class QQMusicIE(InfoExtractor): @@ -201,7 +201,7 @@ class QQMusicSingerIE(QQPlaylistBaseIE): singer_desc = None if singer_id: - req = compat_urllib_request.Request( + req = sanitized_Request( 'http://s.plcloud.music.qq.com/fcgi-bin/fcg_get_singer_desc.fcg?utf8=1&outCharset=utf-8&format=xml&singerid=%s' % singer_id) req.add_header( 'Referer', 'http://s.plcloud.music.qq.com/xhr_proxy_utf8.html') diff --git a/youtube_dl/extractor/rtve.py b/youtube_dl/extractor/rtve.py index 5b97d33..603d7bd 100644 --- a/youtube_dl/extractor/rtve.py +++ b/youtube_dl/extractor/rtve.py @@ -6,11 +6,11 @@ import re import time from .common import InfoExtractor -from ..compat import compat_urllib_request, compat_urlparse from ..utils import ( ExtractorError, float_or_none, remove_end, + sanitized_Request, std_headers, struct_unpack, ) @@ -102,20 +102,14 @@ class RTVEALaCartaIE(InfoExtractor): if info['state'] == 'DESPU': raise ExtractorError('The video is no longer available', expected=True) png_url = 'http://www.rtve.es/ztnr/movil/thumbnail/%s/videos/%s.png' % (self._manager, video_id) - png_request = compat_urllib_request.Request(png_url) + png_request = sanitized_Request(png_url) png_request.add_header('Referer', url) png = self._download_webpage(png_request, video_id, 'Downloading url information') video_url = _decrypt_url(png) if not video_url.endswith('.f4m'): - auth_url = video_url.replace( + video_url = video_url.replace( 'resources/', 'auth/resources/' ).replace('.net.rtve', '.multimedia.cdn.rtve') - 
video_path = self._download_webpage( - auth_url, video_id, 'Getting video url') - # Use mvod1.akcdn instead of flash.akamaihd.multimedia.cdn to get - # the right Content-Length header and the mp4 format - video_url = compat_urlparse.urljoin( - 'http://mvod1.akcdn.rtve.es/', video_path) subtitles = None if info.get('sbtFile') is not None: diff --git a/youtube_dl/extractor/rutube.py b/youtube_dl/extractor/rutube.py index d94dc73..6b09550 100644 --- a/youtube_dl/extractor/rutube.py +++ b/youtube_dl/extractor/rutube.py @@ -9,7 +9,7 @@ from ..compat import ( compat_str, ) from ..utils import ( - ExtractorError, + determine_ext, unified_strdate, ) @@ -51,10 +51,25 @@ class RutubeIE(InfoExtractor): 'http://rutube.ru/api/play/options/%s/?format=json' % video_id, video_id, 'Downloading options JSON') - m3u8_url = options['video_balancer'].get('m3u8') - if m3u8_url is None: - raise ExtractorError('Couldn\'t find m3u8 manifest url') - formats = self._extract_m3u8_formats(m3u8_url, video_id, ext='mp4') + formats = [] + for format_id, format_url in options['video_balancer'].items(): + ext = determine_ext(format_url) + if ext == 'm3u8': + m3u8_formats = self._extract_m3u8_formats( + format_url, video_id, 'mp4', m3u8_id=format_id, fatal=False) + if m3u8_formats: + formats.extend(m3u8_formats) + elif ext == 'f4m': + f4m_formats = self._extract_f4m_formats( + format_url, video_id, f4m_id=format_id, fatal=False) + if f4m_formats: + formats.extend(f4m_formats) + else: + formats.append({ + 'url': format_url, + 'format_id': format_id, + }) + self._sort_formats(formats) return { 'id': video['id'], @@ -74,9 +89,9 @@ class RutubeIE(InfoExtractor): class RutubeEmbedIE(InfoExtractor): IE_NAME = 'rutube:embed' IE_DESC = 'Rutube embedded videos' - _VALID_URL = 'https?://rutube\.ru/video/embed/(?P<id>[0-9]+)' + _VALID_URL = 'https?://rutube\.ru/(?:video|play)/embed/(?P<id>[0-9]+)' - _TEST = { + _TESTS = [{ 'url': 'http://rutube.ru/video/embed/6722881?vk_puid37=&vk_puid38=', 'info_dict': { 'id':
'a10e53b86e8f349080f718582ce4c661', @@ -90,7 +105,10 @@ class RutubeEmbedIE(InfoExtractor): 'params': { 'skip_download': 'Requires ffmpeg', }, - } + }, { + 'url': 'http://rutube.ru/play/embed/8083783', + 'only_matching': True, + }] def _real_extract(self, url): embed_id = self._match_id(url) diff --git a/youtube_dl/extractor/ruutu.py b/youtube_dl/extractor/ruutu.py index a16b73f..e417bf6 100644 --- a/youtube_dl/extractor/ruutu.py +++ b/youtube_dl/extractor/ruutu.py @@ -57,16 +57,21 @@ class RuutuIE(InfoExtractor): extract_formats(child) elif child.tag.endswith('File'): video_url = child.text - if not video_url or video_url in processed_urls or 'NOT_USED' in video_url: + if (not video_url or video_url in processed_urls or + any(p in video_url for p in ('NOT_USED', 'NOT-USED'))): return processed_urls.append(video_url) ext = determine_ext(video_url) if ext == 'm3u8': - formats.extend(self._extract_m3u8_formats( - video_url, video_id, 'mp4', m3u8_id='hls')) + m3u8_formats = self._extract_m3u8_formats( + video_url, video_id, 'mp4', m3u8_id='hls', fatal=False) + if m3u8_formats: + formats.extend(m3u8_formats) elif ext == 'f4m': - formats.extend(self._extract_f4m_formats( - video_url, video_id, f4m_id='hds')) + f4m_formats = self._extract_f4m_formats( + video_url, video_id, f4m_id='hds', fatal=False) + if f4m_formats: + formats.extend(f4m_formats) else: proto = compat_urllib_parse_urlparse(video_url).scheme if not child.tag.startswith('HTTP') and proto != 'rtmp': diff --git a/youtube_dl/extractor/safari.py b/youtube_dl/extractor/safari.py index a602af6..9197042 100644 --- a/youtube_dl/extractor/safari.py +++ b/youtube_dl/extractor/safari.py @@ -4,14 +4,12 @@ from __future__ import unicode_literals import re from .common import InfoExtractor -from .brightcove import BrightcoveIE +from .brightcove import BrightcoveLegacyIE -from ..compat import ( - compat_urllib_parse, - compat_urllib_request, -) +from ..compat import compat_urllib_parse from ..utils import ( 
ExtractorError, + sanitized_Request, smuggle_url, std_headers, ) @@ -58,7 +56,7 @@ class SafariBaseIE(InfoExtractor): 'next': '', } - request = compat_urllib_request.Request( + request = sanitized_Request( self._LOGIN_URL, compat_urllib_parse.urlencode(login_form), headers=headers) login_page = self._download_webpage( request, None, 'Logging in as %s' % username) @@ -112,11 +110,11 @@ class SafariIE(SafariBaseIE): '%s/%s/chapter-content/%s.html' % (self._API_BASE, course_id, part), part) - bc_url = BrightcoveIE._extract_brightcove_url(webpage) + bc_url = BrightcoveLegacyIE._extract_brightcove_url(webpage) if not bc_url: raise ExtractorError('Could not extract Brightcove URL from %s' % url, expected=True) - return self.url_result(smuggle_url(bc_url, {'Referer': url}), 'Brightcove') + return self.url_result(smuggle_url(bc_url, {'Referer': url}), 'BrightcoveLegacy') class SafariCourseIE(SafariBaseIE): diff --git a/youtube_dl/extractor/sandia.py b/youtube_dl/extractor/sandia.py index 9c88167..759898a 100644 --- a/youtube_dl/extractor/sandia.py +++ b/youtube_dl/extractor/sandia.py @@ -6,14 +6,12 @@ import json import re from .common import InfoExtractor -from ..compat import ( - compat_urllib_request, - compat_urlparse, -) +from ..compat import compat_urlparse from ..utils import ( int_or_none, js_to_json, mimetype2ext, + sanitized_Request, unified_strdate, ) @@ -37,7 +35,7 @@ class SandiaIE(InfoExtractor): def _real_extract(self, url): video_id = self._match_id(url) - req = compat_urllib_request.Request(url) + req = sanitized_Request(url) req.add_header('Cookie', 'MediasitePlayerCaps=ClientPlugins=4') webpage = self._download_webpage(req, video_id) diff --git a/youtube_dl/extractor/shared.py b/youtube_dl/extractor/shared.py index c5636e8..8eda3c8 100644 --- a/youtube_dl/extractor/shared.py +++ b/youtube_dl/extractor/shared.py @@ -3,13 +3,11 @@ from __future__ import unicode_literals import base64 from .common import InfoExtractor -from ..compat import ( - 
compat_urllib_parse, - compat_urllib_request, -) +from ..compat import compat_urllib_parse from ..utils import ( ExtractorError, int_or_none, + sanitized_Request, ) @@ -46,7 +44,7 @@ class SharedIE(InfoExtractor): 'Video %s does not exist' % video_id, expected=True) download_form = self._hidden_inputs(webpage) - request = compat_urllib_request.Request( + request = sanitized_Request( url, compat_urllib_parse.urlencode(download_form)) request.add_header('Content-Type', 'application/x-www-form-urlencoded') diff --git a/youtube_dl/extractor/sharesix.py b/youtube_dl/extractor/sharesix.py index ac3e3ad..f1ea9bd 100644 --- a/youtube_dl/extractor/sharesix.py +++ b/youtube_dl/extractor/sharesix.py @@ -4,12 +4,10 @@ from __future__ import unicode_literals import re from .common import InfoExtractor -from ..compat import ( - compat_urllib_parse, - compat_urllib_request, -) +from ..compat import compat_urllib_parse from ..utils import ( parse_duration, + sanitized_Request, ) @@ -50,7 +48,7 @@ class ShareSixIE(InfoExtractor): 'method_free': 'Free' } post = compat_urllib_parse.urlencode(fields) - req = compat_urllib_request.Request(url, post) + req = sanitized_Request(url, post) req.add_header('Content-type', 'application/x-www-form-urlencoded') webpage = self._download_webpage(req, video_id, diff --git a/youtube_dl/extractor/sina.py b/youtube_dl/extractor/sina.py index 0891a44..b2258a0 100644 --- a/youtube_dl/extractor/sina.py +++ b/youtube_dl/extractor/sina.py @@ -4,10 +4,8 @@ from __future__ import unicode_literals import re from .common import InfoExtractor -from ..compat import ( - compat_urllib_request, - compat_urllib_parse, -) +from ..compat import compat_urllib_parse +from ..utils import sanitized_Request class SinaIE(InfoExtractor): @@ -61,7 +59,7 @@ class SinaIE(InfoExtractor): if mobj.group('token') is not None: # The video id is in the redirected url self.to_screen('Getting video id') - request = compat_urllib_request.Request(url) + request = sanitized_Request(url) 
request.get_method = lambda: 'HEAD' (_, urlh) = self._download_webpage_handle(request, 'NA', False) return self._real_extract(urlh.geturl()) diff --git a/youtube_dl/extractor/smotri.py b/youtube_dl/extractor/smotri.py index 35a81ee..30210c8 100644 --- a/youtube_dl/extractor/smotri.py +++ b/youtube_dl/extractor/smotri.py @@ -7,13 +7,11 @@ import hashlib import uuid from .common import InfoExtractor -from ..compat import ( - compat_urllib_parse, - compat_urllib_request, -) +from ..compat import compat_urllib_parse from ..utils import ( ExtractorError, int_or_none, + sanitized_Request, unified_strdate, ) @@ -176,7 +174,7 @@ class SmotriIE(InfoExtractor): if video_password: video_form['pass'] = hashlib.md5(video_password.encode('utf-8')).hexdigest() - request = compat_urllib_request.Request( + request = sanitized_Request( 'http://smotri.com/video/view/url/bot/', compat_urllib_parse.urlencode(video_form)) request.add_header('Content-Type', 'application/x-www-form-urlencoded') @@ -339,7 +337,7 @@ class SmotriBroadcastIE(InfoExtractor): 'password': password, } - request = compat_urllib_request.Request( + request = sanitized_Request( broadcast_url + '/?no_redirect=1', compat_urllib_parse.urlencode(login_form)) request.add_header('Content-Type', 'application/x-www-form-urlencoded') broadcast_page = self._download_webpage( diff --git a/youtube_dl/extractor/sohu.py b/youtube_dl/extractor/sohu.py index ba2d5e1..daf6ad5 100644 --- a/youtube_dl/extractor/sohu.py +++ b/youtube_dl/extractor/sohu.py @@ -6,11 +6,11 @@ import re from .common import InfoExtractor from ..compat import ( compat_str, - compat_urllib_request, compat_urllib_parse, ) from ..utils import ( ExtractorError, + sanitized_Request, ) @@ -96,7 +96,7 @@ class SohuIE(InfoExtractor): else: base_data_url = 'http://hot.vrs.sohu.com/vrs_flash.action?vid=' - req = compat_urllib_request.Request(base_data_url + vid_id) + req = sanitized_Request(base_data_url + vid_id) cn_verification_proxy = 
self._downloader.params.get('cn_verification_proxy') if cn_verification_proxy: diff --git a/youtube_dl/extractor/soundcloud.py b/youtube_dl/extractor/soundcloud.py index 2b60d35..02e64e0 100644 --- a/youtube_dl/extractor/soundcloud.py +++ b/youtube_dl/extractor/soundcloud.py @@ -4,13 +4,17 @@ from __future__ import unicode_literals import re import itertools -from .common import InfoExtractor +from .common import ( + InfoExtractor, + SearchInfoExtractor +) from ..compat import ( compat_str, compat_urlparse, compat_urllib_parse, ) from ..utils import ( + encode_dict, ExtractorError, int_or_none, unified_strdate, @@ -469,3 +473,60 @@ class SoundcloudPlaylistIE(SoundcloudIE): 'description': data.get('description'), 'entries': entries, } + + +class SoundcloudSearchIE(SearchInfoExtractor, SoundcloudIE): + IE_NAME = 'soundcloud:search' + IE_DESC = 'Soundcloud search' + _MAX_RESULTS = float('inf') + _TESTS = [{ + 'url': 'scsearch15:post-avant jazzcore', + 'info_dict': { + 'title': 'post-avant jazzcore', + }, + 'playlist_count': 15, + }] + + _SEARCH_KEY = 'scsearch' + _MAX_RESULTS_PER_PAGE = 200 + _DEFAULT_RESULTS_PER_PAGE = 50 + _API_V2_BASE = 'https://api-v2.soundcloud.com' + + def _get_collection(self, endpoint, collection_id, **query): + limit = min( + query.get('limit', self._DEFAULT_RESULTS_PER_PAGE), + self._MAX_RESULTS_PER_PAGE) + query['limit'] = limit + query['client_id'] = self._CLIENT_ID + query['linked_partitioning'] = '1' + query['offset'] = 0 + data = compat_urllib_parse.urlencode(encode_dict(query)) + next_url = '{0}{1}?{2}'.format(self._API_V2_BASE, endpoint, data) + + collected_results = 0 + + for i in itertools.count(1): + response = self._download_json( + next_url, collection_id, 'Downloading page {0}'.format(i), + 'Unable to download API page') + + collection = response.get('collection', []) + if not collection: + break + + collection = list(filter(bool, collection)) + collected_results += len(collection) + + for item in collection: + yield 
self.url_result(item['uri'], SoundcloudIE.ie_key()) + + if not collection or collected_results >= limit: + break + + next_url = response.get('next_href') + if not next_url: + break + + def _get_n_results(self, query, n): + tracks = self._get_collection('/search/tracks', query, limit=n, q=query) + return self.playlist_result(tracks, playlist_title=query) diff --git a/youtube_dl/extractor/space.py b/youtube_dl/extractor/space.py index c2d0d36..ebb5d6e 100644 --- a/youtube_dl/extractor/space.py +++ b/youtube_dl/extractor/space.py @@ -3,14 +3,14 @@ from __future__ import unicode_literals import re from .common import InfoExtractor -from .brightcove import BrightcoveIE +from .brightcove import BrightcoveLegacyIE from ..utils import RegexNotFoundError, ExtractorError class SpaceIE(InfoExtractor): _VALID_URL = r'https?://(?:(?:www|m)\.)?space\.com/\d+-(?P<title>[^/\.\?]*?)-video\.html' _TEST = { - 'add_ie': ['Brightcove'], + 'add_ie': ['BrightcoveLegacy'], 'url': 'http://www.space.com/23373-huge-martian-landforms-detail-revealed-by-european-probe-video.html', 'info_dict': { 'id': '2780937028001', @@ -31,8 +31,8 @@ class SpaceIE(InfoExtractor): brightcove_url = self._og_search_video_url(webpage) except RegexNotFoundError: # Other videos works fine with the info from the object - brightcove_url = BrightcoveIE._extract_brightcove_url(webpage) + brightcove_url = BrightcoveLegacyIE._extract_brightcove_url(webpage) if brightcove_url is None: raise ExtractorError( 'The webpage does not contain a video', expected=True) - return self.url_result(brightcove_url, BrightcoveIE.ie_key()) + return self.url_result(brightcove_url, BrightcoveLegacyIE.ie_key()) diff --git a/youtube_dl/extractor/spankwire.py b/youtube_dl/extractor/spankwire.py index 9e8fb35..692fd78 100644 --- a/youtube_dl/extractor/spankwire.py +++ b/youtube_dl/extractor/spankwire.py @@ -6,9 +6,9 @@ from .common import InfoExtractor from ..compat import ( compat_urllib_parse_unquote, compat_urllib_parse_urlparse, -
compat_urllib_request, ) from ..utils import ( + sanitized_Request, str_to_int, unified_strdate, ) @@ -51,7 +51,7 @@ class SpankwireIE(InfoExtractor): mobj = re.match(self._VALID_URL, url) video_id = mobj.group('id') - req = compat_urllib_request.Request('http://www.' + mobj.group('url')) + req = sanitized_Request('http://www.' + mobj.group('url')) req.add_header('Cookie', 'age_verified=1') webpage = self._download_webpage(req, video_id) diff --git a/youtube_dl/extractor/sportdeutschland.py b/youtube_dl/extractor/sportdeutschland.py index 7ec6c61..ebb75f0 100644 --- a/youtube_dl/extractor/sportdeutschland.py +++ b/youtube_dl/extractor/sportdeutschland.py @@ -4,11 +4,9 @@ from __future__ import unicode_literals import re from .common import InfoExtractor -from ..compat import ( - compat_urllib_request, -) from ..utils import ( parse_iso8601, + sanitized_Request, ) @@ -54,7 +52,7 @@ class SportDeutschlandIE(InfoExtractor): api_url = 'http://proxy.vidibusdynamic.net/sportdeutschland.tv/api/permalinks/%s/%s?access_token=true' % ( sport_id, video_id) - req = compat_urllib_request.Request(api_url, headers={ + req = sanitized_Request(api_url, headers={ 'Accept': 'application/vnd.vidibus.v2.html+json', 'Referer': url, }) diff --git a/youtube_dl/extractor/streamcloud.py b/youtube_dl/extractor/streamcloud.py index d4e1340..77841b9 100644 --- a/youtube_dl/extractor/streamcloud.py +++ b/youtube_dl/extractor/streamcloud.py @@ -4,10 +4,8 @@ from __future__ import unicode_literals import re from .common import InfoExtractor -from ..compat import ( - compat_urllib_parse, - compat_urllib_request, -) +from ..compat import compat_urllib_parse +from ..utils import sanitized_Request class StreamcloudIE(InfoExtractor): @@ -43,7 +41,7 @@ class StreamcloudIE(InfoExtractor): headers = { b'Content-Type': b'application/x-www-form-urlencoded', } - req = compat_urllib_request.Request(url, post, headers) + req = sanitized_Request(url, post, headers) webpage = self._download_webpage( req, 
video_id, note='Downloading video page ...') diff --git a/youtube_dl/extractor/streamcz.py b/youtube_dl/extractor/streamcz.py index e92b932..d3d2b7e 100644 --- a/youtube_dl/extractor/streamcz.py +++ b/youtube_dl/extractor/streamcz.py @@ -5,11 +5,9 @@ import hashlib import time from .common import InfoExtractor -from ..compat import ( - compat_urllib_request, -) from ..utils import ( int_or_none, + sanitized_Request, ) @@ -54,7 +52,7 @@ class StreamCZIE(InfoExtractor): video_id = self._match_id(url) api_path = '/episode/%s' % video_id - req = compat_urllib_request.Request(self._API_URL + api_path) + req = sanitized_Request(self._API_URL + api_path) req.add_header('Api-Password', _get_api_key(api_path)) data = self._download_json(req, video_id) diff --git a/youtube_dl/extractor/tapely.py b/youtube_dl/extractor/tapely.py index 744f9db..ed560bd 100644 --- a/youtube_dl/extractor/tapely.py +++ b/youtube_dl/extractor/tapely.py @@ -4,14 +4,12 @@ from __future__ import unicode_literals import re from .common import InfoExtractor -from ..compat import ( - compat_urllib_request, -) from ..utils import ( clean_html, ExtractorError, float_or_none, parse_iso8601, + sanitized_Request, ) @@ -53,7 +51,7 @@ class TapelyIE(InfoExtractor): display_id = mobj.group('id') playlist_url = self._API_URL.format(display_id) - request = compat_urllib_request.Request(playlist_url) + request = sanitized_Request(playlist_url) request.add_header('X-Requested-With', 'XMLHttpRequest') request.add_header('Accept', 'application/json') request.add_header('Referer', url) diff --git a/youtube_dl/extractor/theplatform.py b/youtube_dl/extractor/theplatform.py index 25edc31..1555aa7 100644 --- a/youtube_dl/extractor/theplatform.py +++ b/youtube_dl/extractor/theplatform.py @@ -139,6 +139,11 @@ class ThePlatformIE(ThePlatformBaseIE): 'upload_date': '20150701', 'categories': ['Today/Shows/Orange Room', 'Today/Sections/Money', 'Today/Topics/Tech', "Today/Topics/Editor's picks"], }, + }, { + # From 
http://www.nbc.com/the-blacklist/video/sir-crispin-crandall/2928790?onid=137781#vc137781=1 + # geo-restricted (US), HLS encrypted with AES-128 + 'url': 'http://player.theplatform.com/p/NnzsPC/onsite_universal/select/media/guid/2410887629/2928790?fwsitesection=nbc_the_blacklist_video_library&autoPlay=true&carouselID=137781', + 'only_matching': True, }] @staticmethod @@ -182,8 +187,12 @@ class ThePlatformIE(ThePlatformBaseIE): # Seems there's no pattern for the interested script filename, so # I try one by one for script in reversed(scripts): - feed_script = self._download_webpage(script, video_id, 'Downloading feed script') - feed_id = self._search_regex(r'defaultFeedId\s*:\s*"([^"]+)"', feed_script, 'default feed id', default=None) + feed_script = self._download_webpage( + self._proto_relative_url(script, 'http:'), + video_id, 'Downloading feed script') + feed_id = self._search_regex( + r'defaultFeedId\s*:\s*"([^"]+)"', feed_script, + 'default feed id', default=None) if feed_id is not None: break if feed_id is None: @@ -193,6 +202,15 @@ class ThePlatformIE(ThePlatformBaseIE): if smuggled_data.get('force_smil_url', False): smil_url = url + # Explicitly specified SMIL (see https://github.com/rg3/youtube-dl/issues/7385) + elif '/guid/' in url: + webpage = self._download_webpage(url, video_id) + smil_url = self._search_regex( + r'<link[^>]+href=(["\'])(?P<url>.+?)\1[^>]+type=["\']application/smil\+xml', + webpage, 'smil url', group='url') + path = self._search_regex( + r'link\.theplatform\.com/s/((?:[^/?#&]+/)+[^/?#&]+)', smil_url, 'path') + smil_url += '?' if '?' 
not in smil_url else '&' + 'formats=m3u,mpeg4&format=SMIL' elif mobj.group('config'): config_url = url + '&form=json' config_url = config_url.replace('swf/', 'config/') diff --git a/youtube_dl/extractor/tlc.py b/youtube_dl/extractor/tlc.py index 1326361..d6d038a 100644 --- a/youtube_dl/extractor/tlc.py +++ b/youtube_dl/extractor/tlc.py @@ -3,7 +3,7 @@ from __future__ import unicode_literals import re from .common import InfoExtractor -from .brightcove import BrightcoveIE +from .brightcove import BrightcoveLegacyIE from .discovery import DiscoveryIE from ..compat import compat_urlparse @@ -66,6 +66,6 @@ class TlcDeIE(InfoExtractor): return { '_type': 'url', - 'url': BrightcoveIE._extract_brightcove_url(iframe), - 'ie': BrightcoveIE.ie_key(), + 'url': BrightcoveLegacyIE._extract_brightcove_url(iframe), + 'ie': BrightcoveLegacyIE.ie_key(), } diff --git a/youtube_dl/extractor/tube8.py b/youtube_dl/extractor/tube8.py index c9cb693..46ef61f 100644 --- a/youtube_dl/extractor/tube8.py +++ b/youtube_dl/extractor/tube8.py @@ -4,12 +4,10 @@ import json import re from .common import InfoExtractor -from ..compat import ( - compat_urllib_parse_urlparse, - compat_urllib_request, -) +from ..compat import compat_urllib_parse_urlparse from ..utils import ( int_or_none, + sanitized_Request, str_to_int, ) from ..aes import aes_decrypt_text @@ -42,7 +40,7 @@ class Tube8IE(InfoExtractor): video_id = mobj.group('id') display_id = mobj.group('display_id') - req = compat_urllib_request.Request(url) + req = sanitized_Request(url) req.add_header('Cookie', 'age_verified=1') webpage = self._download_webpage(req, display_id) diff --git a/youtube_dl/extractor/tubitv.py b/youtube_dl/extractor/tubitv.py index 4f86b3e..6d78b5d 100644 --- a/youtube_dl/extractor/tubitv.py +++ b/youtube_dl/extractor/tubitv.py @@ -5,13 +5,11 @@ import codecs import re from .common import InfoExtractor -from ..compat import ( - compat_urllib_parse, - compat_urllib_request -) +from ..compat import compat_urllib_parse 
from ..utils import ( ExtractorError, int_or_none, + sanitized_Request, ) @@ -44,7 +42,7 @@ class TubiTvIE(InfoExtractor): 'password': password, } payload = compat_urllib_parse.urlencode(form_data).encode('utf-8') - request = compat_urllib_request.Request(self._LOGIN_URL, payload) + request = sanitized_Request(self._LOGIN_URL, payload) request.add_header('Content-Type', 'application/x-www-form-urlencoded') login_page = self._download_webpage( request, None, False, 'Wrong login info') diff --git a/youtube_dl/extractor/twitch.py b/youtube_dl/extractor/twitch.py index 3ec08b6..69882da 100644 --- a/youtube_dl/extractor/twitch.py +++ b/youtube_dl/extractor/twitch.py @@ -11,7 +11,6 @@ from ..compat import ( compat_str, compat_urllib_parse, compat_urllib_parse_urlparse, - compat_urllib_request, compat_urlparse, ) from ..utils import ( @@ -20,6 +19,7 @@ from ..utils import ( int_or_none, parse_duration, parse_iso8601, + sanitized_Request, ) @@ -48,7 +48,7 @@ class TwitchBaseIE(InfoExtractor): for cookie in self._downloader.cookiejar: if cookie.name == 'api_token': headers['Twitch-Api-Token'] = cookie.value - request = compat_urllib_request.Request(url, headers=headers) + request = sanitized_Request(url, headers=headers) response = super(TwitchBaseIE, self)._download_json(request, video_id, note) self._handle_error(response) return response @@ -80,7 +80,7 @@ class TwitchBaseIE(InfoExtractor): if not post_url.startswith('http'): post_url = compat_urlparse.urljoin(redirect_url, post_url) - request = compat_urllib_request.Request( + request = sanitized_Request( post_url, compat_urllib_parse.urlencode(encode_dict(login_form)).encode('utf-8')) request.add_header('Referer', redirect_url) response = self._download_webpage( diff --git a/youtube_dl/extractor/twitter.py b/youtube_dl/extractor/twitter.py index 9d3e46b..a161f04 100644 --- a/youtube_dl/extractor/twitter.py +++ b/youtube_dl/extractor/twitter.py @@ -4,11 +4,13 @@ from __future__ import unicode_literals import re from 
.common import InfoExtractor -from ..compat import compat_urllib_request from ..utils import ( float_or_none, xpath_text, remove_end, + int_or_none, + ExtractorError, + sanitized_Request, ) @@ -18,7 +20,7 @@ class TwitterCardIE(InfoExtractor): _TESTS = [ { 'url': 'https://twitter.com/i/cards/tfw/v1/560070183650213889', - 'md5': '7d2f6b4d2eb841a7ccc893d479bfceb4', + 'md5': '4fa26a35f9d1bf4b646590ba8e84be19', 'info_dict': { 'id': '560070183650213889', 'ext': 'mp4', @@ -50,6 +52,20 @@ class TwitterCardIE(InfoExtractor): 'uploader': 'OMG! Ubuntu!', 'uploader_id': 'omgubuntu', }, + 'add_ie': ['Youtube'], + }, + { + 'url': 'https://twitter.com/i/cards/tfw/v1/665289828897005568', + 'md5': 'ab2745d0b0ce53319a534fccaa986439', + 'info_dict': { + 'id': 'iBb2x00UVlv', + 'ext': 'mp4', + 'upload_date': '20151113', + 'uploader_id': '1189339351084113920', + 'uploader': '@ArsenalTerje', + 'title': 'Vine by @ArsenalTerje', + }, + 'add_ie': ['Vine'], } ] @@ -65,15 +81,15 @@ class TwitterCardIE(InfoExtractor): config = None formats = [] for user_agent in USER_AGENTS: - request = compat_urllib_request.Request(url) + request = sanitized_Request(url) request.add_header('User-Agent', user_agent) webpage = self._download_webpage(request, video_id) - youtube_url = self._html_search_regex( - r'<iframe[^>]+src="((?:https?:)?//www.youtube.com/embed/[^"]+)"', - webpage, 'youtube iframe', default=None) - if youtube_url: - return self.url_result(youtube_url, 'Youtube') + iframe_url = self._html_search_regex( + r'<iframe[^>]+src="((?:https?:)?//(?:www.youtube.com/embed/[^"]+|(?:www\.)?vine\.co/v/\w+/card))"', + webpage, 'video iframe', default=None) + if iframe_url: + return self.url_result(iframe_url) config = self._parse_json(self._html_search_regex( r'data-player-config="([^"]+)"', webpage, 'data player config'), @@ -120,9 +136,9 @@ class TwitterIE(InfoExtractor): _VALID_URL = r'https?://(?:www\.|m\.|mobile\.)?twitter\.com/(?P<user_id>[^/]+)/status/(?P<id>\d+)' _TEMPLATE_URL = 
'https://twitter.com/%s/status/%s' - _TEST = { + _TESTS = [{ 'url': 'https://twitter.com/freethenipple/status/643211948184596480', - 'md5': '31cd83a116fc41f99ae3d909d4caf6a0', + 'md5': 'db6612ec5d03355953c3ca9250c97e5e', 'info_dict': { 'id': '643211948184596480', 'ext': 'mp4', @@ -133,7 +149,30 @@ class TwitterIE(InfoExtractor): 'uploader': 'FREE THE NIPPLE', 'uploader_id': 'freethenipple', }, - } + }, { + 'url': 'https://twitter.com/giphz/status/657991469417025536/photo/1', + 'md5': 'f36dcd5fb92bf7057f155e7d927eeb42', + 'info_dict': { + 'id': '657991469417025536', + 'ext': 'mp4', + 'title': 'Gifs - tu vai cai tu vai cai tu nao eh capaz disso tu vai cai', + 'description': 'Gifs on Twitter: "tu vai cai tu vai cai tu nao eh capaz disso tu vai cai https://t.co/tM46VHFlO5"', + 'thumbnail': 're:^https?://.*\.png', + 'uploader': 'Gifs', + 'uploader_id': 'giphz', + }, + }, { + 'url': 'https://twitter.com/starwars/status/665052190608723968', + 'md5': '39b7199856dee6cd4432e72c74bc69d4', + 'info_dict': { + 'id': '665052190608723968', + 'ext': 'mp4', + 'title': 'Star Wars - A new beginning is coming December 18. Watch the official 60 second #TV spot for #StarWars: #TheForceAwakens.', + 'description': 'Star Wars on Twitter: "A new beginning is coming December 18. 
Watch the official 60 second #TV spot for #StarWars: #TheForceAwakens."', + 'uploader_id': 'starwars', + 'uploader': 'Star Wars', + }, + }] def _real_extract(self, url): mobj = re.match(self._VALID_URL, url) @@ -144,23 +183,46 @@ class TwitterIE(InfoExtractor): username = remove_end(self._og_search_title(webpage), ' on Twitter') - title = self._og_search_description(webpage).strip('').replace('\n', ' ') + title = description = self._og_search_description(webpage).strip('').replace('\n', ' ').strip('“”') # strip 'https -_t.co_BJYgOjSeGA' junk from filenames - mobj = re.match(r'“(.*)\s+(https?://[^ ]+)”', title) - title, short_url = mobj.groups() - - card_id = self._search_regex( - r'["\']/i/cards/tfw/v1/(\d+)', webpage, 'twitter card url') - card_url = 'https://twitter.com/i/cards/tfw/v1/' + card_id + title = re.sub(r'\s+(https?://[^ ]+)', '', title) - return { - '_type': 'url_transparent', - 'ie_key': 'TwitterCard', + info = { 'uploader_id': user_id, 'uploader': username, - 'url': card_url, 'webpage_url': url, - 'description': '%s on Twitter: "%s %s"' % (username, title, short_url), + 'description': '%s on Twitter: "%s"' % (username, description), 'title': username + ' - ' + title, } + + card_id = self._search_regex( + r'["\']/i/cards/tfw/v1/(\d+)', webpage, 'twitter card url', default=None) + if card_id: + card_url = 'https://twitter.com/i/cards/tfw/v1/' + card_id + info.update({ + '_type': 'url_transparent', + 'ie_key': 'TwitterCard', + 'url': card_url, + }) + return info + + mobj = re.search(r'''(?x) + <video[^>]+class="animated-gif"[^>]+ + (?:data-height="(?P<height>\d+)")?[^>]+ + (?:data-width="(?P<width>\d+)")?[^>]+ + (?:poster="(?P<poster>[^"]+)")?[^>]*>\s* + <source[^>]+video-src="(?P<url>[^"]+)" + ''', webpage) + + if mobj: + info.update({ + 'id': twid, + 'url': mobj.group('url'), + 'height': int_or_none(mobj.group('height')), + 'width': int_or_none(mobj.group('width')), + 'thumbnail': mobj.group('poster'), + }) + return info + + raise 
ExtractorError('There\'s not video in this tweet.') diff --git a/youtube_dl/extractor/udemy.py b/youtube_dl/extractor/udemy.py index 365d8b4..8251728 100644 --- a/youtube_dl/extractor/udemy.py +++ b/youtube_dl/extractor/udemy.py @@ -9,6 +9,7 @@ from ..compat import ( ) from ..utils import ( ExtractorError, + sanitized_Request, ) @@ -58,7 +59,7 @@ class UdemyIE(InfoExtractor): for header, value in headers.items(): url_or_request.add_header(header, value) else: - url_or_request = compat_urllib_request.Request(url_or_request, headers=headers) + url_or_request = sanitized_Request(url_or_request, headers=headers) response = super(UdemyIE, self)._download_json(url_or_request, video_id, note) self._handle_error(response) @@ -89,7 +90,7 @@ class UdemyIE(InfoExtractor): 'password': password.encode('utf-8'), }) - request = compat_urllib_request.Request( + request = sanitized_Request( self._LOGIN_URL, compat_urllib_parse.urlencode(login_form).encode('utf-8')) request.add_header('Referer', self._ORIGIN_URL) request.add_header('Origin', self._ORIGIN_URL) diff --git a/youtube_dl/extractor/udn.py b/youtube_dl/extractor/udn.py index 2151f83..ee35b72 100644 --- a/youtube_dl/extractor/udn.py +++ b/youtube_dl/extractor/udn.py @@ -12,7 +12,8 @@ from ..compat import compat_urlparse class UDNEmbedIE(InfoExtractor): IE_DESC = '聯合影音' - _VALID_URL = r'https?://video\.udn\.com/(?:embed|play)/news/(?P<id>\d+)' + _PROTOCOL_RELATIVE_VALID_URL = r'//video\.udn\.com/(?:embed|play)/news/(?P<id>\d+)' + _VALID_URL = r'https?:' + _PROTOCOL_RELATIVE_VALID_URL _TESTS = [{ 'url': 'http://video.udn.com/embed/news/300040', 'md5': 'de06b4c90b042c128395a88f0384817e', diff --git a/youtube_dl/extractor/vbox7.py b/youtube_dl/extractor/vbox7.py index 722eb52..1e740fb 100644 --- a/youtube_dl/extractor/vbox7.py +++ b/youtube_dl/extractor/vbox7.py @@ -4,11 +4,11 @@ from __future__ import unicode_literals from .common import InfoExtractor from ..compat import ( compat_urllib_parse, - compat_urllib_request, 
compat_urlparse, ) from ..utils import ( ExtractorError, + sanitized_Request, ) @@ -49,7 +49,7 @@ class Vbox7IE(InfoExtractor): info_url = "http://vbox7.com/play/magare.do" data = compat_urllib_parse.urlencode({'as3': '1', 'vid': video_id}) - info_request = compat_urllib_request.Request(info_url, data) + info_request = sanitized_Request(info_url, data) info_request.add_header('Content-Type', 'application/x-www-form-urlencoded') info_response = self._download_webpage(info_request, video_id, 'Downloading info webpage') if info_response is None: diff --git a/youtube_dl/extractor/veoh.py b/youtube_dl/extractor/veoh.py index 01e258e..9633f7f 100644 --- a/youtube_dl/extractor/veoh.py +++ b/youtube_dl/extractor/veoh.py @@ -4,12 +4,10 @@ import re import json from .common import InfoExtractor -from ..compat import ( - compat_urllib_request, -) from ..utils import ( int_or_none, ExtractorError, + sanitized_Request, ) @@ -110,7 +108,7 @@ class VeohIE(InfoExtractor): if 'class="adultwarning-container"' in webpage: self.report_age_confirmation() age_limit = 18 - request = compat_urllib_request.Request(url) + request = sanitized_Request(url) request.add_header('Cookie', 'confirmedAdult=true') webpage = self._download_webpage(request, video_id) diff --git a/youtube_dl/extractor/vessel.py b/youtube_dl/extractor/vessel.py index 3c8d2a9..1a0ff33 100644 --- a/youtube_dl/extractor/vessel.py +++ b/youtube_dl/extractor/vessel.py @@ -4,10 +4,10 @@ from __future__ import unicode_literals import json from .common import InfoExtractor -from ..compat import compat_urllib_request from ..utils import ( ExtractorError, parse_iso8601, + sanitized_Request, ) @@ -33,7 +33,7 @@ class VesselIE(InfoExtractor): @staticmethod def make_json_request(url, data): payload = json.dumps(data).encode('utf-8') - req = compat_urllib_request.Request(url, payload) + req = sanitized_Request(url, payload) req.add_header('Content-Type', 'application/json; charset=utf-8') return req diff --git 
a/youtube_dl/extractor/vevo.py b/youtube_dl/extractor/vevo.py index 4c0de35..5712894 100644 --- a/youtube_dl/extractor/vevo.py +++ b/youtube_dl/extractor/vevo.py @@ -3,13 +3,11 @@ from __future__ import unicode_literals import re from .common import InfoExtractor -from ..compat import ( - compat_etree_fromstring, - compat_urllib_request, -) +from ..compat import compat_etree_fromstring from ..utils import ( ExtractorError, int_or_none, + sanitized_Request, ) @@ -73,7 +71,7 @@ class VevoIE(InfoExtractor): _SMIL_BASE_URL = 'http://smil.lvl3.vevo.com/' def _real_initialize(self): - req = compat_urllib_request.Request( + req = sanitized_Request( 'http://www.vevo.com/auth', data=b'') webpage = self._download_webpage( req, None, diff --git a/youtube_dl/extractor/viddler.py b/youtube_dl/extractor/viddler.py index 8516a29..40ffbad 100644 --- a/youtube_dl/extractor/viddler.py +++ b/youtube_dl/extractor/viddler.py @@ -4,9 +4,7 @@ from .common import InfoExtractor from ..utils import ( float_or_none, int_or_none, -) -from ..compat import ( - compat_urllib_request + sanitized_Request, ) @@ -65,7 +63,7 @@ class ViddlerIE(InfoExtractor): 'http://api.viddler.com/api/v2/viddler.videos.getPlaybackDetails.json?video_id=%s&key=v0vhrt7bg2xq1vyxhkct' % video_id) headers = {'Referer': 'http://static.cdn-ec.viddler.com/js/arpeggio/v2/embed.html'} - request = compat_urllib_request.Request(json_url, None, headers) + request = sanitized_Request(json_url, None, headers) data = self._download_json(request, video_id)['video'] formats = [] diff --git a/youtube_dl/extractor/videomega.py b/youtube_dl/extractor/videomega.py index 78ff631..87aca32 100644 --- a/youtube_dl/extractor/videomega.py +++ b/youtube_dl/extractor/videomega.py @@ -4,7 +4,7 @@ from __future__ import unicode_literals import re from .common import InfoExtractor -from ..compat import compat_urllib_request +from ..utils import sanitized_Request class VideoMegaIE(InfoExtractor): @@ -30,7 +30,7 @@ class VideoMegaIE(InfoExtractor): 
video_id = self._match_id(url) iframe_url = 'http://videomega.tv/cdn.php?ref=%s' % video_id - req = compat_urllib_request.Request(iframe_url) + req = sanitized_Request(iframe_url) req.add_header('Referer', url) req.add_header('Cookie', 'noadvtday=0') webpage = self._download_webpage(req, video_id) diff --git a/youtube_dl/extractor/viewster.py b/youtube_dl/extractor/viewster.py index 7cf930d..185b1c1 100644 --- a/youtube_dl/extractor/viewster.py +++ b/youtube_dl/extractor/viewster.py @@ -4,7 +4,6 @@ from __future__ import unicode_literals from .common import InfoExtractor from ..compat import ( compat_HTTPError, - compat_urllib_request, compat_urllib_parse, compat_urllib_parse_unquote, ) @@ -13,6 +12,7 @@ from ..utils import ( ExtractorError, int_or_none, parse_iso8601, + sanitized_Request, HEADRequest, ) @@ -76,7 +76,7 @@ class ViewsterIE(InfoExtractor): _ACCEPT_HEADER = 'application/json, text/javascript, */*; q=0.01' def _download_json(self, url, video_id, note='Downloading JSON metadata', fatal=True): - request = compat_urllib_request.Request(url) + request = sanitized_Request(url) request.add_header('Accept', self._ACCEPT_HEADER) request.add_header('Auth-token', self._AUTH_TOKEN) return super(ViewsterIE, self)._download_json(request, video_id, note, fatal=fatal) diff --git a/youtube_dl/extractor/viki.py b/youtube_dl/extractor/viki.py index ddbd395..a63c236 100644 --- a/youtube_dl/extractor/viki.py +++ b/youtube_dl/extractor/viki.py @@ -7,14 +7,14 @@ import hmac import hashlib import itertools +from .common import InfoExtractor from ..utils import ( ExtractorError, int_or_none, parse_age_limit, parse_iso8601, + sanitized_Request, ) -from ..compat import compat_urllib_request -from .common import InfoExtractor class VikiBaseIE(InfoExtractor): @@ -43,7 +43,7 @@ class VikiBaseIE(InfoExtractor): hashlib.sha1 ).hexdigest() url = self._API_URL_TEMPLATE % (query, sig) - return compat_urllib_request.Request( + return sanitized_Request( url, 
json.dumps(post_data).encode('utf-8')) if post_data else url def _call_api(self, path, video_id, note, timestamp=None, post_data=None): diff --git a/youtube_dl/extractor/vimeo.py b/youtube_dl/extractor/vimeo.py index ca716c8..f392ccf 100644 --- a/youtube_dl/extractor/vimeo.py +++ b/youtube_dl/extractor/vimeo.py @@ -8,7 +8,6 @@ import itertools from .common import InfoExtractor from ..compat import ( compat_HTTPError, - compat_urllib_request, compat_urlparse, ) from ..utils import ( @@ -17,6 +16,7 @@ from ..utils import ( InAdvancePagedList, int_or_none, RegexNotFoundError, + sanitized_Request, smuggle_url, std_headers, unified_strdate, @@ -47,10 +47,10 @@ class VimeoBaseInfoExtractor(InfoExtractor): 'service': 'vimeo', 'token': token, })) - login_request = compat_urllib_request.Request(self._LOGIN_URL, data) + login_request = sanitized_Request(self._LOGIN_URL, data) login_request.add_header('Content-Type', 'application/x-www-form-urlencoded') - login_request.add_header('Cookie', 'vuid=%s' % vuid) login_request.add_header('Referer', self._LOGIN_URL) + self._set_vimeo_cookie('vuid', vuid) self._download_webpage(login_request, None, False, 'Wrong login info') def _extract_xsrft_and_vuid(self, webpage): @@ -62,6 +62,9 @@ class VimeoBaseInfoExtractor(InfoExtractor): webpage, 'vuid', group='vuid') return xsrft, vuid + def _set_vimeo_cookie(self, name, value): + self._set_cookie('vimeo.com', name, value) + class VimeoIE(VimeoBaseInfoExtractor): """Information extractor for vimeo.com.""" @@ -186,6 +189,10 @@ class VimeoIE(VimeoBaseInfoExtractor): 'note': 'Video not completely processed, "failed" seed status', 'only_matching': True, }, + { + 'url': 'https://vimeo.com/groups/travelhd/videos/22439234', + 'only_matching': True, + }, ] @staticmethod @@ -215,10 +222,10 @@ class VimeoIE(VimeoBaseInfoExtractor): if url.startswith('http://'): # vimeo only supports https now, but the user can give an http url url = url.replace('http://', 'https://') - password_request = 
compat_urllib_request.Request(url + '/password', data) + password_request = sanitized_Request(url + '/password', data) password_request.add_header('Content-Type', 'application/x-www-form-urlencoded') - password_request.add_header('Cookie', 'clip_test2=1; vuid=%s' % vuid) password_request.add_header('Referer', url) + self._set_vimeo_cookie('vuid', vuid) return self._download_webpage( password_request, video_id, 'Verifying the password', 'Wrong password') @@ -229,7 +236,7 @@ class VimeoIE(VimeoBaseInfoExtractor): raise ExtractorError('This video is protected by a password, use the --video-password option') data = urlencode_postdata(encode_dict({'password': password})) pass_url = url + '/check-password' - password_request = compat_urllib_request.Request(pass_url, data) + password_request = sanitized_Request(pass_url, data) password_request.add_header('Content-Type', 'application/x-www-form-urlencoded') return self._download_json( password_request, video_id, @@ -258,7 +265,7 @@ class VimeoIE(VimeoBaseInfoExtractor): url = 'https://vimeo.com/' + video_id # Retrieve video webpage to extract further information - request = compat_urllib_request.Request(url, None, headers) + request = sanitized_Request(url, None, headers) try: webpage = self._download_webpage(request, video_id) except ExtractorError as ee: @@ -384,47 +391,29 @@ class VimeoIE(VimeoBaseInfoExtractor): like_count = None comment_count = None - # Vimeo specific: extract request signature and timestamp - sig = config['request']['signature'] - timestamp = config['request']['timestamp'] - - # Vimeo specific: extract video codec and quality information - # First consider quality, then codecs, then take everything - codecs = [('vp6', 'flv'), ('vp8', 'flv'), ('h264', 'mp4')] - files = {'hd': [], 'sd': [], 'other': []} - config_files = config["video"].get("files") or config["request"].get("files") - for codec_name, codec_extension in codecs: - for quality in config_files.get(codec_name, []): - format_id = 
'-'.join((codec_name, quality)).lower() - key = quality if quality in files else 'other' - video_url = None - if isinstance(config_files[codec_name], dict): - file_info = config_files[codec_name][quality] - video_url = file_info.get('url') - else: - file_info = {} - if video_url is None: - video_url = "http://player.vimeo.com/play_redirect?clip_id=%s&sig=%s&time=%s&quality=%s&codecs=%s&type=moogaloop_local&embed_location=" \ - % (video_id, sig, timestamp, quality, codec_name.upper()) - - files[key].append({ - 'ext': codec_extension, - 'url': video_url, - 'format_id': format_id, - 'width': int_or_none(file_info.get('width')), - 'height': int_or_none(file_info.get('height')), - 'tbr': int_or_none(file_info.get('bitrate')), - }) formats = [] - m3u8_url = config_files.get('hls', {}).get('all') + config_files = config['video'].get('files') or config['request'].get('files', {}) + for f in config_files.get('progressive', []): + video_url = f.get('url') + if not video_url: + continue + formats.append({ + 'url': video_url, + 'format_id': 'http-%s' % f.get('quality'), + 'width': int_or_none(f.get('width')), + 'height': int_or_none(f.get('height')), + 'fps': int_or_none(f.get('fps')), + 'tbr': int_or_none(f.get('bitrate')), + }) + m3u8_url = config_files.get('hls', {}).get('url') if m3u8_url: m3u8_formats = self._extract_m3u8_formats( m3u8_url, video_id, 'mp4', 'm3u8_native', 0, 'hls', fatal=False) if m3u8_formats: formats.extend(m3u8_formats) - for key in ('other', 'sd', 'hd'): - formats += files[key] - self._sort_formats(formats) + # Bitrates are completely broken. Single m3u8 may contain entries in kbps and bps + # at the same time without actual units specified. This lead to wrong sorting. 
+ self._sort_formats(formats, field_preference=('height', 'width', 'fps', 'format_id')) subtitles = {} text_tracks = config['request'].get('text_tracks') @@ -492,17 +481,16 @@ class VimeoChannelIE(VimeoBaseInfoExtractor): password_path = self._search_regex( r'action="([^"]+)"', login_form, 'password URL') password_url = compat_urlparse.urljoin(page_url, password_path) - password_request = compat_urllib_request.Request(password_url, post) + password_request = sanitized_Request(password_url, post) password_request.add_header('Content-type', 'application/x-www-form-urlencoded') - password_request.add_header('Cookie', 'vuid=%s' % vuid) - self._set_cookie('vimeo.com', 'xsrft', token) + self._set_vimeo_cookie('vuid', vuid) + self._set_vimeo_cookie('xsrft', token) return self._download_webpage( password_request, list_id, 'Verifying the password', 'Wrong password') - def _extract_videos(self, list_id, base_url): - video_ids = [] + def _title_and_entries(self, list_id, base_url): for pagenum in itertools.count(1): page_url = self._page_url(base_url, pagenum) webpage = self._download_webpage( @@ -511,18 +499,18 @@ class VimeoChannelIE(VimeoBaseInfoExtractor): if pagenum == 1: webpage = self._login_list_password(page_url, list_id, webpage) + yield self._extract_list_title(webpage) + + for video_id in re.findall(r'id="clip_(\d+?)"', webpage): + yield self.url_result('https://vimeo.com/%s' % video_id, 'Vimeo') - video_ids.extend(re.findall(r'id="clip_(\d+?)"', webpage)) if re.search(self._MORE_PAGES_INDICATOR, webpage, re.DOTALL) is None: break - entries = [self.url_result('https://vimeo.com/%s' % video_id, 'Vimeo') - for video_id in video_ids] - return {'_type': 'playlist', - 'id': list_id, - 'title': self._extract_list_title(webpage), - 'entries': entries, - } + def _extract_videos(self, list_id, base_url): + title_and_entries = self._title_and_entries(list_id, base_url) + list_title = next(title_and_entries) + return self.playlist_result(title_and_entries, list_id, 
list_title) def _real_extract(self, url): mobj = re.match(self._VALID_URL, url) @@ -583,7 +571,7 @@ class VimeoAlbumIE(VimeoChannelIE): class VimeoGroupsIE(VimeoAlbumIE): IE_NAME = 'vimeo:group' - _VALID_URL = r'https://vimeo\.com/groups/(?P<name>[^/]+)' + _VALID_URL = r'https://vimeo\.com/groups/(?P<name>[^/]+)(?:/(?!videos?/\d+)|$)' _TESTS = [{ 'url': 'https://vimeo.com/groups/rolexawards', 'info_dict': { @@ -652,7 +640,7 @@ class VimeoWatchLaterIE(VimeoChannelIE): def _page_url(self, base_url, pagenum): url = '%s/page:%d/' % (base_url, pagenum) - request = compat_urllib_request.Request(url) + request = sanitized_Request(url) # Set the header to get a partial html page with the ids, # the normal page doesn't contain them. request.add_header('X-Requested-With', 'XMLHttpRequest') diff --git a/youtube_dl/extractor/vk.py b/youtube_dl/extractor/vk.py index 01960b8..d99a42a 100644 --- a/youtube_dl/extractor/vk.py +++ b/youtube_dl/extractor/vk.py @@ -8,11 +8,11 @@ from .common import InfoExtractor from ..compat import ( compat_str, compat_urllib_parse, - compat_urllib_request, ) from ..utils import ( ExtractorError, orderedSet, + sanitized_Request, str_to_int, unescapeHTML, unified_strdate, @@ -182,7 +182,7 @@ class VKIE(InfoExtractor): 'pass': password.encode('cp1251'), }) - request = compat_urllib_request.Request( + request = sanitized_Request( 'https://login.vk.com/?act=login', compat_urllib_parse.urlencode(login_form).encode('utf-8')) login_page = self._download_webpage( diff --git a/youtube_dl/extractor/vodlocker.py b/youtube_dl/extractor/vodlocker.py index ccf1928..be0a278 100644 --- a/youtube_dl/extractor/vodlocker.py +++ b/youtube_dl/extractor/vodlocker.py @@ -2,10 +2,8 @@ from __future__ import unicode_literals from .common import InfoExtractor -from ..compat import ( - compat_urllib_parse, - compat_urllib_request, -) +from ..compat import compat_urllib_parse +from ..utils import sanitized_Request class VodlockerIE(InfoExtractor): @@ -31,7 +29,7 @@ class 
VodlockerIE(InfoExtractor): if fields['op'] == 'download1': self._sleep(3, video_id) # they do detect when requests happen too fast! post = compat_urllib_parse.urlencode(fields) - req = compat_urllib_request.Request(url, post) + req = sanitized_Request(url, post) req.add_header('Content-type', 'application/x-www-form-urlencoded') webpage = self._download_webpage( req, video_id, 'Downloading video page') diff --git a/youtube_dl/extractor/voicerepublic.py b/youtube_dl/extractor/voicerepublic.py index 254383d..93d15a5 100644 --- a/youtube_dl/extractor/voicerepublic.py +++ b/youtube_dl/extractor/voicerepublic.py @@ -3,14 +3,12 @@ from __future__ import unicode_literals import re from .common import InfoExtractor -from ..compat import ( - compat_urllib_request, - compat_urlparse, -) +from ..compat import compat_urlparse from ..utils import ( ExtractorError, determine_ext, int_or_none, + sanitized_Request, ) @@ -37,7 +35,7 @@ class VoiceRepublicIE(InfoExtractor): def _real_extract(self, url): display_id = self._match_id(url) - req = compat_urllib_request.Request( + req = sanitized_Request( compat_urlparse.urljoin(url, '/talks/%s' % display_id)) # Older versions of Firefox get redirected to an "upgrade browser" page req.add_header('User-Agent', 'youtube-dl') diff --git a/youtube_dl/extractor/wistia.py b/youtube_dl/extractor/wistia.py index 13a0791..fdb16d9 100644 --- a/youtube_dl/extractor/wistia.py +++ b/youtube_dl/extractor/wistia.py @@ -1,8 +1,10 @@ from __future__ import unicode_literals from .common import InfoExtractor -from ..compat import compat_urllib_request -from ..utils import ExtractorError +from ..utils import ( + ExtractorError, + sanitized_Request, +) class WistiaIE(InfoExtractor): @@ -23,7 +25,7 @@ class WistiaIE(InfoExtractor): def _real_extract(self, url): video_id = self._match_id(url) - request = compat_urllib_request.Request(self._API_URL.format(video_id)) + request = sanitized_Request(self._API_URL.format(video_id)) request.add_header('Referer', 
url) # Some videos require this. data_json = self._download_json(request, video_id) if data_json.get('error'): diff --git a/youtube_dl/extractor/wsj.py b/youtube_dl/extractor/wsj.py index 2ddf29a..5a89737 100644 --- a/youtube_dl/extractor/wsj.py +++ b/youtube_dl/extractor/wsj.py @@ -84,6 +84,5 @@ class WSJIE(InfoExtractor): 'duration': duration, 'upload_date': upload_date, 'title': title, - 'formats': formats, 'categories': categories, } diff --git a/youtube_dl/extractor/gorillavid.py b/youtube_dl/extractor/xfileshare.py similarity index 79% rename from youtube_dl/extractor/gorillavid.py rename to youtube_dl/extractor/xfileshare.py index d23e3ea..a3236e6 100644 --- a/youtube_dl/extractor/gorillavid.py +++ b/youtube_dl/extractor/xfileshare.py @@ -1,25 +1,23 @@ -# -*- coding: utf-8 -*- +# coding: utf-8 from __future__ import unicode_literals import re from .common import InfoExtractor -from ..compat import ( - compat_urllib_parse, - compat_urllib_request, -) +from ..compat import compat_urllib_parse from ..utils import ( ExtractorError, encode_dict, int_or_none, + sanitized_Request, ) -class GorillaVidIE(InfoExtractor): - IE_DESC = 'GorillaVid.in, daclips.in, movpod.in, fastvideo.in, realvid.net and filehoot.com' +class XFileShareIE(InfoExtractor): + IE_DESC = 'XFileShare based sites: GorillaVid.in, daclips.in, movpod.in, fastvideo.in, realvid.net, filehoot.com and vidto.me' _VALID_URL = r'''(?x) https?://(?P<host>(?:www\.)? - (?:daclips\.in|gorillavid\.in|movpod\.in|fastvideo\.in|realvid\.net|filehoot\.com))/ + (?:daclips\.in|gorillavid\.in|movpod\.in|fastvideo\.in|realvid\.net|filehoot\.com|vidto\.me))/ (?:embed-)?(?P<id>[0-9a-zA-Z]+)(?:-[0-9]+x[0-9]+\.html)? 
''' @@ -76,6 +74,13 @@ class GorillaVidIE(InfoExtractor): 'title': 'youtube-dl test video \'äBaW_jenozKc.mp4.mp4', 'thumbnail': 're:http://.*\.jpg', } + }, { + 'url': 'http://vidto.me/ku5glz52nqe1.html', + 'info_dict': { + 'id': 'ku5glz52nqe1', + 'ext': 'mp4', + 'title': 'test' + } }] def _real_extract(self, url): @@ -99,18 +104,23 @@ class GorillaVidIE(InfoExtractor): post = compat_urllib_parse.urlencode(encode_dict(fields)) - req = compat_urllib_request.Request(url, post) + req = sanitized_Request(url, post) req.add_header('Content-type', 'application/x-www-form-urlencoded') webpage = self._download_webpage(req, video_id, 'Downloading video page') - title = self._search_regex( - [r'style="z-index: [0-9]+;">([^<]+)</span>', r'<td nowrap>([^<]+)</td>', r'>Watch (.+) '], - webpage, 'title', default=None) or self._og_search_title(webpage) + title = (self._search_regex( + [r'style="z-index: [0-9]+;">([^<]+)</span>', + r'<td nowrap>([^<]+)</td>', + r'>Watch (.+) ', + r'<h2 class="video-page-head">([^<]+)</h2>'], + webpage, 'title', default=None) or self._og_search_title(webpage)).strip() video_url = self._search_regex( - r'file\s*:\s*["\'](http[^"\']+)["\'],', webpage, 'file url') + [r'file\s*:\s*["\'](http[^"\']+)["\'],', + r'file_link\s*=\s*\'(https?:\/\/[0-9a-zA-z.\/\-_]+)'], + webpage, 'file url') thumbnail = self._search_regex( - r'image\s*:\s*["\'](http[^"\']+)["\'],', webpage, 'thumbnail', fatal=False) + r'image\s*:\s*["\'](http[^"\']+)["\'],', webpage, 'thumbnail', default=None) formats = [{ 'format_id': 'sd', diff --git a/youtube_dl/extractor/xtube.py b/youtube_dl/extractor/xtube.py index 779e4f4..a1fe240 100644 --- a/youtube_dl/extractor/xtube.py +++ b/youtube_dl/extractor/xtube.py @@ -3,12 +3,10 @@ from __future__ import unicode_literals import re from .common import InfoExtractor -from ..compat import ( - compat_urllib_request, - compat_urllib_parse_unquote, -) +from ..compat import compat_urllib_parse_unquote from ..utils import ( parse_duration, + 
sanitized_Request, str_to_int, ) @@ -32,7 +30,7 @@ class XTubeIE(InfoExtractor): def _real_extract(self, url): video_id = self._match_id(url) - req = compat_urllib_request.Request(url) + req = sanitized_Request(url) req.add_header('Cookie', 'age_verified=1') webpage = self._download_webpage(req, video_id) diff --git a/youtube_dl/extractor/xvideos.py b/youtube_dl/extractor/xvideos.py index 5dcf2fd..710ad50 100644 --- a/youtube_dl/extractor/xvideos.py +++ b/youtube_dl/extractor/xvideos.py @@ -3,14 +3,12 @@ from __future__ import unicode_literals import re from .common import InfoExtractor -from ..compat import ( - compat_urllib_parse_unquote, - compat_urllib_request, -) +from ..compat import compat_urllib_parse_unquote from ..utils import ( clean_html, ExtractorError, determine_ext, + sanitized_Request, ) @@ -48,7 +46,7 @@ class XVideosIE(InfoExtractor): 'url': video_url, }] - android_req = compat_urllib_request.Request(url) + android_req = sanitized_Request(url) android_req.add_header('User-Agent', self._ANDROID_USER_AGENT) android_webpage = self._download_webpage(android_req, video_id, fatal=False) diff --git a/youtube_dl/extractor/yandexmusic.py b/youtube_dl/extractor/yandexmusic.py index 08dc81f..d3cc1a2 100644 --- a/youtube_dl/extractor/yandexmusic.py +++ b/youtube_dl/extractor/yandexmusic.py @@ -8,11 +8,11 @@ from .common import InfoExtractor from ..compat import ( compat_str, compat_urllib_parse, - compat_urllib_request, ) from ..utils import ( int_or_none, float_or_none, + sanitized_Request, ) @@ -154,7 +154,7 @@ class YandexMusicPlaylistIE(YandexMusicPlaylistBaseIE): if len(tracks) < len(track_ids): present_track_ids = set([compat_str(track['id']) for track in tracks if track.get('id')]) missing_track_ids = set(map(compat_str, track_ids)) - set(present_track_ids) - request = compat_urllib_request.Request( + request = sanitized_Request( 'https://music.yandex.ru/handlers/track-entries.jsx', compat_urllib_parse.urlencode({ 'entries': 
','.join(missing_track_ids), diff --git a/youtube_dl/extractor/youku.py b/youtube_dl/extractor/youku.py index 2e81d92..69ecc83 100644 --- a/youtube_dl/extractor/youku.py +++ b/youtube_dl/extractor/youku.py @@ -4,12 +4,13 @@ from __future__ import unicode_literals import base64 from .common import InfoExtractor -from ..utils import ExtractorError - from ..compat import ( compat_urllib_parse, compat_ord, - compat_urllib_request, +) +from ..utils import ( + ExtractorError, + sanitized_Request, ) @@ -187,7 +188,7 @@ class YoukuIE(InfoExtractor): video_id = self._match_id(url) def retrieve_data(req_url, note): - req = compat_urllib_request.Request(req_url) + req = sanitized_Request(req_url) cn_verification_proxy = self._downloader.params.get('cn_verification_proxy') if cn_verification_proxy: diff --git a/youtube_dl/extractor/youporn.py b/youtube_dl/extractor/youporn.py index 9bf8d1e..dd72408 100644 --- a/youtube_dl/extractor/youporn.py +++ b/youtube_dl/extractor/youporn.py @@ -3,9 +3,9 @@ from __future__ import unicode_literals import re from .common import InfoExtractor -from ..compat import compat_urllib_request from ..utils import ( int_or_none, + sanitized_Request, str_to_int, unescapeHTML, unified_strdate, @@ -63,7 +63,7 @@ class YouPornIE(InfoExtractor): video_id = mobj.group('id') display_id = mobj.group('display_id') - request = compat_urllib_request.Request(url) + request = sanitized_Request(url) request.add_header('Cookie', 'age_verified=1') webpage = self._download_webpage(request, display_id) diff --git a/youtube_dl/extractor/youtube.py b/youtube_dl/extractor/youtube.py index 687e0b4..cfe9eed 100644 --- a/youtube_dl/extractor/youtube.py +++ b/youtube_dl/extractor/youtube.py @@ -20,7 +20,6 @@ from ..compat import ( compat_urllib_parse_unquote, compat_urllib_parse_unquote_plus, compat_urllib_parse_urlparse, - compat_urllib_request, compat_urlparse, compat_str, ) @@ -35,6 +34,7 @@ from ..utils import ( orderedSet, parse_duration, remove_start, + 
sanitized_Request, smuggle_url, str_to_int, unescapeHTML, @@ -114,7 +114,7 @@ class YoutubeBaseInfoExtractor(InfoExtractor): login_data = compat_urllib_parse.urlencode(encode_dict(login_form_strs)).encode('ascii') - req = compat_urllib_request.Request(self._LOGIN_URL, login_data) + req = sanitized_Request(self._LOGIN_URL, login_data) login_results = self._download_webpage( req, None, note='Logging in', errnote='unable to log in', fatal=False) @@ -147,7 +147,7 @@ class YoutubeBaseInfoExtractor(InfoExtractor): tfa_data = compat_urllib_parse.urlencode(encode_dict(tfa_form_strs)).encode('ascii') - tfa_req = compat_urllib_request.Request(self._TWOFACTOR_URL, tfa_data) + tfa_req = sanitized_Request(self._TWOFACTOR_URL, tfa_data) tfa_results = self._download_webpage( tfa_req, None, note='Submitting TFA code', errnote='unable to submit tfa', fatal=False) @@ -178,15 +178,13 @@ class YoutubeBaseInfoExtractor(InfoExtractor): return -class YoutubePlaylistBaseInfoExtractor(InfoExtractor): - # Extract the video ids from the playlist pages +class YoutubeEntryListBaseInfoExtractor(InfoExtractor): + # Extract entries from page with "Load more" button def _entries(self, page, playlist_id): more_widget_html = content_html = page for page_num in itertools.count(1): - for video_id, video_title in self.extract_videos_from_page(content_html): - yield self.url_result( - video_id, 'Youtube', video_id=video_id, - video_title=video_title) + for entry in self._process_page(content_html): + yield entry mobj = re.search(r'data-uix-load-more-href="/?(?P<more>[^"]+)"', more_widget_html) if not mobj: @@ -203,6 +201,12 @@ class YoutubePlaylistBaseInfoExtractor(InfoExtractor): break more_widget_html = more['load_more_widget_html'] + +class YoutubePlaylistBaseInfoExtractor(YoutubeEntryListBaseInfoExtractor): + def _process_page(self, content): + for video_id, video_title in self.extract_videos_from_page(content): + yield self.url_result(video_id, 'Youtube', video_id, video_title) + def 
extract_videos_from_page(self, page): ids_in_page = [] titles_in_page = [] @@ -224,6 +228,19 @@ class YoutubePlaylistBaseInfoExtractor(InfoExtractor): return zip(ids_in_page, titles_in_page) +class YoutubePlaylistsBaseInfoExtractor(YoutubeEntryListBaseInfoExtractor): + def _process_page(self, content): + for playlist_id in re.findall(r'href="/?playlist\?list=(.+?)"', content): + yield self.url_result( + 'https://www.youtube.com/playlist?list=%s' % playlist_id, 'YoutubePlaylist') + + def _real_extract(self, url): + playlist_id = self._match_id(url) + webpage = self._download_webpage(url, playlist_id) + title = self._og_search_title(webpage, fatal=False) + return self.playlist_result(self._entries(webpage, playlist_id), playlist_id, title) + + class YoutubeIE(YoutubeBaseInfoExtractor): IE_DESC = 'YouTube.com' _VALID_URL = r"""(?x)^ @@ -409,7 +426,8 @@ class YoutubeIE(YoutubeBaseInfoExtractor): 'title': 'Principal Sexually Assaults A Teacher - Episode 117 - 8th June 2012', 'description': 'md5:09b78bd971f1e3e289601dfba15ca4f7', 'uploader': 'SET India', - 'uploader_id': 'setindia' + 'uploader_id': 'setindia', + 'age_limit': 18, } }, { @@ -546,7 +564,7 @@ class YoutubeIE(YoutubeBaseInfoExtractor): 'info_dict': { 'id': 'lqQg6PlCWgI', 'ext': 'mp4', - 'upload_date': '20120724', + 'upload_date': '20150827', 'uploader_id': 'olympic', 'description': 'HO09 - Women - GER-AUS - Hockey - 31 July 2012 - London 2012 Olympic Games', 'uploader': 'Olympics', @@ -674,7 +692,28 @@ class YoutubeIE(YoutubeBaseInfoExtractor): { 'url': 'http://vid.plus/FlRa-iH7PGw', 'only_matching': True, - } + }, + { + # Title with JS-like syntax "};" (see https://github.com/rg3/youtube-dl/issues/7468) + 'url': 'https://www.youtube.com/watch?v=lsguqyKfVQg', + 'info_dict': { + 'id': 'lsguqyKfVQg', + 'ext': 'mp4', + 'title': '{dark walk}; Loki/AC/Dishonored; collab w/Elflover21', + 'description': 'md5:8085699c11dc3f597ce0410b0dcbb34a', + 'upload_date': '20151119', + 'uploader_id': 'IronSoulElf', + 'uploader': 
'IronSoulElf', + }, + 'params': { + 'skip_download': True, + }, + }, + { + # Tags with '};' (see https://github.com/rg3/youtube-dl/issues/7468) + 'url': 'https://www.youtube.com/watch?v=Ms7iBXnlUO8', + 'only_matching': True, + }, ] def __init__(self, *args, **kwargs): @@ -858,16 +897,33 @@ class YoutubeIE(YoutubeBaseInfoExtractor): return {} return sub_lang_list + def _get_ytplayer_config(self, video_id, webpage): + patterns = ( + # User data may contain arbitrary character sequences that may affect + # JSON extraction with regex, e.g. when '};' is contained the second + # regex won't capture the whole JSON. Yet working around by trying more + # concrete regex first keeping in mind proper quoted string handling + # to be implemented in future that will replace this workaround (see + # https://github.com/rg3/youtube-dl/issues/7468, + # https://github.com/rg3/youtube-dl/pull/7599) + r';ytplayer\.config\s*=\s*({.+?});ytplayer', + r';ytplayer\.config\s*=\s*({.+?});', + ) + config = self._search_regex( + patterns, webpage, 'ytplayer.config', default=None) + if config: + return self._parse_json( + uppercase_escape(config), video_id, fatal=False) + def _get_automatic_captions(self, video_id, webpage): """We need the webpage for getting the captions url, pass it as an argument to speed up the process.""" self.to_screen('%s: Looking for automatic captions' % video_id) - mobj = re.search(r';ytplayer.config = ({.*?});', webpage) + player_config = self._get_ytplayer_config(video_id, webpage) err_msg = 'Couldn\'t find automatic captions for %s' % video_id - if mobj is None: + if not player_config: self._downloader.report_warning(err_msg) return {} - player_config = json.loads(mobj.group(1)) try: args = player_config['args'] caption_url = args['ttsurl'] @@ -1074,10 +1130,8 @@ class YoutubeIE(YoutubeBaseInfoExtractor): age_gate = False video_info = None # Try looking directly into the video webpage - mobj = re.search(r';ytplayer\.config\s*=\s*({.*?});', video_webpage) - if mobj: 
- json_code = uppercase_escape(mobj.group(1)) - ytplayer_config = json.loads(json_code) + ytplayer_config = self._get_ytplayer_config(video_id, video_webpage) + if ytplayer_config: args = ytplayer_config['args'] if args.get('url_encoded_fmt_stream_map'): # Convert to the same format returned by compat_parse_qs @@ -1615,7 +1669,7 @@ class YoutubePlaylistIE(YoutubeBaseInfoExtractor, YoutubePlaylistBaseInfoExtract self.report_warning('Youtube gives an alert message: ' + match) playlist_title = self._html_search_regex( - r'(?s)<h1 class="pl-header-title[^"]*">\s*(.*?)\s*</h1>', + r'(?s)<h1 class="pl-header-title[^"]*"[^>]*>\s*(.*?)\s*</h1>', page, 'title') return self.playlist_result(self._entries(page, playlist_id), playlist_id, playlist_title) @@ -1742,6 +1796,29 @@ class YoutubeUserIE(YoutubeChannelIE): return super(YoutubeUserIE, cls).suitable(url) +class YoutubeUserPlaylistsIE(YoutubePlaylistsBaseInfoExtractor): + IE_DESC = 'YouTube.com user playlists' + _VALID_URL = r'https?://(?:\w+\.)?youtube\.com/user/(?P<id>[^/]+)/playlists' + IE_NAME = 'youtube:user:playlists' + + _TESTS = [{ + 'url': 'http://www.youtube.com/user/ThirstForScience/playlists', + 'playlist_mincount': 4, + 'info_dict': { + 'id': 'ThirstForScience', + 'title': 'Thirst for Science', + }, + }, { + # with "Load more" button + 'url': 'http://www.youtube.com/user/igorkle1/playlists?view=1&sort=dd', + 'playlist_mincount': 70, + 'info_dict': { + 'id': 'igorkle1', + 'title': 'Игорь Клейнер', + }, + }] + + class YoutubeSearchIE(SearchInfoExtractor, YoutubePlaylistIE): IE_DESC = 'YouTube.com searches' # there doesn't appear to be a real limit, for example if you search for @@ -1837,7 +1914,7 @@ class YoutubeSearchURLIE(InfoExtractor): } -class YoutubeShowIE(InfoExtractor): +class YoutubeShowIE(YoutubePlaylistsBaseInfoExtractor): IE_DESC = 'YouTube.com (multi-season) shows' _VALID_URL = r'https?://www\.youtube\.com/show/(?P<id>[^?#]*)' IE_NAME = 'youtube:show' @@ -1851,26 +1928,9 @@ class 
YoutubeShowIE(InfoExtractor): }] def _real_extract(self, url): - mobj = re.match(self._VALID_URL, url) - playlist_id = mobj.group('id') - webpage = self._download_webpage( - 'https://www.youtube.com/show/%s/playlists' % playlist_id, playlist_id, 'Downloading show webpage') - # There's one playlist for each season of the show - m_seasons = list(re.finditer(r'href="(/playlist\?list=.*?)"', webpage)) - self.to_screen('%s: Found %s seasons' % (playlist_id, len(m_seasons))) - entries = [ - self.url_result( - 'https://www.youtube.com' + season.group(1), 'YoutubePlaylist') - for season in m_seasons - ] - title = self._og_search_title(webpage, fatal=False) - - return { - '_type': 'playlist', - 'id': playlist_id, - 'title': title, - 'entries': entries, - } + playlist_id = self._match_id(url) + return super(YoutubeShowIE, self)._real_extract( + 'https://www.youtube.com/show/%s/playlists' % playlist_id) class YoutubeFeedsInfoExtractor(YoutubeBaseInfoExtractor): diff --git a/youtube_dl/jsinterp.py b/youtube_dl/jsinterp.py index 9bc8551..2191e8b 100644 --- a/youtube_dl/jsinterp.py +++ b/youtube_dl/jsinterp.py @@ -214,7 +214,7 @@ class JSInterpreter(object): obj = {} obj_m = re.search( (r'(?:var\s+)?%s\s*=\s*\{' % re.escape(objname)) + - r'\s*(?P<fields>([a-zA-Z$0-9]+\s*:\s*function\(.*?\)\s*\{.*?\})*)' + + r'\s*(?P<fields>([a-zA-Z$0-9]+\s*:\s*function\(.*?\)\s*\{.*?\}(?:,\s*)?)*)' + r'\}\s*;', self.code) fields = obj_m.group('fields') diff --git a/youtube_dl/options.py b/youtube_dl/options.py index 3dd6d29..359e8d3 100644 --- a/youtube_dl/options.py +++ b/youtube_dl/options.py @@ -338,7 +338,7 @@ def parseOpts(overrideArguments=None): video_format.add_option( '-F', '--list-formats', action='store_true', dest='listformats', - help='List all available formats') + help='List all available formats of specified videos') video_format.add_option( '--youtube-include-dash-manifest', action='store_true', dest='youtube_include_dash_manifest', default=True, @@ -363,7 +363,7 @@ def 
parseOpts(overrideArguments=None): subtitles.add_option( '--write-auto-sub', '--write-automatic-sub', action='store_true', dest='writeautomaticsub', default=False, - help='Write automatic subtitle file (YouTube only)') + help='Write automatically generated subtitle file (YouTube only)') subtitles.add_option( '--all-subs', action='store_true', dest='allsubtitles', default=False, diff --git a/youtube_dl/update.py b/youtube_dl/update.py index fc7ac83..074eb64 100644 --- a/youtube_dl/update.py +++ b/youtube_dl/update.py @@ -9,11 +9,8 @@ import subprocess import sys from zipimport import zipimporter -from .compat import ( - compat_str, - compat_urllib_request, -) -from .utils import make_HTTPS_handler +from .compat import compat_str + from .version import __version__ @@ -47,7 +44,7 @@ def rsa_verify(message, signature, key): return True -def update_self(to_screen, verbose): +def update_self(to_screen, verbose, opener): """Update the program file with the latest version from the repository""" UPDATE_URL = "https://rg3.github.io/youtube-dl/update/" @@ -59,9 +56,6 @@ def update_self(to_screen, verbose): to_screen('It looks like you installed youtube-dl with a package manager, pip, setup.py or a tarball. 
Please use that to update.') return - https_handler = make_HTTPS_handler({}) - opener = compat_urllib_request.build_opener(https_handler) - # Check if there is a new version try: newversion = opener.open(VERSION_URL).read().decode('utf-8').strip() diff --git a/youtube_dl/utils.py b/youtube_dl/utils.py index d39f313..d7b737e 100644 --- a/youtube_dl/utils.py +++ b/youtube_dl/utils.py @@ -373,6 +373,13 @@ def sanitize_path(s): return os.path.join(*sanitized_path) +# Prepend protocol-less URLs with `http:` scheme in order to mitigate the number of +# unwanted failures due to missing protocol +def sanitized_Request(url, *args, **kwargs): + return compat_urllib_request.Request( + 'http:%s' % url if url.startswith('//') else url, *args, **kwargs) + + def orderedSet(iterable): """ Remove all duplicates from the input iterable """ res = [] @@ -396,10 +403,14 @@ def _htmlentity_transform(entity): numstr = '0%s' % numstr else: base = 10 - return compat_chr(int(numstr, base)) + # See https://github.com/rg3/youtube-dl/issues/7518 + try: + return compat_chr(int(numstr, base)) + except ValueError: + pass # Unknown entity in name, return its literal representation - return ('&%s;' % entity) + return '&%s;' % entity def unescapeHTML(s): @@ -921,6 +932,21 @@ def determine_ext(url, default_ext='unknown_video'): guess = url.partition('?')[0].rpartition('.')[2] if re.match(r'^[A-Za-z0-9]+$', guess): return guess + elif guess.rstrip('/') in ( + 'mp4', 'm4a', 'm4p', 'm4b', 'm4r', 'm4v', 'aac', + 'flv', 'f4v', 'f4a', 'f4b', + 'webm', 'ogg', 'ogv', 'oga', 'ogx', 'spx', 'opus', + 'mkv', 'mka', 'mk3d', + 'avi', 'divx', + 'mov', + 'asf', 'wmv', 'wma', + '3gp', '3g2', + 'mp3', + 'flac', + 'ape', + 'wav', + 'f4f', 'f4m', 'm3u8', 'smil'): + return guess.rstrip('/') else: return default_ext @@ -1664,7 +1690,9 @@ def urlencode_postdata(*args, **kargs): def encode_dict(d, encoding='utf-8'): - return dict((k.encode(encoding), v.encode(encoding)) for k, v in d.items()) + def encode(v): + return 
v.encode(encoding) if isinstance(v, compat_basestring) else v
+    return dict((encode(k), encode(v)) for k, v in d.items())
 
 
 US_RATINGS = {
diff --git a/youtube_dl/version.py b/youtube_dl/version.py
index b3d2540..bd0de9f 100644
--- a/youtube_dl/version.py
+++ b/youtube_dl/version.py
@@ -1,3 +1,3 @@
 from __future__ import unicode_literals
 
-__version__ = '2015.11.10'
+__version__ = '2015.11.27.1'
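
The youtube_dl/utils.py hunks in this diff add `sanitized_Request` and change `encode_dict` so that it encodes only string values. The following is a minimal standalone sketch of both behaviours, not the upstream code: it assumes Python 3 and substitutes `urllib.request.Request` for `compat_urllib_request.Request` and `str` for `compat_basestring`. The lowercase name `sanitized_request` and the `example.com` URLs are illustrative only.

```python
from urllib.request import Request


def sanitized_request(url, *args, **kwargs):
    # Assumption: stands in for sanitized_Request from the diff, using the
    # Python 3 standard library instead of youtube-dl's compat layer.
    # Protocol-relative URLs ("//host/path") get an "http:" scheme prepended,
    # because Request() rejects a URL that has no scheme.
    if url.startswith('//'):
        url = 'http:%s' % url
    return Request(url, *args, **kwargs)


def encode_dict(d, encoding='utf-8'):
    # Assumption: str replaces compat_basestring (Python 3 only).
    # Text keys and values are encoded; anything else (ints, bytes, None)
    # is passed through unchanged instead of raising AttributeError.
    def encode(v):
        return v.encode(encoding) if isinstance(v, str) else v
    return dict((encode(k), encode(v)) for k, v in d.items())


if __name__ == '__main__':
    req = sanitized_request('//example.com/video', b'op=download1')
    print(req.full_url)
    print(encode_dict({'op': 'download1', 'id': 42}))
```

Wrapping the `Request` constructor keeps every extractor call site a one-word change, which matches how the diff replaces `compat_urllib_request.Request(...)` with `sanitized_Request(...)` throughout.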