[youtube downloader] Crawling YouTube

2021. 10. 11. 23:09

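Recently I had to do some crawling for the first time in a while, and this time the job was downloading YouTube videos.

After a quick search I found a module called pytube that makes this task very easy, so I want to introduce it here.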

Environment

OS: Ubuntu 20.04

Language: Python 3.7.11

required modules:

pytube 11.0.1

requests-html 0.10.0

pytube & requests-html

pytube and requests-html can be installed easily with the commands below.

pip install pytube
pip install requests-html

pytube handles downloading YouTube streams, and requests-html is a module that makes parsing HTML easy.

Example

import http.client
from requests_html import HTMLSession
import pytube
from pytube.cli import on_progress

def download(url, save_dir="./downloads"):
    # Create the YouTube object. on_progress_callback runs each time a chunk of the
    # video stream has been downloaded; here it draws a progress bar.
    yt = pytube.YouTube(url, on_progress_callback=on_progress)
    # progressive=True keeps streams that carry video and audio in a single file;
    # file_extension="mp4" keeps only MP4 files. Pick the highest resolution.
    stream = yt.streams.filter(progressive=True, file_extension="mp4").order_by("resolution").desc().first()
    return stream.download(save_dir)  # Download and return the saved file path.

if __name__ == '__main__':
    s = HTMLSession()
    urls = ['https://youtube.com/playlist?list=PLWo1h5t1i9PHtpcwXa04EWQri_ZZeTFGY']
    total = 0
    i = 0
    save_dir = './downloads'
    for url in urls:
        r = s.get(url)  # Send a GET request to the target URL.
        # Render the fetched page (runs its JavaScript) and scroll down 10 times.
        r.html.render(sleep=0, keep_page=True, scrolldown=10)
        # Find the elements for the video titles listed on the page.
        elements = r.html.find('a#video-title')
        total += len(elements)
        for element in elements:
            link = next(iter(element.absolute_links))  # URL of one listed video.
            print('link:{}'.format(link))
            try:
                download(link, save_dir)  # Download the video.
                i += 1
                print(f' number of videos:{i}/{total}')
            except http.client.IncompleteRead:
                print('fail: {} \n'.format(link))

        print(f'total videos:{i}/{total}')

The example above is the code I actually used. I commented each part, so it should be easy to read through.
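The progress callback mentioned in the comments is called with a stream, the chunk just received, and the number of bytes still remaining. As a rough sketch of what a custom callback could look like in place of on_progress, the snippet below computes a percentage from those values. The names percent_done and print_progress are hypothetical, and the demonstration uses a stand-in object rather than a real pytube stream so it runs offline.

```python
from types import SimpleNamespace


def percent_done(filesize: int, bytes_remaining: int) -> float:
    """Return download progress as a percentage between 0 and 100."""
    if filesize <= 0:
        return 0.0
    received = filesize - bytes_remaining
    return 100.0 * received / filesize


def print_progress(stream, chunk, bytes_remaining):
    """Hypothetical callback taking (stream, chunk, bytes_remaining)."""
    print(f"{percent_done(stream.filesize, bytes_remaining):5.1f}% downloaded")


# Offline demonstration: a stand-in object with only a filesize attribute.
fake_stream = SimpleNamespace(filesize=1000)
print_progress(fake_stream, b"", 250)
```

To try it with real downloads, it should be enough to pass on_progress_callback=print_progress when creating pytube.YouTube, the same way on_progress is passed in the example.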

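In the code, r.html.find('a#video-title') finds and returns the HTML elements that hold the information for each video title listed on the playlist page. In other words, for a listed title such as '푸드얍 15초 공식 광고', the method returns the element carrying that video's information.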
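To make the selector concrete: a#video-title matches a tags whose id is video-title. The sketch below imitates that lookup with only the standard library on a small hand-written HTML snippet. It is not how requests-html works internally, and the class name, function name, sample markup, and video IDs are all made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class VideoTitleLinks(HTMLParser):
    """Collect absolute URLs of <a id="video-title"> elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Same condition as the CSS selector a#video-title, plus a usable href.
        if tag == "a" and attrs.get("id") == "video-title" and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))


def find_video_links(html, base_url="https://www.youtube.com/"):
    parser = VideoTitleLinks(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical markup; the real playlist page is far larger.
sample_html = """
<a id="video-title" href="/watch?v=aaa">First video</a>
<a id="thumbnail" href="/watch?v=aaa">thumb</a>
<a id="video-title" href="/watch?v=bbb">Second video</a>
"""
print(find_video_links(sample_html))
```

The relative href values are turned into full URLs here, which is roughly the role absolute_links plays in the example code.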

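In the line yt.streams.filter(progressive=True, file_extension="mp4").order_by("resolution").desc().first(),

the order_by("resolution").desc().first() part means that, among the streams left after the filter, the one with the highest resolution is downloaded.

In other words, it picks the highest quality that can be selected in YouTube's quality settings for the video, which was 720p in my case. Because the filter keeps only progressive MP4 streams, this is the best of those streams rather than of every stream YouTube offers.

Running the code above saves the videos under the ./downloads directory.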
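The selection logic can be pictured with plain Python. The sketch below is only an illustration using dictionaries as stand-ins for streams, not pytube's own code; the function names and the sample data are assumptions. It filters, sorts by the numeric part of the resolution in descending order, and takes the first item.

```python
def resolution_value(resolution: str) -> int:
    """Turn a label such as '720p' into the integer 720."""
    return int("".join(ch for ch in resolution if ch.isdigit()))


def pick_highest(streams, extension="mp4"):
    """Return the highest-resolution progressive stream, or None if none match."""
    candidates = [
        s for s in streams
        if s["progressive"] and s["extension"] == extension and s["resolution"]
    ]
    if not candidates:
        return None
    # Descending order, then the first element, like order_by(...).desc().first().
    ordered = sorted(
        candidates,
        key=lambda s: resolution_value(s["resolution"]),
        reverse=True,
    )
    return ordered[0]


# Hypothetical stream list for one video.
streams = [
    {"resolution": "360p", "extension": "mp4", "progressive": True},
    {"resolution": "1080p", "extension": "mp4", "progressive": False},
    {"resolution": "720p", "extension": "mp4", "progressive": True},
    {"resolution": "720p", "extension": "webm", "progressive": False},
]
best = pick_highest(streams)
print(best["resolution"])
```

Sorting by the numeric value matters because, compared as plain strings, "1080p" would sort before "720p". In this sample the 1080p entry is skipped since it is not progressive, so the 720p MP4 stream is chosen.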
