
Programming/Web Crawling (14)
Practice: crawling the contents of an iframe

# Find the target iframe in the browser developer tools and request its src URL directly.
import requests as req
from bs4 import BeautifulSoup as bs
import pandas as pd

url = 'https://movie.naver.com'
url_sub = '/movie/bi/mi/pointWriteFormList.nhn?code=181381&type=after&isActualPointWriteExecute=false&isMileageSubscriptionAlready=false&isMileageSubscriptionReject=false'
url_final = url + url_sub
res = req.get(url_final)
soup = bs(res.con..
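The review list on the movie page is rendered inside an iframe, so the request above goes straight to the iframe's src URL rather than the parent page. The original snippet is cut off after the BeautifulSoup call; the sketch below shows one way it might continue, and the CSS selector for the review paragraphs is my own assumption rather than something confirmed by the post.

# Minimal sketch, assuming the truncated code goes on to parse review text from the iframe document.
import requests as req
from bs4 import BeautifulSoup as bs

iframe_src = ('https://movie.naver.com/movie/bi/mi/pointWriteFormList.nhn'
              '?code=181381&type=after&isActualPointWriteExecute=false'
              '&isMileageSubscriptionAlready=false&isMileageSubscriptionReject=false')
res = req.get(iframe_src)
soup = bs(res.content, 'lxml')

# Hypothetical selector for the review paragraphs; verify it in the developer tools.
reviews = [p.get_text(strip=True) for p in soup.select('div.score_reple > p')]
print(reviews[:3])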
tqdm for showing progress

from tqdm import tqdm_notebook

movie_date = []
movie_title = []
movie_rate = []

for day in tqdm_notebook(days):
    url = "https://movie.naver.com/movie/sdb/rank/rmovie.nhn?sel=cur&date="+day
    res = req.get(url)
    soup = bs(res.content, 'lxml')
    title = soup.select('div.tit5 > a')
    rate = soup.find_all('td',class_='point')
    for index in range(len(title)):
        movie_date.append(day)
        movie_title.append(title[in..
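Wrapping the iterable in tqdm gives a live progress bar for the date loop. A small self-contained sketch follows; note that in recent tqdm releases, tqdm.auto.tqdm (or tqdm.notebook.tqdm) is the recommended replacement for the older tqdm_notebook.

from tqdm.auto import tqdm
import time

# Small sample range just to demonstrate the progress bar.
days = [str(d) for d in range(20191201, 20191206)]
for day in tqdm(days):
    time.sleep(0.1)  # stands in for the per-date request and parsing work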
Collecting a month of movie ratings

import requests as req
from bs4 import BeautifulSoup as bs
import pandas as pd

movie_date = []
movie_title = []
movie_rate = []

for day in range(20191201,20191226,1):
    url = "https://movie.naver.com/movie/sdb/rank/rmovie.nhn?sel=cur&tg=0&date="+str(day)
    res = req.get(url)
    soup = bs(res.content, 'lxml')
    title_list = soup.select('div.tit5 > a')
    rate_list = soup.find_all('td',class_='point')
    for ind..
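The loop above builds date strings by counting raw integers, which only works inside a single month. As a hedged alternative, pandas.date_range produces the same YYYYMMDD strings and also handles month boundaries correctly:

import pandas as pd

days = [d.strftime('%Y%m%d') for d in pd.date_range('2019-12-01', '2019-12-25')]
print(days[0], '...', days[-1])  # '20191201' ... '20191225'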
Collecting titles and ratings from the movie-ranking page

import requests as req
from bs4 import BeautifulSoup as bs
import pandas as pd

url = 'https://movie.naver.com/movie/sdb/rank/rmovie.nhn?sel=cur&date=20191228'
res = req.get(url)
# Parser options: lxml, html.parser, html5lib
soup = bs(res.content, 'lxml')
name = soup.select('div.tit5 > a')
rate = soup.find_all('td',class_='point')
len(name),len(rate)

# Collect rank, movie title, rating
rank_list = []
name_list = []
rating_l..
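The snippet is cut off just as the rank/title/rating lists are being filled. The sketch below shows one plausible continuation, pairing each title with its rating and loading the result into a DataFrame; the column names are my own choices for illustration.

import requests as req
from bs4 import BeautifulSoup as bs
import pandas as pd

url = 'https://movie.naver.com/movie/sdb/rank/rmovie.nhn?sel=cur&date=20191228'
soup = bs(req.get(url).content, 'lxml')
name = soup.select('div.tit5 > a')
rate = soup.find_all('td', class_='point')

rows = []
for i, (n, r) in enumerate(zip(name, rate), start=1):
    rows.append({'rank': i,
                 'title': n.get_text(strip=True),
                 'rating': float(r.get_text(strip=True))})

df = pd.DataFrame(rows)
df.head()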
Three encoding options

# Use one of the three encodings below.
music.to_csv('music50.csv', encoding='euc-kr')
music.to_csv('music50_utf8.csv', encoding='utf-8-sig')
music.to_csv('music50_utf8.csv', encoding='')
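Whichever encoding is used to write the CSV must also be used to read it back. A small self-contained sketch with stand-in data follows; the third to_csv call's encoding is left blank in the original, and while cp949 would be a common Korean-locale choice, that is only an assumption on my part.

import pandas as pd

music = pd.DataFrame({'rank': [1, 2], 'title': ['a', 'b']})  # stand-in data
music.to_csv('music50.csv', encoding='euc-kr', index=False)
music.to_csv('music50_utf8.csv', encoding='utf-8-sig', index=False)

pd.read_csv('music50.csv', encoding='euc-kr')
pd.read_csv('music50_utf8.csv', encoding='utf-8-sig')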
Collecting the music TOP 50

import requests as req
from bs4 import BeautifulSoup as bs

url = 'https://music.naver.com/listen/top100.nhn?domain=TOTAL'
res = req.get(url)
soup = bs(res.text,'lxml')

# select(CSS selector): searches for multiple elements and returns a list
#   --> same as find_all()
# select_one(CSS selector): returns a single element
#   --> same as find()
rank_list = soup.find_all('td',class_='ranking')
name_list = soup.select('a._title > span')
artist_list = soup.select('td.artis..
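The select()/find_all() equivalence noted in the comments is easy to verify offline. A self-contained sketch using a small inline HTML snippet instead of the live page:

from bs4 import BeautifulSoup as bs

html = """
<table>
  <tr><td class="ranking">1</td><td class="artist">Artist A</td></tr>
  <tr><td class="ranking">2</td><td class="artist">Artist B</td></tr>
</table>
"""
soup = bs(html, 'lxml')

print(soup.find_all('td', class_='ranking'))  # list of matching elements
print(soup.select('td.ranking'))              # same result via a CSS selector
print(soup.select_one('td.ranking'))          # single element, like find()
print(soup.find('td', class_='ranking'))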