개발은 처음이라 개발새발

[selenium] Crawling with Selenium - Naver Football Standings, Final Chapter


leon_choi 2022. 6. 6. 15:14
from selenium import webdriver
import pandas as pd
    
#open webdriver
chrome_driver = './chromedriver.exe'
driver = webdriver.Chrome(chrome_driver)

df_bundes_team = pd.DataFrame(columns = ['rank', 'team', 'game', 'win_pt', 'win', 'draw', 
                                         'lose', 'gf', 'ga', 'goal_diff'])

bundes_football = "https://sports.news.naver.com/wfootball/record/index?category=bundesliga&tab=team"
driver.get(bundes_football)
driver.implicitly_wait(3)

tr_len = len(driver.find_elements_by_css_selector('#wfootballTeamRecordBody > table > tbody > tr'))
print(tr_len)

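Last time we got as far as finding the length of the table, which is the first thing we need in order to loop over it. The second thing we need is the tag (selector) for each item that goes into a column. You can get each one with Copy selector, as explained in Part 2.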
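A note for newer Selenium versions: the `find_element_by_*` and `find_elements_by_*` helpers used in this series were removed in Selenium 4.3, where the same lookup is written `driver.find_elements(By.CSS_SELECTOR, selector)`. The sketch below shows the row count in that style. `count_rows` and `FakeDriver` are hypothetical names, not part of the original code; the fake driver exists only so the snippet runs without Chrome, and with a real driver you would pass `By.CSS_SELECTOR` imported from `selenium.webdriver.common.by`.

```python
# By.CSS_SELECTOR in selenium.webdriver.common.by is the plain string
# "css selector"; it is spelled out here so the sketch runs without Selenium.
CSS_SELECTOR = "css selector"

ROW_SELECTOR = "#wfootballTeamRecordBody > table > tbody > tr"


def count_rows(driver):
    """Return the number of <tr> rows found in the standings table."""
    return len(driver.find_elements(CSS_SELECTOR, ROW_SELECTOR))


class FakeDriver:
    """Stand-in for a real WebDriver (assumption: the table has 18 rows)."""

    def find_elements(self, by, value):
        # Only the standings-row selector matches anything in this fake page.
        if (by, value) == (CSS_SELECTOR, ROW_SELECTOR):
            return ["row"] * 18
        return []


tr_len = count_rows(FakeDriver())
print(tr_len)
```

With a real browser session, `count_rows(driver)` would take the place of the `len(driver.find_elements_by_css_selector(...))` line.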

 

Now let's grab the Copy selector output for FC Bayern Munich's rank, team, games played, points, wins, draws, losses, goals for, goals against, and goal difference.

rank = driver.find_element_by_css_selector('#wfootballTeamRecordBody > table > tbody > tr:nth-child(1) > td:nth-child(1) > div > strong')
team = driver.find_element_by_css_selector('#wfootballTeamRecordBody > table > tbody > tr:nth-child(1) > td:nth-child(2) > div > span.name')
game = driver.find_element_by_css_selector('#wfootballTeamRecordBody > table > tbody > tr:nth-child(1) > td:nth-child(3) > div > span')
win_pt = driver.find_element_by_css_selector('#wfootballTeamRecordBody > table > tbody > tr:nth-child(1) > td:nth-child(4) > div > span')
win = driver.find_element_by_css_selector('#wfootballTeamRecordBody > table > tbody > tr:nth-child(1) > td:nth-child(5) > div > span')
draw = driver.find_element_by_css_selector('#wfootballTeamRecordBody > table > tbody > tr:nth-child(1) > td:nth-child(6) > div > span')
lose = driver.find_element_by_css_selector('#wfootballTeamRecordBody > table > tbody > tr:nth-child(1) > td:nth-child(7) > div > span')
gf = driver.find_element_by_css_selector('#wfootballTeamRecordBody > table > tbody > tr:nth-child(1) > td:nth-child(8) > div > span')
ga = driver.find_element_by_css_selector('#wfootballTeamRecordBody > table > tbody > tr:nth-child(1) > td:nth-child(9) > div > span')
goal_diff = driver.find_element_by_css_selector('#wfootballTeamRecordBody > table > tbody > tr:nth-child(1) > td:nth-child(10) > div > span')

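That gives us all ten. Looking at the pattern, tr:nth-child(1) is the same in every selector, while td:nth-child() runs from 1 to 10. In other words, td:nth-child() picks the column, and the "tr:nth-child()" part is what changes with the row's position in the table. If we had copied Borussia Dortmund instead, it would have been "tr:nth-child(2)~". So when we write the loop, all we need to do is put a variable into the "tr:nth-child()" part.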
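The pattern above can be captured in a small helper. This is a minimal sketch rather than the code used in this post: `cell_selector` and `BASE` are hypothetical names, and it assumes every cell follows the `td:nth-child(n) > div > ...` shape seen in the copied selectors.

```python
# Hypothetical helper: rebuilds a copied selector for any row and column.
BASE = "#wfootballTeamRecordBody > table > tbody"


def cell_selector(row, col, leaf="span"):
    """Return the CSS selector for one table cell (row and col are 1-based)."""
    return f"{BASE} > tr:nth-child({row}) > td:nth-child({col}) > div > {leaf}"


# Row 1 is the first team; the rank column ends in `strong`, not `span`.
print(cell_selector(1, 1, leaf="strong"))
# Row 2 is the second team; column 3 is the games-played cell.
print(cell_selector(2, 3))
```

Only the row number changes from team to team, which is why a single loop variable is enough.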

 

So let's write the loop.

from selenium import webdriver
import pandas as pd
    
#open webdriver
chrome_driver = './chromedriver.exe'
driver = webdriver.Chrome(chrome_driver)

df_bundes_team = pd.DataFrame(columns = ['rank', 'team', 'game', 'win_pt', 'win', 'draw', 
                                         'lose', 'gf', 'ga', 'goal_diff'])

bundes_football = "https://sports.news.naver.com/wfootball/record/index?category=bundesliga&tab=team"
driver.get(bundes_football)
driver.implicitly_wait(3)

tr_len = len(driver.find_elements_by_css_selector('#wfootballTeamRecordBody > table > tbody > tr'))
print(tr_len)

for i in range(1, tr_len + 1):
    rank = driver.find_element_by_css_selector('#wfootballTeamRecordBody > table > tbody > tr:nth-child(i) > td:nth-child(1) > div > strong').text
    team = driver.find_element_by_css_selector('#wfootballTeamRecordBody > table > tbody > tr:nth-child(i) > td:nth-child(2) > div > span.name').text
    game = driver.find_element_by_css_selector('#wfootballTeamRecordBody > table > tbody > tr:nth-child(i) > td:nth-child(3) > div > span').text
    win_pt = driver.find_element_by_css_selector('#wfootballTeamRecordBody > table > tbody > tr:nth-child(i) > td:nth-child(4) > div > span').text
    win = driver.find_element_by_css_selector('#wfootballTeamRecordBody > table > tbody > tr:nth-child(i) > td:nth-child(5) > div > span').text
    draw = driver.find_element_by_css_selector('#wfootballTeamRecordBody > table > tbody > tr:nth-child(i) > td:nth-child(6) > div > span').text
    lose = driver.find_element_by_css_selector('#wfootballTeamRecordBody > table > tbody > tr:nth-child(i) > td:nth-child(7) > div > span').text
    gf = driver.find_element_by_css_selector('#wfootballTeamRecordBody > table > tbody > tr:nth-child(i) > td:nth-child(8) > div > span').text
    ga = driver.find_element_by_css_selector('#wfootballTeamRecordBody > table > tbody > tr:nth-child(i) > td:nth-child(9) > div > span').text
    goal_diff = driver.find_element_by_css_selector('#wfootballTeamRecordBody > table > tbody > tr:nth-child(i) > td:nth-child(10) > div > span').text

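Here is a first attempt at the loop, but it has a mistake: "tr:nth-child(i)". Using i is the right idea, but this code ignores types. The i inside the quotes is just the letter i, not the loop variable, so every iteration sends the same invalid selector. The loop variable i is an integer while the selector is a string, so to get the value in we have to convert i to a string and join it to the text on either side of the parentheses.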
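To see the difference, compare the strings in this small sketch. No browser is needed because it only builds selector text, and the shortened `tr:nth-child(...)` selector is used purely for illustration.

```python
i = 3  # pretend we are on the third loop iteration

wrong = 'tr:nth-child(i)'                 # the letter "i", never replaced
concat = 'tr:nth-child(' + str(i) + ')'   # convert the int, then join the pieces
fstring = f'tr:nth-child({i})'            # an f-string gives the same result

print(wrong)
print(concat)
print(fstring)

# Joining a str and an int directly raises TypeError, which is why str() is needed.
try:
    'tr:nth-child(' + i + ')'
except TypeError as err:
    print('TypeError:', err)
```

The `+ str(i) +` form is the one used in the rest of this post; the f-string is simply a shorter way to write the same thing.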

from selenium import webdriver
import pandas as pd
    
#open webdriver
chrome_driver = './chromedriver.exe'
driver = webdriver.Chrome(chrome_driver)

df_bundes_team = pd.DataFrame(columns = ['rank', 'team', 'game', 'win_pt', 'win', 'draw', 
                                         'lose', 'gf', 'ga', 'goal_diff'])

bundes_football = "https://sports.news.naver.com/wfootball/record/index?category=bundesliga&tab=team"
driver.get(bundes_football)
driver.implicitly_wait(3)

tr_len = len(driver.find_elements_by_css_selector('#wfootballTeamRecordBody > table > tbody > tr'))
print(tr_len)

for i in range(1, tr_len + 1):
    rank = driver.find_element_by_css_selector('#wfootballTeamRecordBody > table > tbody > tr:nth-child(' + str(i) + ') > td:nth-child(1) > div > strong').text
    team = driver.find_element_by_css_selector('#wfootballTeamRecordBody > table > tbody > tr:nth-child(' + str(i) + ') > td:nth-child(2) > div > span.name').text
    game = driver.find_element_by_css_selector('#wfootballTeamRecordBody > table > tbody > tr:nth-child(' + str(i) + ') > td:nth-child(3) > div > span').text
    win_pt = driver.find_element_by_css_selector('#wfootballTeamRecordBody > table > tbody > tr:nth-child(' + str(i) + ') > td:nth-child(4) > div > span').text
    win = driver.find_element_by_css_selector('#wfootballTeamRecordBody > table > tbody > tr:nth-child(' + str(i) + ') > td:nth-child(5) > div > span').text
    draw = driver.find_element_by_css_selector('#wfootballTeamRecordBody > table > tbody > tr:nth-child(' + str(i) + ') > td:nth-child(6) > div > span').text
    lose = driver.find_element_by_css_selector('#wfootballTeamRecordBody > table > tbody > tr:nth-child(' + str(i) + ') > td:nth-child(7) > div > span').text
    gf = driver.find_element_by_css_selector('#wfootballTeamRecordBody > table > tbody > tr:nth-child(' + str(i) + ') > td:nth-child(8) > div > span').text
    ga = driver.find_element_by_css_selector('#wfootballTeamRecordBody > table > tbody > tr:nth-child(' + str(i) + ') > td:nth-child(9) > div > span').text
    goal_diff = driver.find_element_by_css_selector('#wfootballTeamRecordBody > table > tbody > tr:nth-child(' + str(i) + ') > td:nth-child(10) > div > span').text

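Putting ' + str(i) + ' inside tr:nth-child() converts i to a string and joins it to the surrounding parentheses. Next, we need to put the scraped data into the df_bundes_team DataFrame. For that we use df.append(), passing the values extracted in each iteration under the matching column names. With that, everything is done. Here is the finished code, which prints each row as it is read and then the whole DataFrame.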

 

from selenium import webdriver
import pandas as pd
    
#open webdriver
chrome_driver = './chromedriver.exe'
driver = webdriver.Chrome(chrome_driver)

df_bundes_team = pd.DataFrame(columns = ['rank', 'team', 'game', 'win_pt', 'win', 'draw', 
                                         'lose', 'gf', 'ga', 'goal_diff'])

bundes_football = "https://sports.news.naver.com/wfootball/record/index?category=bundesliga&tab=team"
driver.get(bundes_football)
driver.implicitly_wait(3)

tr_len = len(driver.find_elements_by_css_selector('#wfootballTeamRecordBody > table > tbody > tr'))
print(tr_len)

for i in range(1, tr_len + 1):
    rank = driver.find_element_by_css_selector('#wfootballTeamRecordBody > table > tbody > tr:nth-child(' + str(i) + ') > td:nth-child(1) > div > strong').text
    team = driver.find_element_by_css_selector('#wfootballTeamRecordBody > table > tbody > tr:nth-child(' + str(i) + ') > td:nth-child(2) > div > span.name').text
    game = driver.find_element_by_css_selector('#wfootballTeamRecordBody > table > tbody > tr:nth-child(' + str(i) + ') > td:nth-child(3) > div > span').text
    win_pt = driver.find_element_by_css_selector('#wfootballTeamRecordBody > table > tbody > tr:nth-child(' + str(i) + ') > td:nth-child(4) > div > span').text
    win = driver.find_element_by_css_selector('#wfootballTeamRecordBody > table > tbody > tr:nth-child(' + str(i) + ') > td:nth-child(5) > div > span').text
    draw = driver.find_element_by_css_selector('#wfootballTeamRecordBody > table > tbody > tr:nth-child(' + str(i) + ') > td:nth-child(6) > div > span').text
    lose = driver.find_element_by_css_selector('#wfootballTeamRecordBody > table > tbody > tr:nth-child(' + str(i) + ') > td:nth-child(7) > div > span').text
    gf = driver.find_element_by_css_selector('#wfootballTeamRecordBody > table > tbody > tr:nth-child(' + str(i) + ') > td:nth-child(8) > div > span').text
    ga = driver.find_element_by_css_selector('#wfootballTeamRecordBody > table > tbody > tr:nth-child(' + str(i) + ') > td:nth-child(9) > div > span').text
    goal_diff = driver.find_element_by_css_selector('#wfootballTeamRecordBody > table > tbody > tr:nth-child(' + str(i) + ') > td:nth-child(10) > div > span').text
    
    print(rank, team, game, win_pt, win, draw, lose, gf, ga, goal_diff, '\n')
    
    df_bundes_team = df_bundes_team.append({'rank':rank, 'team':team, 'game':game, 'win_pt':win_pt, 'win':win, 'draw':draw,
                                            'lose':lose, 'gf':gf, 'ga':ga, 'goal_diff':goal_diff}, ignore_index=True)

print(df_bundes_team, '\n')

driver.quit()

And with that, we have successfully crawled the Bundesliga team standings from Naver.
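One caveat for newer environments: `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so the `append` call in the final code raises an error there. A common alternative is to collect each row as a dict in a list and build the DataFrame once at the end. The sketch below shows the idea with two made-up rows standing in for the scraped values; the team names and numbers are placeholders, not real standings, and `COLUMNS`, `scraped`, and `rows` are names introduced only for this example.

```python
import pandas as pd

COLUMNS = ['rank', 'team', 'game', 'win_pt', 'win', 'draw',
           'lose', 'gf', 'ga', 'goal_diff']

# Placeholder values (not real data); in the crawler these come from `.text`,
# so they are strings.
scraped = [
    ['1', 'Team A', '2', '6', '2', '0', '0', '5', '1', '+4'],
    ['2', 'Team B', '2', '3', '1', '0', '1', '3', '3', '0'],
]

rows = []
for values in scraped:
    # One dict per table row, keyed by column name.
    rows.append(dict(zip(COLUMNS, values)))

# Build the DataFrame once, after the loop, instead of appending per row.
df_bundes_team = pd.DataFrame(rows, columns=COLUMNS)
print(df_bundes_team)
```

In the crawler itself, the `rows.append(...)` line would sit where `df_bundes_team.append(...)` is now, and the `pd.DataFrame(...)` call would come after the loop.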
