Crawling Naver Images

Motivation

1. Training a good LoRA works better the more image files you have.
2. In my experience, image quality was better on Naver than on Google - though I plan to share the Google image crawling code as well.
3. Naver is more lenient toward crawling than Google, so the code ends up simpler - for Google you need something like undetected_chromedriver.

 

Usage

1. As of June 19, 2023, the code below runs fine. I added plenty of comments to make it easy to use.
2. Install Selenium first. urllib is part of the Python standard library and needs no installation.
pip install selenium
3. After you run the script, a Chrome window opens and is maximized.
4. The page scrolls automatically while the image list loads. Do not minimize the window during this phase; wait until the scroll reaches the bottom and there are no more images to load. Once the whole list is loaded, the page scrolls back to the top automatically. After that you can minimize the window.
5. Put the search terms for the images you want into item_list in the code below (comment #1).
6. Put the folder name at comment #2. With naver, as in the code below, images are saved under data\naver_<search term>\. You do not need to create the folder beforehand; the script creates it.
7. Put the XPath of the detail image at comment #3. In my crawling the XPath changed only very rarely, so leaving it as is should usually be fine.
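
The automatic scrolling in step 4 boils down to "scroll, wait, and stop once the page height no longer changes". The sketch below isolates that stopping rule so it can be tried without a browser. The names scroll_until_stable, get_height, scroll_down and the max_rounds safety cap are hypothetical and not part of the script in this post.

```python
import time
from typing import Callable


def scroll_until_stable(get_height: Callable[[], int],
                        scroll_down: Callable[[], None],
                        pause: float = 3.0,
                        max_rounds: int = 100) -> int:
    """Scroll until the page height stops growing; return the number of scrolls."""
    prev_height = get_height()
    for rounds in range(1, max_rounds + 1):
        scroll_down()
        time.sleep(pause)  # give the page time to load more thumbnails
        cur_height = get_height()
        if cur_height == prev_height:
            return rounds  # nothing new was loaded, so this is the end
        prev_height = cur_height
    return max_rounds  # safety cap (an assumption, not in the original script)


if __name__ == '__main__':
    # Stand-in for a real page: the height grows twice, then stays at 3000.
    heights = iter([1000, 2000, 3000, 3000])
    print(scroll_until_stable(lambda: next(heights), lambda: None, pause=0))
```

With Selenium, get_height would wrap driver.execute_script('return document.body.scrollHeight') and scroll_down would wrap driver.execute_script('window.scrollTo(0, document.body.scrollHeight)'), the same two calls the script uses.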

 

Updated to the latest class name and XPath - 24.07.26

'''
* Fetch Naver images (24.07.26)
'''

import os
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
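import urllib.parse    # urlencode, used to build the search URL
import urllib.request  # urlretrieve, used to download the images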
import time, datetime

item_list = ["도다리"]  # 1: search terms ("도다리" means flounder)
FOLDER = 'naver'  # 2: folder name prefix
IMG_XPATH = '/html/body/div[4]/div/div/div[1]/div[2]/div[1]/img'  # 3: XPath of the detail image

def main():
    start = check_start()  # start the timer
    driver = webdriver.Chrome()

    for searchItem in item_list:
        saveDir = makeFolder(searchItem)

        url = makeUrl(searchItem)  # build the search URL
        driver.get(url)  # open the image search page
        maximizeWindow(driver)  # maximize the window
        scrollToEnd(driver)

        forbiddenCount = saveImgs(driver, saveDir, start)  # save every detail image; returns the failure count
        sec = check_time(start)
        print(f'Failures: {forbiddenCount}, {sec}, {datetime.datetime.now().time()}')
    time.sleep(10)
    driver.quit()

# Build the image search URL
def makeUrl(searchItem):
    url = 'https://search.naver.com/search.naver'
    params ={
        'where' : 'image',
        'sm'    : 'tab_jum',
        'query' : searchItem
    }
    url = url + '?' + urllib.parse.urlencode(params)
    return url

# Create the output folder
def makeFolder(searchItem):
    saveDir = os.path.join(os.getcwd(), 'data', f'{FOLDER}_{searchItem}')
    try:
        os.makedirs(saveDir, exist_ok=True)  # create the folder if it does not exist yet
        return saveDir
    except OSError as e:
        # e is an exception object, so format it instead of concatenating with +
        print(f'{e}: failed to create folder')
        raise  # without a folder there is nowhere to save, so stop here

# Maximize the window
def maximizeWindow(driver):
    driver.maximize_window()

# Scroll down repeatedly so that the full image list gets loaded
def scrollToEnd(driver):
    prev_height = driver.execute_script('return document.body.scrollHeight')
    print(f'prev_height: {prev_height}')

    while True:
        time.sleep(1)  # without a sleep before scrolling, Naver gets stuck loading forever
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')
        time.sleep(3)

        cur_height = driver.execute_script('return document.body.scrollHeight')
        print(f'cur_height: {cur_height}')
        if cur_height == prev_height:
            print('Height stopped changing')
            break
        prev_height = cur_height

    # Once the whole page is loaded, scroll back to the top
    driver.execute_script('window.scrollTo(0, 0)')

# Save all images
def saveImgs(driver, saveDir, start):
    time.sleep(1)
    forbiddenCount = 0
    imgs = driver.find_elements(By.CSS_SELECTOR, '._fe_image_tab_content_thumbnail_image')

    print('imgs')
    print(imgs)
    img_count = len(imgs)
    print(f'Total images: {img_count}')
    # Click the thumbnails one by one and save each detail image
    for imgNum, img in enumerate(imgs):  # imgNum is the image index, starting at 0
        try:
            img.click()
            time.sleep(3)

            # The XPath below seems to change from time to time. The rest looks stable, so just check this one occasionally
            bigImg = driver.find_element(By.XPATH, IMG_XPATH)
            src = bigImg.get_attribute('src')
            urllib.request.urlretrieve(src, os.path.join(saveDir, f'{imgNum}.jpg'))
            sec = check_time(start)
            print(f'{imgNum+1}/{img_count} saved {sec}')

        except Exception as e:
            print(e)
            forbiddenCount += 1  # number of failed saves; forbidden responses and file errors are fairly common
            continue
    return forbiddenCount


# Timing helpers
def check_start():
    start_time = time.time()
    print("Start! now.." + str(start_time))
    return start_time
def check_time(start):
    end = time.time()
    during = end - start
    sec = str(datetime.timedelta(seconds=during)).split('.')[0]
    return sec

if __name__ == '__main__':
    main()
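
To see what makeUrl produces for a Korean search term, here is a minimal standalone sketch of the same URL construction. The snake_case name make_url and the BASE_URL constant are only for this illustration; the parameters are the ones used in the script above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = 'https://search.naver.com/search.naver'


def make_url(search_item: str) -> str:
    """Build the Naver image search URL for one search term."""
    params = {'where': 'image', 'sm': 'tab_jum', 'query': search_item}
    # urlencode percent-encodes non-ASCII text as UTF-8 bytes
    return BASE_URL + '?' + urlencode(params)


if __name__ == '__main__':
    url = make_url('도다리')
    print(url)
    # Decoding the query string gives back the original search term
    print(parse_qs(urlsplit(url).query)['query'])
```

Because urlencode handles the escaping, search terms with spaces or Korean characters can go into item_list unchanged.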

 

Now run the code. It should run fine.

Naver generally returns at most 500 images.
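
To check how many images a run actually saved against that rough ceiling, one option is to count the .jpg files in the output folder afterwards. This is a small sketch; count_saved_images is a hypothetical helper, and the demo uses a temporary folder rather than real crawl output.

```python
import os
import tempfile


def count_saved_images(save_dir: str, ext: str = '.jpg') -> int:
    """Count files in save_dir whose name ends with ext (case-insensitive)."""
    if not os.path.isdir(save_dir):
        return 0
    return sum(1 for name in os.listdir(save_dir)
               if name.lower().endswith(ext))


if __name__ == '__main__':
    # Demo on a throwaway folder that mimics the data/<FOLDER>_<term> layout
    with tempfile.TemporaryDirectory() as tmp:
        save_dir = os.path.join(tmp, 'data', 'naver_demo')
        os.makedirs(save_dir, exist_ok=True)
        for i in range(3):
            open(os.path.join(save_dir, f'{i}.jpg'), 'wb').close()
        print(count_saved_images(save_dir))
```

For a real run, pass the folder the script created, for example os.path.join(os.getcwd(), 'data', 'naver_도다리').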
