Sometimes you need to crawl community sites such as dc, FMKorea, or Nate Pann. Most community sites, however, block your IP after you fetch only a handful of posts. That is when you need to route around the IP block. The best and easiest way is a paid tool such as NordVPN, but on Mac or Linux the free tor can be an alternative. - I have never tried this on Windows, so I don't know whether it works there, but I doubt it is easy.

 

1. First, install tor. The install takes a while.

  1. sudo apt install tor [Linux]
  2. brew install tor [Mac]

2. Run tor in a terminal. By default tor now listens for SOCKS connections on port 9050.

 

3. The example below goes through tor and fetches the post list from the main page of FMKorea (fmkorea.com).

- It uses the Firefox browser.

- The 10-second pause after the first visit is there because FMKorea redirects on first access. This differs from site to site, so adjust it accordingly.

import re, time
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from bs4 import BeautifulSoup

HOME = 'https://fmkorea.com'

# Route Firefox traffic through the local tor SOCKS proxy.
# The preferences are set on Options because recent Selenium releases no longer
# accept a FirefoxProfile as the first argument of webdriver.Firefox().
options = Options()
options.set_preference("network.proxy.type", 1)  # 1 = manual proxy configuration
options.set_preference("network.proxy.socks", "127.0.0.1")
options.set_preference("network.proxy.socks_port", 9050)

driver = webdriver.Firefox(options=options)
driver.get(HOME)
time.sleep(10)  # wait for the first-visit redirect to finish
driver.get('https://www.fmkorea.com/index.php?mid=best&page=1')
html = driver.page_source # get the page source
soup = BeautifulSoup(html, 'html.parser')

articles = soup.select(".li_best2_pop0")
for article in articles:
    voted_count = article.select('.count')[0].text.strip() # number of upvotes
    title = article.select('.hotdeal_var8')[0]
    comment_count = title.find('span').text[1:].replace(']','') # number of comments
    title.find('span').decompose() # remove the child span tag so only the title text remains

    href = HOME + title['href']
    title = title.text.strip() # title
    title = re.sub(r"[^\uAC00-\uD7A30-9a-zA-Z\s]", "_", title) # replace special characters with _
    category = article.select('.category > a:nth-child(1)')[0].text.strip() # category
    author = article.select('.author')[0].text[2:].strip() # author; drops the junk characters at the front
    date = article.select('.regdate')[0].text.strip() # date

    output_obj = { "title": title, "href": href, "voted_count": voted_count, "comment_count": comment_count, "category": category, "author": author, "date": date }

    print(output_obj)

driver.quit()

 

4. Run it. It works.

5. Conclusion

- Check the community's robots.txt first and collect data in line with it.

- Put in enough sleep between requests so you don't load the community's server. The pool of IPs tor offers is limited too, so you can still get IP-blocked even when going through tor.

- The major community sites have strong protections, so crawling them is not easy. Their structure changes often, so the code above may suddenly stop working. Legal problems are possible as well. So collect only what you really need, and do it slowly.
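The first two points of the conclusion can be sketched with the standard library alone. This is only an illustration under assumptions: the robots.txt body, the example.com URLs, and the names `load_rules` and `crawl_politely` are made up here, and `fetch` stands in for whatever actually loads the page (for example `driver.get`).

```python
import time
import urllib.robotparser

# Hypothetical robots.txt body; a real run would download the target site's own file.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 5
"""

def load_rules(robots_text):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

def crawl_politely(parser, urls, fetch, agent="*", default_delay=3.0, sleep=time.sleep):
    """Call fetch(url) for each allowed URL, pausing between requests."""
    # Use the site's Crawl-delay when it declares one, otherwise our own default.
    delay = parser.crawl_delay(agent) or default_delay
    fetched = []
    for url in urls:
        if not parser.can_fetch(agent, url):
            continue  # disallowed by robots.txt, skip it
        fetch(url)
        fetched.append(url)
        sleep(delay)
    return fetched, delay

rules = load_rules(SAMPLE_ROBOTS)
fetched, delay = crawl_politely(
    rules,
    ["https://example.com/board/1", "https://example.com/admin/secret"],
    fetch=lambda url: print("fetching", url),  # stand-in for driver.get(url)
    sleep=lambda seconds: None,                # no real pause in this demo
)
print(fetched, delay)
```

In a real crawl you would leave `sleep` at its default so the pause actually happens, and pass the site's real robots.txt text to `load_rules`.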

* Crawling with selenium running only in memory, without opening a browser window

  1. It is almost identical to the earlier code ( https://blog.himion.com/176 ).
  2. A headlessDriver() function has been added.
'''
* Fetch Google images, headless version
'''

import os
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import undetected_chromedriver as uc

import urllib.parse
import urllib.request
import time, datetime

ITEM_LIST = [ "Keith Thompson", "Zdzislaw Beksinski", "dariusz zawadzki"] # 1
FOLDER = 'google' # 2
IMG_XPATH = '//*[@id="Sva75c"]/div[2]/div/div[2]/div[2]/div[2]/c-wiz/div/div/div/div[3]/div[1]/a/img[1]' # 3
SIGNINURL = 'https://accounts.google.com/signin/v2/identifier?hl=ko&passive=true&continue=https%3A%2F%2Fwww.google.com%2F&ec=GAZAmgQ&flowName=GlifWebSignIn&flowEntry=ServiceLogin'
ID = 'xxxx@gmail.com' # 4
PASSWORD = 'xxxx' # 5

def main():
  start = check_start() # start timing
  driver = headlessDriver() # use this when you want headless mode
  driver.get(SIGNINURL)
  googleSignIn(driver) # sign in to Google

  for searchItem in ITEM_LIST:
    saveDir = makeFolder(searchItem)

    url = makeUrl(searchItem) # build the search url
    driver.get(url) # go to the image search
    maximizeWindow(driver) # maximize the window
    scrollToEnd(driver)

    forbiddenCount = saveImgs(driver, saveDir, start) # save every detail image via its src
    sec = check_time(start)
    print(f'failures: {forbiddenCount}, {sec}, {datetime.datetime.now().time()}')
  time.sleep(10)
  driver.quit()

def headlessDriver():
  options = uc.ChromeOptions()
  options.headless=True
  options.add_argument('--headless=new')
  driver = uc.Chrome(options=options)
  return driver

# Google sign-in
def googleSignIn(driver):
  idBtn = driver.find_element(By.XPATH,'//*[@id="identifierId"]') # id input field
  idBtn.send_keys(ID)
  nextBtn = driver.find_element(By.XPATH,'//*[@id="identifierNext"]/div/button')
  nextBtn.click() # click the Next button

  # The wait below gives the password element up to 10 seconds to appear, but the
  # password field is still "not interactable" at that point and raises an error,
  # hence the extra sleep. The wait itself works, so use it whenever waiting is needed.
  try:
    passwordBtn = WebDriverWait(driver, timeout=10).until(EC.presence_of_element_located( (By.XPATH,'//*[@id="password"]/div[1]/div/div[1]/input') ))
    time.sleep(4)
    passwordBtn = driver.find_element(By.XPATH,'//*[@id="password"]/div[1]/div/div[1]/input') # password input field
    passwordBtn.send_keys(PASSWORD)
    passwordNextBtn = driver.find_element(By.XPATH,'//*[@id="passwordNext"]/div/button')
    passwordNextBtn.click() # Next button after the password
    print('Google sign-in succeeded')
    # driver.implicitly_wait(10)
  except WebDriverException as e: # selenium errors (timeout, not interactable) are not OSError
    print(e)

  time.sleep(20) # leave enough time for things like phone verification


# Build the Google image search url
def makeUrl(searchItem):
  url = 'https://www.google.com/search'
  params ={ # q and tbm are required
    'q'     : searchItem,
    'tbm'   : 'isch',
  }
  url = url + '?' + urllib.parse.urlencode(params)
  return url


# Create the folder
def makeFolder(searchItem):
  saveDir = os.path.join(os.getcwd(), 'data', f'{FOLDER}_{searchItem}')
  try:
    os.makedirs(saveDir, exist_ok=True) # create the folder if it does not exist yet
    return saveDir
  except OSError as e:
    print(f'{e} failed to create folder')
    raise

# Maximize the window
def maximizeWindow(driver):
  driver.maximize_window()

# Scroll down repeatedly so the full image list gets loaded
def scrollToEnd(driver):
  prev_height = driver.execute_script('return document.body.scrollHeight')
  print(f'prev_height: {prev_height}')

  while True:
    time.sleep(1) # Naver gets stuck loading forever if you scroll without a sleep
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')
    time.sleep(3)

    cur_height = driver.execute_script('return document.body.scrollHeight')
    print(f'cur_height: {cur_height}')
    if cur_height == prev_height:
      print('height unchanged')
      break
    prev_height = cur_height
  # once the whole page is loaded, scroll back to the top
  driver.execute_script('window.scrollTo(0, 0)')

# Save all the images
def saveImgs(driver, saveDir, start):
  time.sleep(1)
  forbiddenCount = 0
  imgs = driver.find_elements(By.CSS_SELECTOR, '.rg_i.Q4LuWd')
  img_count = len(imgs)
  print(f'total images: {img_count}')
  # click them one by one and save
  for imgNum, img in enumerate(imgs): # imgNum is the image index, starting at 0
    try:
      img.click()
      time.sleep(3)

      # This xPath seems to change often. The rest looks stable, so just check this one now and then
      bigImg = driver.find_element(By.XPATH, IMG_XPATH)
      src = bigImg.get_attribute('src')
      urllib.request.urlretrieve(src, os.path.join(saveDir, f'{imgNum}.jpg'))
      sec = check_time(start)
      print(f'{imgNum+1}/{img_count} saved {sec}')

    except Exception as e:
      print(e)
      forbiddenCount += 1 # number of failed saves; forbidden and file errors are fairly common
      continue
  return forbiddenCount


# Timing helpers
def check_start():
  start_time = time.time()
  print("Start! now.." + str(start_time))
  return start_time

def check_time(start):
  end = time.time()
  during = end - start
  sec = str(datetime.timedelta(seconds=during)).split('.')[0]
  return sec

if __name__ == '__main__':
  main()

* Why sign in before crawling

  1. With Google, the image list you get when signed in often differs from the one you get when signed out.
  2. Images that require age verification can only be fetched when signed in.

* Usage

  1. Confirmed to crawl correctly [23.06.20]
  2. Install the modules - pip install undetected_chromedriver selenium
  3. At comment 1, enter the list of search terms you want images for
  4. At comment 2, enter the folder name. Images are saved under data\google_<search term>\
  5. At comment 3, enter the xPath of the detail image. Google seems to change it often.
  6. At comment 4, enter your Google ID
  7. At comment 5, enter your Google password. The script then waits 20 seconds in case a smartphone verification screen appears.
'''
* Fetch Google images (23.06.20)
'''

import os
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import undetected_chromedriver as uc

import urllib.parse
import urllib.request
import time, datetime

ITEM_LIST = [ "Keith Thompson", "Zdzislaw Beksinski", "dariusz zawadzki"] # 1
FOLDER = 'google' # 2
IMG_XPATH = '//*[@id="Sva75c"]/div[2]/div/div[2]/div[2]/div[2]/c-wiz/div/div/div/div[3]/div[1]/a/img[1]' # 3
SIGNINURL = 'https://accounts.google.com/signin/v2/identifier?hl=ko&passive=true&continue=https%3A%2F%2Fwww.google.com%2F&ec=GAZAmgQ&flowName=GlifWebSignIn&flowEntry=ServiceLogin'
ID = 'xxxx@gmail.com' # 4
PASSWORD = 'xxxx' # 5

def main():
  start = check_start() # start timing
  driver = uc.Chrome() # start the driver that makes Google sign-in possible
  driver.get(SIGNINURL)
  googleSignIn(driver) # sign in to Google

  for searchItem in ITEM_LIST:
    saveDir = makeFolder(searchItem)

    url = makeUrl(searchItem) # build the search url
    driver.get(url) # go to the image search
    maximizeWindow(driver) # maximize the window
    scrollToEnd(driver)

    forbiddenCount = saveImgs(driver, saveDir, start) # save every detail image via its src
    sec = check_time(start)
    print(f'failures: {forbiddenCount}, {sec}, {datetime.datetime.now().time()}')
  time.sleep(10)
  driver.quit()

# Google sign-in
def googleSignIn(driver):
  idBtn = driver.find_element(By.XPATH,'//*[@id="identifierId"]') # id input field
  idBtn.send_keys(ID)
  nextBtn = driver.find_element(By.XPATH,'//*[@id="identifierNext"]/div/button')
  nextBtn.click() # click the Next button

  # The wait below gives the password element up to 10 seconds to appear, but the
  # password field is still "not interactable" at that point and raises an error,
  # hence the extra sleep. The wait itself works, so use it whenever waiting is needed.
  try:
    passwordBtn = WebDriverWait(driver, timeout=10).until(EC.presence_of_element_located( (By.XPATH,'//*[@id="password"]/div[1]/div/div[1]/input') ))
    time.sleep(4)
    passwordBtn = driver.find_element(By.XPATH,'//*[@id="password"]/div[1]/div/div[1]/input') # password input field
    passwordBtn.send_keys(PASSWORD)
    passwordNextBtn = driver.find_element(By.XPATH,'//*[@id="passwordNext"]/div/button')
    passwordNextBtn.click() # Next button after the password
    print('Google sign-in succeeded')
    # driver.implicitly_wait(10)
  except WebDriverException as e: # selenium errors (timeout, not interactable) are not OSError
    print(e)

  time.sleep(20) # leave enough time for things like phone verification


# Build the Google image search url
def makeUrl(searchItem):
  url = 'https://www.google.com/search'
  params ={ # q and tbm are required
    'q'     : searchItem,
    'tbm'   : 'isch',
  }
  url = url + '?' + urllib.parse.urlencode(params)
  return url


# Create the folder
def makeFolder(searchItem):
  saveDir = os.path.join(os.getcwd(), 'data', f'{FOLDER}_{searchItem}')
  try:
    os.makedirs(saveDir, exist_ok=True) # create the folder if it does not exist yet
    return saveDir
  except OSError as e:
    print(f'{e} failed to create folder')
    raise

# Maximize the window
def maximizeWindow(driver):
  driver.maximize_window()

# Scroll down repeatedly so the full image list gets loaded
def scrollToEnd(driver):
  prev_height = driver.execute_script('return document.body.scrollHeight')
  print(f'prev_height: {prev_height}')

  while True:
    time.sleep(1) # Naver gets stuck loading forever if you scroll without a sleep
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')
    time.sleep(3)

    cur_height = driver.execute_script('return document.body.scrollHeight')
    print(f'cur_height: {cur_height}')
    if cur_height == prev_height:
      print('height unchanged')
      break
    prev_height = cur_height
  # once the whole page is loaded, scroll back to the top
  driver.execute_script('window.scrollTo(0, 0)')

# Save all the images
def saveImgs(driver, saveDir, start):
  time.sleep(1)
  forbiddenCount = 0
  imgs = driver.find_elements(By.CSS_SELECTOR, '.rg_i.Q4LuWd')
  img_count = len(imgs)
  print(f'total images: {img_count}')
  # click them one by one and save
  for imgNum, img in enumerate(imgs): # imgNum is the image index, starting at 0
    try:
      img.click()
      time.sleep(3)

      # This xPath seems to change often. The rest looks stable, so just check this one now and then
      bigImg = driver.find_element(By.XPATH, IMG_XPATH)
      src = bigImg.get_attribute('src')
      urllib.request.urlretrieve(src, os.path.join(saveDir, f'{imgNum}.jpg'))
      sec = check_time(start)
      print(f'{imgNum+1}/{img_count} saved {sec}')

    except Exception as e:
      print(e)
      forbiddenCount += 1 # number of failed saves; forbidden and file errors are fairly common
      continue
  return forbiddenCount


# Timing helpers
def check_start():
  start_time = time.time()
  print("Start! now.." + str(start_time))
  return start_time

def check_time(start):
  end = time.time()
  during = end - start
  sec = str(datetime.timedelta(seconds=during)).split('.')[0]
  return sec

if __name__ == '__main__':
  main()

* How to find the xPath of the detail image

- In Chrome, click an image in the results. In the detail view that opens, right-click the large image element in DevTools and copy its xPath (Copy > Copy XPath).

Next time I will write up headless crawling.

- Headless crawling runs the browser only in memory, without showing a browser window on screen.

Crawling Naver images

Motivation

1. To train a good LoRA, the more image files the better.
2. In my experience the image quality was better on Naver than on Google - though I plan to share the Google image crawling code too.
3. Naver is more lenient about crawling than Google, so the code ends up simpler - for Google you need something like undetected_chromedriver.

Usage

1. As of June 19, 2023 the code below runs fine. I added plenty of comments to make it easy to use.
2. Install selenium first. urllib is part of the Python standard library and needs no install.
pip install selenium
3. After you run it, a Chrome window opens and is maximized.
4. It scrolls automatically and loads the image list. Do not minimize the window at this stage; wait until the scroll reaches the bottom and there are no more images to load. Once the whole list is loaded, the page scrolls back to the top automatically. From then on you can minimize the window.
5. Put the names of the images you want into item_list in the code below. Comment #1
6. Put the folder name at comment #2. With naver, as in the code below, images are saved under data\naver_<search term>\. You do not need to create the folder beforehand; it is created automatically.
7. Put in the xpath of the detail image. Comment #3. From my crawling so far, the xpath changes only very rarely, so leaving it as is will probably be fine.

Updated to the latest class name and xPath - 24.07.26

'''
* Fetch Naver images (24.07.26)
'''

import os
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
import urllib.parse
import urllib.request
import time, datetime

item_list = [ "도다리"] # 1
FOLDER = 'naver' # 2
IMG_XPATH = '/html/body/div[4]/div/div/div[1]/div[2]/div[1]/img' # 3

def main():
    start = check_start() # start timing
    driver = webdriver.Chrome()

    for searchItem in item_list:
        saveDir = makeFolder(searchItem)

        url = makeUrl(searchItem) # build the search url
        driver.get(url) # go to the image search
        maximizeWindow(driver) # maximize the window
        scrollToEnd(driver)

        forbiddenCount = saveImgs(driver, saveDir, start) # save every detail image via its src
        sec = check_time(start)
        print(f'failures: {forbiddenCount}, {sec}, {datetime.datetime.now().time()}')
    time.sleep(10)
    driver.quit()

# Build the image search url
def makeUrl(searchItem):
    url = 'https://search.naver.com/search.naver'
    params ={
        'where' : 'image',
        'sm'    : 'tab_jum',
        'query' : searchItem
    }
    url = url + '?' + urllib.parse.urlencode(params)
    return url

# Create the folder
def makeFolder(searchItem):
    saveDir = os.path.join(os.getcwd(), 'data', f'{FOLDER}_{searchItem}')
    try:
        os.makedirs(saveDir, exist_ok=True) # create the folder if it does not exist yet
        return saveDir
    except OSError as e:
        print(f'{e} failed to create folder')
        raise

# Maximize the window
def maximizeWindow(driver):
    driver.maximize_window()

# Scroll down repeatedly so the full image list gets loaded
def scrollToEnd(driver):
    prev_height = driver.execute_script('return document.body.scrollHeight')
    print(f'prev_height: {prev_height}')

    while True:
        time.sleep(1) # Naver gets stuck loading forever if you scroll without a sleep
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')
        time.sleep(3)

        cur_height = driver.execute_script('return document.body.scrollHeight')
        print(f'cur_height: {cur_height}')
        if cur_height == prev_height:
            print('height unchanged')
            break
        prev_height = cur_height

    # once the whole page is loaded, scroll back to the top
    driver.execute_script('window.scrollTo(0, 0)')

# Save all the images
def saveImgs(driver, saveDir, start):
    time.sleep(1)
    forbiddenCount = 0
    imgs = driver.find_elements(By.CSS_SELECTOR, '._fe_image_tab_content_thumbnail_image')

    print('imgs')
    print(imgs)
    img_count = len(imgs)
    print(f'total images: {img_count}')
    # click them one by one and save
    for imgNum, img in enumerate(imgs): # imgNum is the image index, starting at 0
        try:
            img.click()
            time.sleep(3)

            # This xPath seems to change often. The rest looks stable, so just check this one now and then
            bigImg = driver.find_element(By.XPATH, IMG_XPATH)
            src = bigImg.get_attribute('src')
            urllib.request.urlretrieve(src, os.path.join(saveDir, f'{imgNum}.jpg'))
            sec = check_time(start)
            print(f'{imgNum+1}/{img_count} saved {sec}')

        except Exception as e:
            print(e)
            forbiddenCount += 1 # number of failed saves; forbidden and file errors are fairly common
            continue
    return forbiddenCount


# Timing helpers
def check_start():
    start_time = time.time()
    print("Start! now.." + str(start_time))
    return start_time

def check_time(start):
    end = time.time()
    during = end - start
    sec = str(datetime.timedelta(seconds=during)).split('.')[0]
    return sec

if __name__ == '__main__':
    main()

Now run the code. It should run fine.

Naver generally returns at most 500 images.

I am cooking up data to feed to an AI.

I have a few epubs, but they cannot be fed in as they are, so they all need converting to plain text.

There turned out to be less material on this than I expected.

In particular, the Korean text always came out garbled.

calibre was recommended, so I installed it and tried convert, but I could not choose the output folder, which made that a fairly tedious job as well - if Python had failed in the end, I was going to do it that way anyway and then write code to find all the txt files and gather them in one place.. but Python solved it.

First, install EbookLib

 

1. pip install EbookLib
https://pypi.org/project/EbookLib/

 


 

- Have a look at the docs too. They are a bit thin, though.

https://docs.sourcefabric.org/projects/ebooklib/en/latest/tutorial.html#introduction

 


2. Here is the code.

import ebooklib
from ebooklib import epub
from bs4 import BeautifulSoup

book = epub.read_epub('./tear.epub')
result = book.get_metadata('DC', 'language') # shows which language the book is in

# Walk every document item in the book and print the text of its <p> tags.
for idx, doc in enumerate(book.get_items_of_type(ebooklib.ITEM_DOCUMENT)):
    content = doc.content # raw XHTML of this chapter; a new name so `book` is not overwritten
    soup = BeautifulSoup(content, 'html.parser')
    for pTag in soup.select('p'):
        print(pTag.text)

3. Finding the right options took some effort.

- For Korean-text problems stackoverflow is not much help. And it turned out not to be a Korean-text problem anyway.

- I used enumerate to get idx, but relying on it makes the code less general, so only doc is actually used. idx is handy if you want to pull out and process only specific parts of a book.

4. The book title can be fetched with book.get_metadata('DC', 'title').

- Build a file name from it and save everything into one folder, and all the epub files can be converted to txt in one go.
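The idea in step 4 could look roughly like the sketch below. It is an assumption-laden illustration, not the code I actually ran: `safe_filename`, `epub_to_txt` and `convert_folder` are hypothetical helper names, and it assumes `get_metadata('DC', 'title')` returns a list of (value, attributes) tuples as the EbookLib tutorial shows.

```python
import os
import re

def safe_filename(title, fallback="untitled"):
    """Turn a book title into a file name that is safe on common file systems."""
    # Collapse path separators, reserved characters and whitespace into "_".
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', "_", title).strip("_")
    return (cleaned or fallback) + ".txt"

def epub_to_txt(epub_path, out_dir):
    """Write the <p> text of one EPUB into out_dir, named after its DC title."""
    # Imported lazily so safe_filename() works without these packages installed.
    import ebooklib
    from ebooklib import epub
    from bs4 import BeautifulSoup

    book = epub.read_epub(epub_path)
    titles = book.get_metadata('DC', 'title')  # assumed shape: [(value, attributes), ...]
    if titles:
        title = titles[0][0]
    else:
        # No title metadata: fall back to the epub's own file name.
        title = os.path.splitext(os.path.basename(epub_path))[0]

    lines = []
    for doc in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
        soup = BeautifulSoup(doc.content, 'html.parser')
        lines.extend(p.get_text() for p in soup.select('p'))

    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, safe_filename(title))
    with open(out_path, 'w', encoding='utf-8') as f:
        f.write('\n'.join(lines))
    return out_path

def convert_folder(src_dir, out_dir):
    """Convert every .epub directly inside src_dir; returns the written paths."""
    return [
        epub_to_txt(os.path.join(src_dir, name), out_dir)
        for name in sorted(os.listdir(src_dir))
        if name.lower().endswith('.epub')
    ]

print(safe_filename('A Tale: of/Two Cities?'))
```

Calling `convert_folder('./epubs', './txt')` (both folder names are placeholders) would then write one txt per book. Two books with the same title would overwrite each other, so a real version might append a counter.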

 

● Crawling earthquake data from the Korea Meteorological Administration (KMA)

> Introduction

1. I recently started a visualization study group.

2. Doing plain R&D is no fun, and I wanted to build something meaningful.

3. Then, watching the recent Pohang earthquake, I got curious about something.

4. Namely: 'If an earthquake of intensity 7 hit Pohang, how strong would it be by the time it reached Seoul?'

5. I decided to visualize that.

 

> Data sources

1. Korea Meteorological Administration (KMA)

- The unfortunate part is that depth information is only provided from July 5, 2017 onward.

- I asked by email and was told they cannot provide earlier data. No reason given, as usual. Being a civil servant really is the best job.

- http://www.kma.go.kr/weather/earthquake_volcano/domesticlist.jsp

2. Earthquake Research Center - all of its data was deleted recently. I have no idea why. So it is useless.

 

> Site analysis

1. Earthquake data is available from 1978 onward. That part is great.

2. A single search returns at most 999 results. So the query has to be split around 2012 and sent separately for the period before and the period after.

3. The crawling itself is very easy. The catch is that the page is encoded in that clunker euc-kr, so the text comes out garbled if you just fetch it in node.

4. Luckily the iconv module could convert it to utf8.

5. A for loop and the request module do not go together at all. It is the async problem.

6. Problems 4 and 5 got tangled together and cost me a bit of trouble.

7. I also fumbled a bit because I had not coded in a while.

 

 

8. Anyway, it is done. I threw it together quickly, so it looks like that. I fetched 1663 earthquake records in total. I skipped sleep() and just pulled everything, so it took only about a minute. Sorry, KMA.

9. Next I want to plot this latitude and longitude data on a map. The biggest question is whether to use Google, or Daum or Naver maps. Judging from past API experience, Google Maps was quite weak for maps of Korea, so I probably won't use it. I will probably go with Daum maps. But I don't know whether Daum or Naver maps support SVG well. I will check a few more things tomorrow and then decide.

10. Simply marking earthquakes on a map has been done many times already. One look tells you it doesn't mean much.

11. I want to use 'depth' and 'magnitude' to draw areas and gradients rather than plain dots. A time lapse could also show how the earthquake areas change over time.

12. It may be overkill, but I could also let the user trigger an earthquake of a given strength on the map and show the area it would affect.

13. On top of that I could place virtual 5-story and 10-story buildings and show how badly they would collapse at that strength.

14. Well, I don't know earthquakes in detail, the data has no P-wave or S-wave information, and on top of that there is no depth either, so the plan is to build something rough, 'roughly'.

15. In any case, it sounds fun. I am also very curious about the result.

16. I think that is the appeal of visualization.

17. By the way, I don't really know SVG either, haha. I need to study some theory on how to build this.

18. Time for bed. It is 6 a.m. T_T
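Looking back at points 2 to 4 of the site analysis, the two workarounds can be sketched in Python, although the original crawler was written for node with the iconv module. Everything here is illustrative: `split_date_range` and `decode_euc_kr` are hypothetical names, the end date is a placeholder, and treating 2012 as the first year of the second query is an assumption.

```python
from datetime import date

def split_date_range(start, end, pivot_year):
    """Split [start, end] into two inclusive ranges so each query stays under the cap."""
    before = (start, date(pivot_year - 1, 12, 31))
    after = (date(pivot_year, 1, 1), end)
    return [before, after]

def decode_euc_kr(raw):
    """Decode a response body served as EUC-KR into a normal Python str."""
    # errors='replace' keeps the crawl going if a stray byte is not valid EUC-KR.
    return raw.decode('euc-kr', errors='replace')

# Two queries instead of one: 1978 up to the end of 2011, then 2012 onward.
ranges = split_date_range(date(1978, 1, 1), date(2017, 11, 30), pivot_year=2012)
for begin, finish in ranges:
    print(begin.isoformat(), '->', finish.isoformat())

# Stand-in for bytes read from the site: Korean text encoded as EUC-KR.
sample = '포항 지진'.encode('euc-kr')
print(decode_euc_kr(sample))
```

If one of the two ranges still hit the 999-result cap, the same function could be applied again to that range with a different pivot year.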

 

 

 
