Sometimes you need to crawl community sites such as DC Inside, FM Korea, or Nate Pann. Most of these sites, however, block your IP after you fetch only a handful of posts. That is when you need to route your traffic through a different IP. The best and easiest way is a paid service such as NordVPN, but on macOS or Linux the free tor is a workable alternative. I have never tried this on Windows, so I don't know whether it works there, but I suspect it would not be easy.

 

1. First, install tor. The installation takes quite a while.

  1. sudo apt install tor [Linux]
  2. brew install tor [Mac]

2. Run tor in a terminal. tor now listens for SOCKS connections on port 9050, its default.
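Before pointing a browser at tor, it can help to confirm that something is actually listening on port 9050. The snippet below is a minimal sketch using only the standard library. The helper name `is_port_open` is my own, and it only checks that the TCP port accepts connections, not that the listener is really tor.

```python
import socket


def is_port_open(host: str = "127.0.0.1", port: int = 9050, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection raises OSError (refused, timed out, ...) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print("SOCKS port reachable:", is_port_open())
```

If this reports False, tor is probably not running yet, or it is configured to use a different port.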

 

3. The example below fetches the post list from the main page of FM Korea (fmkorea.com) through the tor proxy.

- It uses the Firefox browser.

- It pauses for 10 seconds after the first request because FM Korea redirects on the first visit. This behavior differs from site to site, so adjust the wait accordingly.

import re, time
from selenium import webdriver
from bs4 import BeautifulSoup

HOME = 'https://fmkorea.com'
# Selenium 4 style: set the proxy preferences on FirefoxOptions
options = webdriver.FirefoxOptions()
options.set_preference("network.proxy.type", 1)  # 1 = manual proxy configuration
options.set_preference("network.proxy.socks", "127.0.0.1")
options.set_preference("network.proxy.socks_port", 9050)  # tor's SOCKS port

driver = webdriver.Firefox(options=options)
driver.get(HOME)
time.sleep(10)
driver.get('https://www.fmkorea.com/index.php?mid=best&page=1')
html = driver.page_source # get the page source
soup = BeautifulSoup(html, 'html.parser')

articles = soup.select(".li_best2_pop0")
for article in articles:
    voted_count = article.select('.count')[0].text.strip() # upvote count
    title = article.select('.hotdeal_var8')[0]
    comment_count = title.find('span').text[1:].replace(']','') # comment count
    title.find('span').decompose() # remove the child span tag so it does not end up in the title

    href = HOME + title['href']
    title = title.text.strip() # title
    title = re.sub(r"[^\uAC00-\uD7A30-9a-zA-Z\s]", "_", title) # replace special characters with _
    category = article.select('.category > a:nth-child(1)')[0].text.strip() # category
    author = article.select('.author')[0].text[2:].strip() # author; drops the junk characters at the front
    date = article.select('.regdate')[0].text.strip() # date
    
    output_obj = { "title": title, "href": href,"voted_count": voted_count, "comment_count": comment_count, "category": category, "author": author, "date": date }
    
    print(output_obj)
    
driver.quit()
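The fixed 10-second pause in the code above is simple but wasteful when the redirect finishes sooner, and too short when it takes longer. One alternative is to poll for a condition until a timeout expires. The helper below is a rough sketch in plain Python. `wait_until` is a name I made up, and the commented usage line assumes that checking `driver.current_url` is enough to detect the end of the redirect, which I have not verified for FM Korea.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout: float = 10.0, interval: float = 0.5) -> bool:
    """Poll condition() until it returns True or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False  # gave up; the caller decides what to do next
        time.sleep(interval)


# Hypothetical usage with the Selenium driver from the example above:
# wait_until(lambda: "fmkorea.com" in driver.current_url, timeout=15)
```

Selenium also ships its own WebDriverWait class for this purpose, which may be the better fit once you know which element or URL marks the end of the redirect.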

 

4. Run it. It works.

 

5. Conclusion

- Check the community's robots.txt first and collect data only in line with it.

- Sleep long enough between requests so that you do not put load on the community's servers. The pool of IPs that tor provides is limited, so you can still get blocked even when going through tor.

- The major community sites have strong protections, so crawling them is not easy. Their page structure changes often, so the code above may suddenly stop working. Legal problems are also possible. So collect only what you actually need, and do it slowly.
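The first two points can be sketched with the standard library alone. The example below parses a robots.txt text, checks whether a URL may be fetched, and picks a randomized pause between requests. The robots.txt content, the example.com URLs, and the helper names `load_rules` and `polite_delay` are all made up for illustration. A real crawler would download the target site's actual robots.txt and call `time.sleep` with the computed pause.

```python
import random
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 5
"""


def load_rules(robots_text: str) -> RobotFileParser:
    """Build a parser from robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def polite_delay(parser: RobotFileParser, user_agent: str = "*",
                 default: float = 3.0, jitter: float = 2.0) -> float:
    """Return the Crawl-delay (or a default) plus a random jitter, in seconds."""
    base = parser.crawl_delay(user_agent)
    if base is None:
        base = default  # the site did not specify a Crawl-delay
    return float(base) + random.uniform(0.0, jitter)


rules = load_rules(SAMPLE_ROBOTS)
for url in ["https://example.com/index.php?mid=best&page=1",
            "https://example.com/admin/"]:
    if rules.can_fetch("*", url):
        pause = polite_delay(rules)
        print(f"allowed: {url} (would sleep {pause:.1f}s before fetching)")
    else:
        print(f"disallowed by robots.txt: {url}")
```

The jitter is there so that requests do not arrive at a perfectly regular interval. The base delay should still be generous enough for the site you are crawling.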
