Optimizing the Crawling Code

content0474 2025. 1. 15. 20:51

Crawling CNN itself worked fine, but it took far too long.

(For reference:

Fetching articles through the NYT API: 107 seconds,

Fetching articles by crawling CNN: 1775 seconds)
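
How these timings were measured is not shown here. As a minimal sketch, assuming a simple wall-clock measurement is good enough, a wrapper like the following could time each fetch job. `timed` and `fake_fetch` are hypothetical names used only for illustration.

```python
import time
from typing import Any, Callable, Tuple


def timed(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Run func and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def fake_fetch(n: int) -> list:
    # Stand-in for a real fetch job (hypothetical, not the project's code)
    time.sleep(0.01)
    return list(range(n))


if __name__ == "__main__":
    articles, seconds = timed(fake_fetch, 5)
    print(f"Fetched {len(articles)} items in {seconds:.2f}s")
```

The same wrapper could be pointed at the NYT task and the CNN task to get comparable numbers.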

Original code

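from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager


def scrape_cnn_news_with_selenium(category_url):
    # Configure the Selenium WebDriver
    options = webdriver.ChromeOptions()
    options.headless = False  # keep the browser window visible
    options.add_argument("--disable-gpu")
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

    try:
        driver.get(category_url)

        # Wait for the headline elements to appear
        WebDriverWait(driver, 10).until(
            EC.presence_of_all_elements_located((By.CLASS_NAME, "container__headline-text"))
        )

        # Parse the HTML with BeautifulSoup
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        articles = []

        # Extract each article's title and link
        for article in soup.find_all('span', class_='container__headline-text'):
            title = article.get_text().strip()  # article title
            link_element = article.find_parent("a")  # find the enclosing <a> tag
            if link_element and "href" in link_element.attrs:
                link = link_element['href']
                if not link.startswith("http"):  # handle relative paths
                    link = f"https://edition.cnn.com{link}"

                # Extract the full article body
                # (extract_article_content is a project helper not shown in this post)
                content = extract_article_content(driver, link)

                # Stored under "url" because the saving task reads article['url']
                articles.append({"title": title, "url": link, "content": content})

        return articles

    except Exception as e:
        print(f"Error occurred while scraping {category_url}: {e}")
        return []

    finally:
        driver.quit()  # close the browser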

The function above crawls every article on the category page.

import json
import time

from celery import shared_task

# Category, News, and redis_client come from the project's own modules (not shown in this post).


@shared_task
def fetch_and_store_cnn_news():
    categories = Category.objects.all()
    if not categories.exists():
        print("No categories found in the database.")
        return

    for category in categories:
        source_category = category.get_source_category(category.name, "CNN")
        if not source_category:
            print(f"No mapping found for category '{category.name}' in source 'CNN'.")
            continue

        category_url = f"https://edition.cnn.com/{source_category}"
        print(f"Fetching articles for category: {category.name} (URL: {category_url})")

        articles = scrape_cnn_news_with_selenium(category_url)

        if articles:
            top_articles = articles[:5]  # store only the top 5
            for article in top_articles:
                # Cache in Redis for 24 hours (86400 seconds)
                redis_key = f"news:{category.name.lower()}:{article['url']}"
                redis_value = json.dumps({
                    'title': article['title'],
                    'abstract': article['content'],  # full CNN article text stored as the abstract
                    'url': article['url'],
                    'published_date': time.strftime('%Y-%m-%d %H:%M:%S'),
                    'category': category.name
                })
                redis_client.set(redis_key, redis_value, ex=86400)

                # Save to PostgreSQL
                News.objects.update_or_create(
                    url=article['url'],
                    defaults={
                        'title': article['title'],
                        'abstract': article['content'],  # full CNN article text
                        'published_date': time.strftime('%Y-%m-%d %H:%M:%S'),
                        'category': category,
                    }
                )
            # Report the number actually stored, not the number crawled
            print(f"Saved {len(top_articles)} articles for category: {category.name}")
        else:
            print(f"No articles found for category: {category.name}")

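The task above then stores only the top 5 of those articles.

→ Crawling every article, most of which were never stored, made the run take far too long.

Modified code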

        # Slice to the first 5 headlines so only 5 article bodies are fetched
        for article in soup.find_all('span', class_='container__headline-text')[:5]:
            title = article.get_text().strip()
            link_element = article.find_parent("a")
            if link_element and "href" in link_element.attrs:
                link = link_element['href']
                if not link.startswith("http"):
                    link = f"https://edition.cnn.com{link}"

Changed the code to crawl only 5 articles and store those 5.
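
The saving comes from limiting the list before the expensive per-article fetch, not after it. Here is a minimal, Selenium-free sketch of that idea. `collect_articles`, `fake_fetch`, and the `/story-N` paths are made up for illustration, and `fake_fetch` stands in for the real content extraction.

```python
from typing import Callable, Dict, List

BASE_URL = "https://edition.cnn.com"


def collect_articles(
    links: List[str],
    fetch_content: Callable[[str], str],
    limit: int = 5,
) -> List[Dict[str, str]]:
    """Fetch content only for the first `limit` links."""
    articles = []
    for link in links[:limit]:  # slice BEFORE the expensive call
        if not link.startswith("http"):  # handle relative paths
            link = f"{BASE_URL}{link}"
        articles.append({"url": link, "content": fetch_content(link)})
    return articles


calls: List[str] = []


def fake_fetch(url: str) -> str:
    # Hypothetical stand-in for loading one article page
    calls.append(url)
    return f"body of {url}"


# 30 made-up relative links, as if found on a category page
links = [f"/story-{i}" for i in range(30)]
top = collect_articles(links, fake_fetch)
print(len(top), len(calls))  # 5 5
```

With 30 candidate links, only 5 page loads happen. In the original version all 30 would have been loaded and 25 of them thrown away.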

1775 s → 347 s: about 5× faster (roughly 80% less time)
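
The two figures can be checked directly from the measured times. The variable names are just for illustration.

```python
before_s = 1775  # CNN crawl time before the change, in seconds
after_s = 347    # CNN crawl time after the change, in seconds

speedup = before_s / after_s
saved_fraction = (before_s - after_s) / before_s

print(f"speedup: {speedup:.2f}x")           # speedup: 5.12x
print(f"time saved: {saved_fraction:.1%}")  # time saved: 80.5%
```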

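Decision

Crawl 6 to 7 articles in case some fail and store 5 of them vs. crawl 5 and store 5

Articles are also supplied through the NYT API, so even if crawling fails for a few CNN articles, the DB already holds spare articles.

The most frequent error (SSL certificate verification failure) was fixed in code (described separately in the troubleshooting post) → decided to crawl only 5 articles and store them.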
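
For comparison, the option that was not chosen (crawl a couple of spares and keep the first 5 that succeed) could look roughly like the sketch below. This is not the project's code: `crawl_with_spares`, `want`, `max_attempts`, and `flaky_fetch` are hypothetical, and the fetch function is a stand-in.

```python
from typing import Callable, Dict, List, Optional


def crawl_with_spares(
    links: List[str],
    fetch_content: Callable[[str], Optional[str]],
    want: int = 5,
    max_attempts: int = 7,
) -> List[Dict[str, str]]:
    """Try at most max_attempts links and stop once `want` articles succeed."""
    articles: List[Dict[str, str]] = []
    for link in links[:max_attempts]:
        if len(articles) == want:
            break
        try:
            content = fetch_content(link)
        except Exception:
            continue  # a failed fetch simply uses up one attempt
        if content:
            articles.append({"url": link, "content": content})
    return articles


attempted: List[str] = []


def flaky_fetch(url: str) -> str:
    # Hypothetical fetcher that fails on exactly one link
    attempted.append(url)
    if url.endswith("-1"):
        raise RuntimeError("simulated crawl failure")
    return f"body of {url}"


links = [f"https://edition.cnn.com/story-{i}" for i in range(10)]
result = crawl_with_spares(links, flaky_fetch)
print(len(result), len(attempted))  # 5 6
```

One failure costs one extra page load here. Since the NYT articles already act as spares in the DB, plain "crawl 5, store 5" avoids even that extra cost.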

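Since this runs asynchronously through Celery anyway, is saving time even meaningful?

It is, for the following reasons.

(1) Avoiding server blocks: Sites like CNN may detect and block IPs that send an abnormally large number of requests → sending fewer requests lowers the chance that the crawler is flagged and blocked.

(2) Saving resources: Selenium WebDriver consumes a lot of CPU, memory, and network resources → a shorter crawl saves resources on both the server and the client.

(3) Parallel efficiency: Other news sources or features may be added later. Even with asynchronous processing, running several resource-heavy tasks at once can create a bottleneck → shortening each task lets more tasks run in parallel efficiently.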