🔥알림🔥
① 테디노트 유튜브 - 구경하러 가기!
② LangChain 한국어 튜토리얼 바로가기 👀
③ 랭체인 노트 무료 전자책(wikidocs) 바로가기 🙌

12 분 소요

이번 튜토리얼에서는 investing.com 의 뉴스기사를 크롤링 후 ChatGPT로 영문 뉴스기사를 요약하고, 이를 한글로 번역하는 튜토리얼을 진행해 보겠습니다.

실습파일


진행 과정

  1. investing.com의 뉴스기사(본문) 크롤링.

  2. chatgpt 에 요약 요청.

  3. 요약을 한글로 변환.

(❌불가) beautifulsoup4 + requests 로 크롤링 시도

  • 시도를 해봤으나 사이트에서 requests + beautifulsoup4 를 사용한 크롤링 시도가 막혀 있어서 사용이 불가했습니다.

실행하기 위한 설치코드

# !pip install beautifulsoup4
# !pip install requests

요청을 위한 코드

import requests
from bs4 import BeautifulSoup as bs
from fake_useragent import UserAgent

# user-agent 설정
ua = UserAgent()
headers = {'User-Agent':str(ua.chrome)}

# 요청하고자 하는 샘플 뉴스기사 URL
url = 'https://www.investing.com/analysis/us-stock-market-has-plenty-of-reasons-to-rally-after-feds-decision-200634857'

# requests 요청
page = requests.get(url, headers=headers)
page
<Response [403]>

결과는 403 response가 뜨는 것을 확인할 수 있습니다.

따라서, selenium 으로 크롤링을 진행하는 것으로 우회하도록 하겠습니다.

1. [크롤링] Selenium

Selenium 설치 코드

# !pip install selenium
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# 크롬드라이버 셋팅
def set_chrome_driver(headless=True):
    options = webdriver.ChromeOptions()
    if headless:
        options.add_argument('headless')
    options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36")
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
    return driver

set_chrome_driver는 chrome 드라이버를 셋업하는 함수입니다.

만약 로컬에 chromedriver가 존재하지 않으면 알아서 올바른 버전을 다운로드 진행하게 됩니다(이전에는 chromedriver를 직접 다운로드 했었어야 합니다).

# driver 설정
driver = set_chrome_driver(False)

# URL 요청
driver.get(url)

# aritivlePage는 신문기사의 본문입니다
article_page = driver.find_element(By.CLASS_NAME, 'articlePage')
article_page
<selenium.webdriver.remote.webelement.WebElement (session="d0e6dbbcd5f3fa889cfa228f445c9576", element="0053d389-fa03-426c-b735-52c2dc17e736")>

아래의 코드를 실행하여 신문기사의 본문이 잘 크롤링 되었는지 확인합니다.

# 신문기사의 본문을 출력합니다.
print(article_page.text)
Francesco Casarella/Investing.com
Articles (31)
Follow
  US500
-0.53%
Markets are cautious ahead of the Fed
After a tumultuous 2022, many investors are waiting on the sidelines, holding cash and waiting to enter
With risk-off sentiment dominating and plenty of liquidity on the sidelines, markets could rally in the second half of the week
Yesterday, the S&P 500 closed lower. This is nothing new, considering the same thing happened the last three times Powell spoke.
S&P 500 Daily Chart
I don't expect any surprises. A 25bp hike and Powell maintaining his stance on fighting inflation ("we're improving, but it's not time to rest yet") is likely. As always, the markets are pricing in such a scenario.
In the meantime, while the focus is still on the recession and earnings (we will have a dedicated analysis as soon as the quarters are over), there are other situations worthy of consideration.
Cash Levels at All-Time Highs
After the sell-off in 2022, there is still a lot of liquidity on the sidelines that needs to be deployed. We can see above that several funds are at record highs not seen for years (curiously, they were also at very high levels in 2009 as the market recovered from the subprime bubble).
The buyback announcements made by various companies in January could help support prices.
US Share Buyback Authorizations
Generally, we are not seeing the euphoria typical of bubble bursts, where the collapse comes after markets are taken by surprise.
After a year like 2022, the markets are already negative as far as sentiment is concerned, and if we look at the chart below, we can see that traders are still in a risk-off mode.
Usually, when traders are negative, there is a lot of caution, and as a result, it is difficult to be caught off guard if the market declines further.
Risk Off/Risk On Sentiment
However, the surprise could come from the opposite direction. A continuation of the rally could generate a buying frenzy in a self-reinforcing mechanism between closing shorts and new buying.
In this sense, this week will not be so much about the FOMC's decision on the size of the hike, nor even about Powell's words (which I think will be confirmed as hawkish).
Instead, it will be about how the markets react in the second half of the week and the week after.
Disclaimer: This article is written for informational purposes only; it does not constitute a solicitation, offer, advice, counseling, or recommendation to invest as such it is not intended to incentivize the purchase of assets in any way. I want to remind you that any type of asset is evaluated from multiple points of view and is highly risky; therefore, any investment decision and the associated risk remain with the investor.

2. [요약] OpenAI API (ChatGPT)

API KEY 발급 방법

  • API KEY 신청 주소

  • https://beta.openai.com/ 회원 가입 후

  • https://beta.openai.com/account/api-keys

  • create new secret key

openai 라이브러리 설치코드

# !pip install openai

먼저 openai 라이브러리 import 후 API KEY를 설정 합니다.

import openai

# API 키 설정
openai.api_key = "OpenAI API Key 입력"

요약 및 긍/부정 감정 분석을 진행하는 프롬프트를 생성하여 출력 합니다.

# 프롬프트 (요약해줘 + 긍/부정 감정도 분석해줘)
prompt = f'''
Summarize the paragraph below and interpret whether it is a positive or negative sentiment.

{article_page.text}
'''
print(prompt)

Summarize the paragraph below and interpret whether it is a positive or negative sentiment.

Francesco Casarella/Investing.com
Articles (31)
Follow
  US500
-0.53%
Markets are cautious ahead of the Fed
After a tumultuous 2022, many investors are waiting on the sidelines, holding cash and waiting to enter
With risk-off sentiment dominating and plenty of liquidity on the sidelines, markets could rally in the second half of the week
Yesterday, the S&P 500 closed lower. This is nothing new, considering the same thing happened the last three times Powell spoke.
S&P 500 Daily Chart
I don't expect any surprises. A 25bp hike and Powell maintaining his stance on fighting inflation ("we're improving, but it's not time to rest yet") is likely. As always, the markets are pricing in such a scenario.
In the meantime, while the focus is still on the recession and earnings (we will have a dedicated analysis as soon as the quarters are over), there are other situations worthy of consideration.
Cash Levels at All-Time Highs
After the sell-off in 2022, there is still a lot of liquidity on the sidelines that needs to be deployed. We can see above that several funds are at record highs not seen for years (curiously, they were also at very high levels in 2009 as the market recovered from the subprime bubble).
The buyback announcements made by various companies in January could help support prices.
US Share Buyback Authorizations
Generally, we are not seeing the euphoria typical of bubble bursts, where the collapse comes after markets are taken by surprise.
After a year like 2022, the markets are already negative as far as sentiment is concerned, and if we look at the chart below, we can see that traders are still in a risk-off mode.
Usually, when traders are negative, there is a lot of caution, and as a result, it is difficult to be caught off guard if the market declines further.
Risk Off/Risk On Sentiment
However, the surprise could come from the opposite direction. A continuation of the rally could generate a buying frenzy in a self-reinforcing mechanism between closing shorts and new buying.
In this sense, this week will not be so much about the FOMC's decision on the size of the hike, nor even about Powell's words (which I think will be confirmed as hawkish).
Instead, it will be about how the markets react in the second half of the week and the week after.
Disclaimer: This article is written for informational purposes only; it does not constitute a solicitation, offer, advice, counseling, or recommendation to invest as such it is not intended to incentivize the purchase of assets in any way. I want to remind you that any type of asset is evaluated from multiple points of view and is highly risky; therefore, any investment decision and the associated risk remain with the investor.

ChatGPT에 요약을 요청합니다.

def summarize(prompt):
    # 모델 엔진 선택
    model_engine = "text-davinci-003"

    # 맥스 토큰
    max_tokens = 3000

    # 요약 요청
    completion = openai.Completion.create(
        engine=model_engine,
        prompt=prompt,
        max_tokens=max_tokens,
        temperature=0.3,       # creativity
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0
    )
    return completion
# 요약 요청후 결과 return
response = summarize(prompt)

# 결과 출력
print(response.choices[0].text)

This paragraph is discussing the cautious sentiment of investors in the markets due to the tumultuous year of 2022 and the potential for a rally in the second half of the week. It also mentions the high levels of cash on the sidelines and the potential for a buying frenzy if the markets continue to rally. Overall, this is a positive sentiment.

3. [번역] 파파고 번역

요약본을 한글로 번역하기 위하여 PAPAGO 번역 사이트를 이용하였습니다.

프로세스

  1. 파파고 번역 요청을 위한 selenium 브라우저를 생성

  2. 영문 텍스트를 입력

  3. 번역된 한글 텍스트를 크롤링

# 샘플 텍스트
summarized = '''This paragraph is summarizing the current state of the stock market, with the Dow Jones Industrial Average, S&P 500, and NASDAQ Composite all down, while Gold Futures and Advanced Micro Devices Inc (AMD) rose. Investors are cautious ahead of the Federal Reserve's rate decision, and corporate earnings are mixed, with Electronic Arts Inc (EA) and Snap Inc (SNAP) falling and AMD rising. Oil prices are also down. Overall, the sentiment of the paragraph is negative.'''
import time

# 1. 파파고 번역 요청을 위한 selenium 생성
papago = set_chrome_driver(False)
papago.get(f'https://papago.naver.com/')

# 2. 영문 문장 입력
papago.find_element(By.ID, 'txtSource').send_keys(summarized)
# 2. 번역 버튼 클릭
papago.find_element(By.ID, 'btnTranslate').click()

# 강제 지연 시간: 번역을 기다릴 수 있도록 2초 슬립
time.sleep(2)

# 3. 번역된 한글 텍스트 크롤링
papago_translated = papago.find_element(By.ID, 'targetEditArea')
print(papago_translated.text)
이 단락은 다우존스 산업평균지수, S&P 500, 나스닥 종합지수가 모두 하락한 반면 골드 퓨처스와 어드밴스드 마이크로 디바이스 주식회사(AMD)는 상승하는 등 주식시장의 현주소를 요약하고 있다. 미국 연방준비제도(Fed·연준)의 금리 결정을 앞두고 투자자들이 신중한 모습을 보이고 있고, 일렉트로닉아츠(EA)와 스냅인(SNAP)이 하락하고 AMD가 상승하는 등 기업 실적도 엇갈리고 있다. 유가도 하락했다. 전반적으로 문단의 정서는 부정적이다.

위의 내용을 아래에 함수화 합니다.

번역된 텍스트 크롤링 시 오류가 날 수 있으므로, 이에 대한 예외처리도 적용합니다.

def papago_translate(text):
    try:
        papago = set_chrome_driver(False)
        papago.get('https://papago.naver.com/')
        time.sleep(1)
        papago.find_element(By.ID, 'txtSource').send_keys(text)
        papago.find_element(By.ID, 'btnTranslate').click()
        time.sleep(2)
        papago_translated = papago.find_element(By.ID, 'targetEditArea')
        result = papago_translated.text
    except NoSuchElementException: # 예외처리 (요소를 찾지 못하는 경우)
        result = '번역 오류ㅠㅠ'
    finally:
        papago.close()
    return result

🔅 [최종] 크롤링+요약+번역 코드

import time
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException
import openai


# OPENAI API키 설정
openai.api_key = "OpenAI API Key 입력"

# 크롬드라이버 셋팅
def set_chrome_driver(headless=True):
    options = webdriver.ChromeOptions()
    if headless:
        options.add_argument('headless')
    options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36")
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
    return driver

# 뉴스 페이지 크롤링
def crawl_page(url):
    try:
        driver = set_chrome_driver(False)
        driver.get(url)
        # 요소 변경 가능
        article_page = driver.find_element(By.CLASS_NAME, 'articlePage')
        text = article_page.text
        driver.close()
    except NoSuchElementException:
        text = ""
    return text

# ChatGPT 요약
def summarize(text):
    # 모델 엔진 선택
    model_engine = "text-davinci-003"

    # 맥스 토큰
    max_tokens = 2500
    
    # 프롬프트 (요약해줘!)
    prompt = f'''Summarize the paragraph below and interpret whether it is a positive or negative sentiment.

    {text}
    '''

    # 요약 요청
    completion = openai.Completion.create(
        engine=model_engine,
        prompt=prompt,
        max_tokens=max_tokens,
        temperature=0.3,      # creativity
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0
    )
    return completion.choices[0].text

# 파파고 번역
def papago_translate(text):
    try:
        papago = set_chrome_driver(False)
        papago.get('https://papago.naver.com/')
        time.sleep(1)
        papago.find_element(By.ID, 'txtSource').send_keys(text)
        papago.find_element(By.ID, 'btnTranslate').click()
        time.sleep(2)
        papago_translated = papago.find_element(By.ID, 'targetEditArea')
        result = papago_translated.text
    except NoSuchElementException: # 예외처리 (요소를 찾지 못하는 경우)
        result = '번역 오류ㅠㅠ'
    finally:
        papago.close()
    return result

# 최종 wrapper
def summarize_news(url):
    page = crawl_page(url)
    summarized = summarize(page)
    print('[원문 요약]')
    print(summarized)
    korean_translated = papago_translate(summarized)
    print('\n[한글 요약]')
    print(korean_translated)
    return korean_translated

1개의 샘플 URL로 테스트를 진행합니다.

_ = summarize_news('https://www.investing.com/analysis/traders-send-wheat-prices-spiking-as-allied-tanks-aid-to-roll-into-ukraine-200634894')
[원문 요약]

This paragraph is discussing the current state of the wheat market, which has been affected by the Ukraine war and other factors. It is a neutral sentiment, as it is simply providing information about the current state of the market.

[한글 요약]
이 단락은 우크라이나 전쟁과 다른 요인들의 영향을 받은 밀 시장의 현재 상태를 논의하고 있다. 그것은 단지 시장의 현재 상태에 대한 정보를 제공하는 것이기 때문에 중립적인 정서이다.

[MORE] TOP5 신문기사 크롤링+요약+번역

  • top5(Most Popular) 신문기사의 링크를 크롤링 해온 뒤 이를 모두 요약+번역할 수 있도록 발전시킨 코드입니다.

먼저, investing.com의 Most Popular에 게재된 Top5 신문기사를 크롤링합니다.

# most popular news 의 신문기사 요소를 크롤링 합니다
top5 = set_chrome_driver(False)
# URL 요청
top5.get('https://www.investing.com/news/most-popular-news')
# 5개의 요소만 가져옵니다.
top5.find_element(By.CLASS_NAME, 'largeTitle').find_elements(By.CLASS_NAME, 'js-article-item')[:5]
[<selenium.webdriver.remote.webelement.WebElement (session="7af93f9bf399bacaaf635c50b3e0fd94", element="ade4efab-ccba-4907-8527-12ddfd0f156c")>,
 <selenium.webdriver.remote.webelement.WebElement (session="7af93f9bf399bacaaf635c50b3e0fd94", element="ad1ba5e8-443f-475c-874d-6f53b9dc9043")>,
 <selenium.webdriver.remote.webelement.WebElement (session="7af93f9bf399bacaaf635c50b3e0fd94", element="8992e0f9-f6b5-4b82-a95e-741432b50039")>,
 <selenium.webdriver.remote.webelement.WebElement (session="7af93f9bf399bacaaf635c50b3e0fd94", element="0eec2883-fa65-4b1f-9b5a-6b41b0487256")>,
 <selenium.webdriver.remote.webelement.WebElement (session="7af93f9bf399bacaaf635c50b3e0fd94", element="3a1bbf5f-6fc0-44d7-92a8-9e73649c0c2e")>]

크롤링 한 Top5 신문기사의 URL을 추출합니다.

# 5개의 신문기사 URL 만 추출 합니다.
top5_links = []

for link in top5.find_element(By.CLASS_NAME, 'largeTitle').find_elements(By.CLASS_NAME, 'js-article-item')[:5]:
    top5_links.append(link.find_element(By.CSS_SELECTOR, 'a').get_attribute('href'))
    
top5_links
['https://www.investing.com/news/economy/fed-hikes-by-025-in-further-downshift-on-tightening-but-sees-more-hikes-ahead-2993513',
 'https://www.investing.com/news/economy/futures-dip-on-jitters-ahead-of-adp-report-fed-decision-2993229',
 'https://www.investing.com/news/stock-market-news/us-stocks-are-falling-ahead-of-the-feds-rate-decision-2993346',
 'https://www.investing.com/jp.php?v2=ZSU2aDZhYjhiM2xlZzZiZzNrZT83MDE6Z3BjMWBqNH1jJTQ9MWkydDc_OSdiPmE7M0BmOTY-MyUwZm89OntiIWUiNmg2ZmI3YjVsYWciYiMzb2U8NzgxNmdwYyBgag==',
 'https://www.investing.com/news/economy/fed-decision-snaps-falling-sales-meta-earnings--whats-moving-markets-2993057']

마지막으로, 이전에 생성한 summarize_news 함수를 활용하여 크롤링 + 요약 + 번역을 진행하고 결과를 확인합니다.

# 5개의 신문기사 링크에 대한 본문 크롤링+요약+번역 을 진행합니다.
top5_summarize = []

for link in top5_links:
    output = summarize_news(link)
    top5_summarize.append(output)
    print()
[원문 요약]

This paragraph is summarizing the Federal Reserve's decision to raise interest rates by 0.25%, and the sentiment is generally positive. It notes that the move is in line with the Fed's goal of returning inflation to 2%, and that the labor market remains strong. It also mentions that the markets are expecting Powell to push back against the easing financial conditions, but that the response will likely be positive.

[한글 요약]
이 문단은 연방준비제도이사회(FRB)가 금리를 0.25% 인상하기로 한 결정을 정리한 것으로 대체로 긍정적인 분위기다. 물가상승률을 2%로 되돌리겠다는 연준의 목표와 맞물려 노동시장이 강세를 유지하고 있다는 점에 주목한다. 그것은 또한 시장이 파월이 완화적인 금융 조건에 대해 반발할 것으로 기대하고 있지만, 반응은 긍정적일 가능성이 높다고 언급한다.

[원문 요약]

This paragraph is discussing the performance of the stock market in the US, with the Dow, S&P 500, and Nasdaq all falling ahead of the Federal Reserve's decision on interest rates. It also mentions that Advanced Micro Devices Inc had a positive outlook, while Snap Inc and Electronic Arts Inc had negative outlooks. Overall, the sentiment of this paragraph is negative as the stock market indexes are down and some companies had negative outlooks.

[한글 요약]
이 단락은 연방준비제도의 금리 결정을 앞두고 다우지수, S&P 500지수, 나스닥지수가 모두 하락한 가운데 미국 증시의 실적을 논의하고 있다. 또한 Advanced Micro Devices Inc는 긍정적인 전망을, Snap Inc와 Electronic Arts Inc는 부정적인 전망을 나타냈습니다. 전반적으로 이 단락의 정서는 주가지수가 하락하고 일부 기업들이 부정적인 전망을 했기 때문에 부정적이다.

[원문 요약]

This paragraph is summarizing the current state of the stock market and the Federal Reserve's decision on interest rates. It also mentions the job market, corporate earnings, and oil prices. Overall, the sentiment is neutral as some stocks are rising while others are falling.

[한글 요약]
이 단락은 주식 시장의 현황과 연방 준비 제도 이사회의 금리 결정을 요약하고 있다. 그것은 또한 고용 시장, 기업 수익, 유가를 언급한다. 일부 종목은 오르고 있는 반면 다른 종목은 하락하고 있어 전반적으로 심리가 중립적이다.

[원문 요약]

This product is not worth the money.

This paragraph expresses a negative sentiment about a product not being worth the money.

[한글 요약]
이 제품은 값어치가 없다.

이 단락은 제품이 그 돈의 가치가 없다는 부정적인 감정을 표현한다.

[원문 요약]

This paragraph provides an overview of the financial markets on February 1st. It mentions the Federal Reserve's decision to raise interest rates, economic data releases, stock market performance, euro zone inflation, OPEC+ production quotas, and U.S. crude futures. The sentiment is neutral, as it simply provides an overview of the day's events.

[한글 요약]
이 단락은 2월 1일의 금융 시장에 대한 개요를 제공한다. 금리 인상, 경제 데이터 공개, 증시 성과, 유로존 인플레이션, OPEC+ 생산 쿼터, 미국 원유 선물 등을 언급하고 있다. 그 정서는 단지 그날의 사건들에 대한 개요를 제공하기 때문에 중립적이다.

댓글남기기