Crawling Lotto results from round 1 to the latest round and saving them to a file (requests, beautifulsoup4)

February 2, 2023, 4 min read

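This tutorial crawls the Lotto site (Donghaeng Lottery, dhlottery.co.kr) for the winning numbers, bonus number, draw date, and related information for every round from round 1 through the latest one, converts the results to a DataFrame, and saves them as a CSV file.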

Practice file

🔥 Crawling the information

Date
Winning numbers
Bonus number
Prize amount per winner for 1st through 5th prize
Number of winners for each rank

import requests
from bs4 import BeautifulSoup


## Crawling round N from the Donghaeng Lottery site

# Example: crawl the data for round 10
count = 10

# Fetch the page with the round number {count} in the URL
url = f'https://dhlottery.co.kr/gameResult.do?method=byWin&drwNo={count}'
html = requests.get(url).text
soup = BeautifulSoup(html, 'lxml')

- Date

# Look up the date
soup.find('p', class_='desc')

<p class="desc">(2003년 02월 08일 추첨)</p>

# Extract only the date text
soup.find('p', class_='desc').text

'(2003년 02월 08일 추첨)'

Convert to datetime

from datetime import datetime

# Clean up the date (the format string must match the Korean page text exactly)
date = datetime.strptime(soup.find('p', class_='desc').text, '(%Y년 %m월 %d일 추첨)')
date

datetime.datetime(2003, 2, 8, 0, 0)
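`strptime` only succeeds when the text matches the format string exactly, so a stray space or line break in the page text would raise `ValueError`. As a hedged alternative, the sketch below pulls the year, month, and day out with a regular expression. The helper name `parse_draw_date` is an assumption made for illustration and is not part of the original code.

```python
import re
from datetime import datetime


def parse_draw_date(text):
    """Parse a draw date such as '(2003년 02월 08일 추첨)' into a datetime.

    Hypothetical helper: tolerates extra whitespace between the parts.
    """
    match = re.search(r'(\d{4})년\s*(\d{1,2})월\s*(\d{1,2})일', text)
    if match is None:
        raise ValueError(f'no draw date found in {text!r}')
    year, month, day = (int(part) for part in match.groups())
    return datetime(year, month, day)


print(parse_draw_date('(2003년 02월 08일 추첨)'))
```

The result is the same `datetime` value as the `strptime` call above, so either approach can feed the later steps.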

- Winning numbers

soup.find('div', class_='num win')

<div class="num win">
<strong>당첨번호</strong>
<p>
<span class="ball_645 lrg ball1">9</span>
<span class="ball_645 lrg ball3">25</span>
<span class="ball_645 lrg ball3">30</span>
<span class="ball_645 lrg ball4">33</span>
<span class="ball_645 lrg ball5">41</span>
<span class="ball_645 lrg ball5">44</span>
</p>
</div>

# Get the winning numbers as text
soup.find('div', class_='num win').find('p').text

'\n9\n25\n30\n33\n41\n44\n'

# Extract the winning numbers
soup.find('div', class_='num win').find('p').text.strip().split('\n')

['9', '25', '30', '33', '41', '44']

# Convert to int
[int(i) for i in soup.find('div', class_='num win').find('p').text.strip().split('\n')]

[9, 25, 30, 33, 41, 44]
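Splitting on `'\n'` works because each ball sits on its own line in the markup. A variant that may be less sensitive to whitespace changes is to read each `<span>` directly. The sketch below runs against a copy of the round 10 markup shown above; the helper name `extract_win_numbers` is hypothetical, and `html.parser` is used here only so the example does not depend on `lxml`.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the round 10 result page shown above
sample_html = '''
<div class="num win">
<strong>당첨번호</strong>
<p>
<span class="ball_645 lrg ball1">9</span>
<span class="ball_645 lrg ball3">25</span>
<span class="ball_645 lrg ball3">30</span>
<span class="ball_645 lrg ball4">33</span>
<span class="ball_645 lrg ball5">41</span>
<span class="ball_645 lrg ball5">44</span>
</p>
</div>
'''


def extract_win_numbers(soup):
    """Hypothetical helper: read each ball <span> inside the 'num win' block."""
    block = soup.find('div', class_='num win')
    return [int(span.text) for span in block.find_all('span')]


sample_soup = BeautifulSoup(sample_html, 'html.parser')
win_numbers = extract_win_numbers(sample_soup)
print(win_numbers)
```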

- Bonus number

soup.find('div', class_='num bonus')

<div class="num bonus">
<strong>보너스</strong>
<p><span class="ball_645 lrg ball1">6</span></p>
</div>

int(soup.find('div', class_='num bonus').find('p').text.strip())

- Extracting 1st through 5th prize information

soup.find('table', class_='tbl_data')

<table class="tbl_data tbl_data_col">
<caption>10회  순위별 등위별 총 당첨금액, 당첨게임 수, 1게임당 당첨금액, 당첨기준, 비고 안내</caption>
<colgroup>
<col style="width:85px"/>
<col style="width:180px"/>
<col style="width:145px"/>
<col style="width:180px"/>
<col/>
<col style="width:110px"/>
</colgroup>
<thead>
<tr>
<th scope="col">순위</th>
<th scope="col">등위별 총 당첨금액</th>
<th scope="col">당첨게임 수</th>
<th scope="col">1게임당 당첨금액</th>
<th scope="col">당첨기준</th>
<th scope="col">비고</th>
</tr>
</thead>
<tbody>
<tr>
<td>1등</td>
<td class="tar"><strong class="color_key1">83,595,692,700원</strong></td>
<td>13</td>
<td class="tar">6,430,437,900원</td>
<td>당첨번호 <strong class="length">6개</strong> 숫자일치</td>
<td rowspan="5">
</td>
</tr>
<tr>
<td>2등</td>
<td class="tar"><strong class="color_key1">9,631,962,400원</strong></td>
<td>236</td>
<td class="tar">40,813,400원</td>
<td class="nobd_right">당첨번호 <strong class="length">5개</strong> 숫자일치<br/>+<strong class="length">보너스</strong> 숫자일치</td>
</tr>
<tr>
<td>3등</td>
<td class="tar"><strong class="color_key1">9,631,930,800원</strong></td>
<td>11,247</td>
<td class="tar">856,400원</td>
<td class="nobd_right">당첨번호 <strong class="length">5개</strong> 숫자일치</td>
</tr>
<tr>
<td>4등</td>
<td class="tar"><strong class="color_key1">19,198,288,200원</strong></td>
<td>703,234</td>
<td class="tar">27,300원</td>
<td class="nobd_right">당첨번호 <strong class="length">4개</strong> 숫자일치</td>
</tr>
<tr>
<td>5등</td>
<td class="tar"><strong class="color_key1">34,108,460,000원</strong></td>
<td>3,410,846</td>
<td class="tar">10,000원</td>
<td class="nobd_right">당첨번호 <strong class="length">3개</strong> 숫자일치</td>
</tr>
</tbody>
</table>

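from io import StringIO

import pandas as pd

# Wrap the HTML in StringIO: recent pandas versions deprecate passing a literal HTML string
df = pd.read_html(StringIO(str(soup.find('table', class_='tbl_data'))))[0]
df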

	순위	등위별 총 당첨금액	당첨게임 수	1게임당 당첨금액	당첨기준	비고
0	1등	83,595,692,700원	13	6,430,437,900원	당첨번호 6개 숫자일치	NaN
1	2등	9,631,962,400원	236	40,813,400원	당첨번호 5개 숫자일치 +보너스 숫자일치	NaN
2	3등	9,631,930,800원	11247	856,400원	당첨번호 5개 숫자일치	NaN
3	4등	19,198,288,200원	703234	27,300원	당첨번호 4개 숫자일치	NaN
4	5등	34,108,460,000원	3410846	10,000원	당첨번호 3개 숫자일치	NaN
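In the output above the winner counts are already numeric, but the amount columns are still strings such as `83,595,692,700원`. A minimal sketch of turning such a string into an integer follows; `parse_amount` is a hypothetical helper name, and the sample values are copied from the round 10 table.

```python
def parse_amount(text):
    """Hypothetical helper: '6,430,437,900원' -> 6430437900."""
    return int(text.replace(',', '').replace('원', '').strip())


# Values copied from the 1st prize row of the round 10 table above
total_prize = parse_amount('83,595,692,700원')
per_game_prize = parse_amount('6,430,437,900원')
winners = 13

print(total_prize, per_game_prize)
# For this row, the total equals the number of winners times the per-game prize
print(total_prize == winners * per_game_prize)
```

With pandas, the same cleanup could be applied to a whole column, for example with `df['1게임당 당첨금액'].map(parse_amount)`.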

- Crawling rounds 1 through N in bulk

def crawling_lotto(count):
    # Fetch the page with the round number in the URL
    url = f'https://dhlottery.co.kr/gameResult.do?method=byWin&drwNo={count}'
    html = requests.get(url).text
    soup = BeautifulSoup(html, 'lxml')
    
    date = datetime.strptime(soup.find('p', class_='desc').text, '(%Y년 %m월 %d일 추첨)')
    win_number = [int(i) for i in soup.find('div', class_='num win').find('p').text.strip().split('\n')]
    bonus_number = int(soup.find('div', class_='num bonus').find('p').text.strip())
    
    return {
        'date': date, 
        'win_number': win_number, 
        'bonus_number': bonus_number
    }

# Crawl round 2 as a sample
result = crawling_lotto(2)

# Build and display the DataFrame
# (DataFrame.append was removed in pandas 2.0, so build it from a list of dicts)
data = pd.DataFrame([{'date': result['date'],
                      'num1': result['win_number'][0],
                      'num2': result['win_number'][1],
                      'num3': result['win_number'][2],
                      'num4': result['win_number'][3],
                      'num5': result['win_number'][4],
                      'num6': result['win_number'][5],
                      'bonus': result['bonus_number'],
                     }])
data

	date	num1	num2	num3	num4	num5	num6	bonus
0	2002-12-14	9	13	21	25	32	42	2
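The dictionary returned by `crawling_lotto` keeps the six winning numbers in a list, while the DataFrame wants one column per number. The sketch below shows one way to do that flattening in a single place; `to_row` is a hypothetical helper name, and the sample values are the round 2 results shown above.

```python
from datetime import datetime


def to_row(result):
    """Hypothetical helper: flatten a crawling_lotto() result into one row dict."""
    row = {'date': result['date']}
    # num1 .. num6, in the order the numbers appear in the list
    for position, number in enumerate(result['win_number'], start=1):
        row[f'num{position}'] = number
    row['bonus'] = result['bonus_number']
    return row


# Round 2 values taken from the output above
sample_result = {
    'date': datetime(2002, 12, 14),
    'win_number': [9, 13, 21, 25, 32, 42],
    'bonus_number': 2,
}
row = to_row(sample_result)
print(row)
```

A list of such rows can be passed straight to `pd.DataFrame(...)`.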

🔥 Getting the latest round

url = 'https://dhlottery.co.kr/common.do?method=main'
html = requests.get(url).text
soup = BeautifulSoup(html, 'lxml')
# Print the latest round number
max_count = int(soup.find('strong', id='lottoDrwNo').text)
max_count

🔥 [Final code] Crawling round 1 through the latest round

import warnings
import requests
from datetime import datetime
from tqdm.notebook import tqdm
import pandas as pd
from bs4 import BeautifulSoup


# Suppress warning messages
warnings.filterwarnings('ignore')

# Function that crawls the latest round number
def get_max_count():
    url = 'https://dhlottery.co.kr/common.do?method=main'
    html = requests.get(url).text
    soup = BeautifulSoup(html, 'lxml')
    max_count = int(soup.find('strong', id='lottoDrwNo').text)
    return max_count

# Function that looks up the winning numbers for a round
def crawling_lotto(count):
    # Fetch the page with the round number in the URL
    url = f'https://dhlottery.co.kr/gameResult.do?method=byWin&drwNo={count}'
    html = requests.get(url).text
    soup = BeautifulSoup(html, 'lxml')
    
    date = datetime.strptime(soup.find('p', class_='desc').text, '(%Y년 %m월 %d일 추첨)')
    win_number = [int(i) for i in soup.find('div', class_='num win').find('p').text.strip().split('\n')]
    bonus_number = int(soup.find('div', class_='num bonus').find('p').text.strip())
    
    return {
        'date': date, 
        'win_number': win_number, 
        'bonus_number': bonus_number
    }

# Get the latest round number
max_count = get_max_count()
# Crawl every round, collecting one dict per round
# (DataFrame.append was removed in pandas 2.0, so the rows are gathered in a list first)
rows = []
for i in tqdm(range(1, max_count+1)):
    result = crawling_lotto(i)
    rows.append({'date': result['date'],
                 'num1': result['win_number'][0],
                 'num2': result['win_number'][1],
                 'num3': result['win_number'][2],
                 'num4': result['win_number'][3],
                 'num5': result['win_number'][4],
                 'num6': result['win_number'][5],
                 'bonus': result['bonus_number'],
                })
data = pd.DataFrame(rows)
data

  0%|          | 0/1052 [00:00<?, ?it/s]

	date	num1	num2	num3	num4	num5	num6	bonus
0	2002-12-07	10	23	29	33	37	40	16
1	2002-12-14	9	13	21	25	32	42	2
2	2002-12-21	11	16	19	21	27	31	30
3	2002-12-28	14	27	30	31	40	42	2
4	2003-01-04	16	24	29	40	41	42	3
...	...	...	...	...	...	...	...	...
1047	2022-12-31	6	12	17	21	32	39	30
1048	2023-01-07	3	5	13	20	21	37	17
1049	2023-01-14	6	12	31	35	38	43	17
1050	2023-01-21	21	26	30	32	33	35	44
1051	2023-01-28	5	17	26	27	35	38	1

1052 rows × 8 columns
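The loop above sends one request per round, so over a thousand requests in a row, and a single network error stops the whole run. A minimal retry wrapper, sketched below under the assumption that a short pause and a second attempt are enough, is one way to soften that. `with_retries` and the fake `flaky_fetch` are hypothetical names used only for this demonstration.

```python
import time


def with_retries(func, attempts=3, delay=1.0):
    """Call func() and retry on any exception, sleeping `delay` seconds between tries.

    Hypothetical helper, not part of the original tutorial.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise  # give up after the last attempt
            time.sleep(delay)


# Demonstration with a fake fetcher that fails twice before succeeding
calls = {'count': 0}


def flaky_fetch():
    calls['count'] += 1
    if calls['count'] < 3:
        raise ConnectionError('temporary failure')
    return 'ok'


outcome = with_retries(flaky_fetch, attempts=3, delay=0)
print(outcome, calls['count'])
```

Inside the crawl loop, the call could then be written as `result = with_retries(lambda: crawling_lotto(i))`.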

Saving to CSV

data.to_csv('lotto-1052.csv', index=False)
