Building a BBC News Article Category Classifier with the bbc-text.csv Dataset
Using bbc-text.csv, a dataset of BBC news articles, we will build a word index with TensorFlow's Tokenizer and preprocess the data for training a natural language processing model.
We read the bbc-text.csv file with pandas, convert it to a DataFrame, and cover some simple preprocessing, including label encoding.
For the text preprocessing, we cover creating a tokenizer, building the word index, handling stopwords, and converting sentences to sequences.
For the modeling, we build a classifier for the BBC news categories with an Embedding layer and Bidirectional LSTM layers.
(This example walks through one of the past problems from the TensorFlow Developer Certificate exam.)
Dataset Reference
About this file
Source data from public data set on BBC news articles:
D. Greene and P. Cunningham. "Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering", Proc. ICML 2006.
http://mlg.ucd.ie/datasets/bbc.html
Cleaned up version exported to https://storage.googleapis.com/dataset-uploader/bbc/bbc-text.csv
Import required modules
import tensorflow as tf
import numpy as np
import urllib.request
import pandas as pd
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional
from tensorflow.keras.callbacks import ModelCheckpoint
Download the dataset
url = 'https://storage.googleapis.com/download.tensorflow.org/data/bbc-text.csv'
urllib.request.urlretrieve(url, 'bbc-text.csv')
Read the downloaded bbc-text.csv file and load it into the variable df.
df = pd.read_csv('bbc-text.csv')
df
| | category | text |
|---|---|---|
| 0 | tech | tv future in the hands of viewers with home th... |
| 1 | business | worldcom boss left books alone former worldc... |
| 2 | sport | tigers wary of farrell gamble leicester say ... |
| 3 | sport | yeading face newcastle in fa cup premiership s... |
| 4 | entertainment | ocean s twelve raids box office ocean s twelve... |
| ... | ... | ... |
| 2220 | business | cars pull down us retail figures us retail sal... |
| 2221 | politics | kilroy unveils immigration policy ex-chatshow ... |
| 2222 | entertainment | rem announce new glasgow concert us band rem h... |
| 2223 | politics | how political squabbles snowball it s become c... |
| 2224 | sport | souness delight at euro progress boss graeme s... |
2225 rows × 2 columns
Check label values
Check the classes of the category column.
df['category'].value_counts()
sport            511
business         510
politics         417
tech             401
entertainment    386
Name: category, dtype: int64
The value_counts() output above shows that the label column contains five classes: sport, business, politics, tech, and entertainment.
However, the TensorFlow Certificate exam provides the following guideline:
PLEASE NOTE -- WHILE THERE ARE 5 CATEGORIES, THEY ARE NUMBERED 1 THROUGH 5 IN THE DATASET
SO IF YOU ONE-HOT ENCODE THEM, THEY WILL END UP WITH 6 VALUES, SO THE OUTPUT LAYER HERE
SHOULD ALWAYS HAVE 6 NEURONS AS BELOW. MAKE SURE WHEN YOU ENCODE YOUR LABELS THAT YOU USE
THE SAME FORMAT, OR THE TESTS WILL FAIL
0 = UNUSED
1 = SPORT
2 = BUSINESS
3 = POLITICS
4 = TECH
5 = ENTERTAINMENT
In other words, there are indeed five categories, but label 0 is reserved as UNUSED, and the guideline fixes which category maps to labels 1 through 5.
You must encode the labels exactly according to these numbers; otherwise the grading server cannot score your submission correctly, so follow the mapping above when label encoding.
# category encoding map
m = {
'unused': 0,
'sport': 1,
'business': 2,
'politics': 3,
'tech': 4,
'entertainment': 5
}
# Encode the categories using map
df['category'] = df['category'].map(m)
df['category'].value_counts()
1    511
2    510
3    417
4    401
5    386
Name: category, dtype: int64
Since no rows had label 0, only labels 1 through 5 appear, as expected.
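If you later need to turn a numeric label back into its category name, the inverse of this mapping is handy. This is a small optional sketch that is not part of the original notebook; inv_m is a name introduced here purely for illustration.
# Invert the encoding map: number -> category name (illustrative helper)
inv_m = {v: k for k, v in m.items()}
print(inv_m[1])  # 'sport'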
# hyperparameter settings
vocab_size = 1000
embedding_dim = 16
max_length = 120
trunc_type='post'
padding_type='post'
oov_tok = "<OOV>"
training_size = 2000
# Define the stopwords
stopwords = [ "a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "could", "did", "do", "does", "doing", "down", "during", "each", "few", "for", "from", "further", "had", "has", "have", "having", "he", "he'd", "he'll", "he's", "her", "here", "here's", "hers", "herself", "him", "himself", "his", "how", "how's", "i", "i'd", "i'll", "i'm", "i've", "if", "in", "into", "is", "it", "it's", "its", "itself", "let's", "me", "more", "most", "my", "myself", "nor", "of", "on", "once", "only", "or", "other", "ought", "our", "ours", "ourselves", "out", "over", "own", "same", "she", "she'd", "she'll", "she's", "should", "so", "some", "such", "than", "that", "that's", "the", "their", "theirs", "them", "themselves", "then", "there", "there's", "these", "they", "they'd", "they'll", "they're", "they've", "this", "those", "through", "to", "too", "under", "until", "up", "very", "was", "we", "we'd", "we'll", "we're", "we've", "were", "what", "what's", "when", "when's", "where", "where's", "which", "while", "who", "who's", "whom", "why", "why's", "with", "would", "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves" ]
Below, we convert the text and category columns to lists, store them in the sentences and labels variables, and then perform some simple preprocessing.
# Separate the sentences and the labels.
sentences = df['text'].tolist()
labels = df['category'].tolist()
Create an empty list called cleaned_sentences, then for each sentence remove the stopwords and join the remaining words back into a single string before appending it.
What are stopwords?
Stopwords are words that occur so frequently in sentences that they carry little meaning on their own.
Words such as 'a', 'the', and 'in' appear in almost every phrase but add no information.
Because stopwords reduce the efficiency of natural language processing, it is best to remove them whenever possible.
cleaned_sentences = []
for sentence in sentences:
    # list comprehension: keep only the words that are not stopwords
cleaned = [word for word in sentence.split() if word not in stopwords]
cleaned_sentences.append(' '.join(cleaned))
# Before stopword removal
print(f'[Before stopword removal] {sentences[0][:100]}')
# After stopword removal
print(f'[After stopword removal] {cleaned_sentences[0][:100]}')
[Before stopword removal] tv future in the hands of viewers with home theatre systems plasma high-definition tvs and digital
[After stopword removal] tv future hands viewers home theatre systems plasma high-definition tvs digital video recorders movi
Split the data into train / validation sets.
train_sentences = cleaned_sentences[:training_size]
validation_sentences = cleaned_sentences[training_size:]
train_labels = labels[:training_size]
validation_labels = labels[training_size:]
Define the tokenizer
Create a tensorflow.keras.preprocessing.text.Tokenizer.
- num_words: specifies how many words of the vocabulary to use.
- oov_token: specifies the out-of-vocabulary (OOV) token. It is usually set to a string that will not collide with real words (typically an unusual combination of special characters).
# vocab_size = 1000
# specify the oov_token
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
# Build the word index.
tokenizer.fit_on_texts(train_sentences)
# Convert the sentences into sequences.
train_sequences = tokenizer.texts_to_sequences(train_sentences)
validation_sequences = tokenizer.texts_to_sequences(validation_sentences)
# Pad the sequences so that every sentence has the same length.
# maxlen is set to 120 words.
train_padded = pad_sequences(train_sequences, padding=padding_type, maxlen=max_length, truncating=trunc_type)
validation_padded = pad_sequences(validation_sequences, padding=padding_type, maxlen=max_length, truncating=trunc_type)
# Check the shape of the result
train_padded.shape
(2000, 120)
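If the padding and truncation options are unclear, the tiny standalone example below (an illustrative sketch, not part of the original notebook) shows what padding='post' and truncating='post' do: sequences shorter than maxlen are zero-padded at the end, and longer sequences are cut off at the end.
# Toy example of post padding / post truncating (illustrative only)
demo = [[1, 2, 3], [4, 5, 6, 7, 8, 9, 10]]
print(pad_sequences(demo, maxlen=5, padding='post', truncating='post'))
# [[1 2 3 0 0]
#  [4 5 6 7 8]]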
# Print the 0th index
train_padded[0]
array([101, 176, 1, 1, 54, 1, 782, 1, 95, 1, 1, 143, 188, 1, 1, 1, 1, 47, 9, 934, 101, 4, 1, 371, 87, 23, 17, 144, 1, 1, 1, 588, 454, 1, 71, 1, 1, 1, 10, 834, 4, 800, 12, 869, 1, 11, 643, 1, 1, 412, 4, 1, 1, 775, 54, 559, 1, 1, 1, 148, 303, 128, 1, 801, 1, 1, 599, 12, 1, 1, 834, 1, 143, 354, 188, 1, 1, 1, 42, 68, 1, 31, 11, 2, 1, 22, 2, 1, 138, 439, 9, 146, 1, 80, 1, 471, 1, 101, 1, 86, 1, 93, 1, 61, 1, 101, 8, 1, 644, 95, 1, 101, 1, 139, 164, 469, 11, 1, 46, 56], dtype=int32)
Many of the words are marked as 1; these are the words that were mapped to the OOV token.
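To verify this, you can decode the padded sequence back into words using the tokenizer's index_word mapping. This check is an optional sketch and is not in the original notebook.
# Decode the first padded training example back into words.
# Index 0 is the padding value, index 1 maps to the <OOV> token.
decoded = ' '.join(tokenizer.index_word.get(idx, '') for idx in train_padded[0] if idx != 0)
print(decoded[:100])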
# Convert the labels to numpy arrays.
train_labels = np.array(train_labels)
validation_labels = np.array(validation_labels)
Model
# Build the model
model = Sequential([
Embedding(vocab_size, embedding_dim, input_length=max_length),
Bidirectional(LSTM(64, return_sequences=True)),
Bidirectional(LSTM(64)),
Dense(32, activation='relu'),
Dense(16, activation='relu'),
Dense(6, activation='softmax')
])
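Before compiling, it can be useful to check the layer output shapes and parameter counts. This is just an optional inspection step, not shown in the original notebook.
# Print the layer output shapes and parameter counts
model.summary()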
For compilation, the loss is set to sparse_categorical_crossentropy, because the labels were not one-hot encoded.
# model compile
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['acc'])
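For reference, if you had one-hot encoded the labels instead (for example with tf.keras.utils.to_categorical), you would use categorical_crossentropy. The commented sketch below is only an illustration of that alternative and is not a step in this notebook.
# Alternative (not used here): one-hot encode the labels and switch the loss.
# one_hot_train_labels = tf.keras.utils.to_categorical(train_labels, num_classes=6)
# model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])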
Create a checkpoint. After training completes, we will load the checkpoint with the lowest val_loss.
checkpoint_path = 'bbc_checkpoint.ckpt'
checkpoint = ModelCheckpoint(checkpoint_path,
save_weights_only=True,
save_best_only=True,
monitor='val_loss',
verbose=1)
Train the model. Give it enough epochs to reach the desired val_loss and val_acc.
history = model.fit(train_padded, train_labels,
validation_data=(validation_padded, validation_labels),
callbacks=[checkpoint],
epochs=30)
Epoch 1/30
Epoch 00001: val_loss improved from inf to 1.63404, saving model to bbc_checkpoint.ckpt
63/63 [==============================] - 5s 33ms/step - loss: 1.6717 - acc: 0.2075 - val_loss: 1.6340 - val_acc: 0.1911
Epoch 2/30
Epoch 00002: val_loss improved from 1.63404 to 1.37903, saving model to bbc_checkpoint.ckpt
63/63 [==============================] - 1s 20ms/step - loss: 1.4598 - acc: 0.3315 - val_loss: 1.3790 - val_acc: 0.3333
Epoch 3/30
Epoch 00003: val_loss improved from 1.37903 to 1.16279, saving model to bbc_checkpoint.ckpt
63/63 [==============================] - 1s 20ms/step - loss: 1.2487 - acc: 0.4250 - val_loss: 1.1628 - val_acc: 0.4444
Epoch 4/30
Epoch 00004: val_loss improved from 1.16279 to 1.02034, saving model to bbc_checkpoint.ckpt
63/63 [==============================] - 1s 20ms/step - loss: 1.0172 - acc: 0.5305 - val_loss: 1.0203 - val_acc: 0.5111
Epoch 5/30
Epoch 00005: val_loss did not improve from 1.02034
63/63 [==============================] - 1s 20ms/step - loss: 0.8989 - acc: 0.5805 - val_loss: 1.0432 - val_acc: 0.5422
Epoch 6/30
Epoch 00006: val_loss improved from 1.02034 to 0.72509, saving model to bbc_checkpoint.ckpt
63/63 [==============================] - 1s 20ms/step - loss: 0.7021 - acc: 0.7140 - val_loss: 0.7251 - val_acc: 0.6756
Epoch 7/30
Epoch 00007: val_loss improved from 0.72509 to 0.70795, saving model to bbc_checkpoint.ckpt
63/63 [==============================] - 1s 20ms/step - loss: 0.5455 - acc: 0.7890 - val_loss: 0.7079 - val_acc: 0.6711
Epoch 8/30
Epoch 00008: val_loss improved from 0.70795 to 0.45064, saving model to bbc_checkpoint.ckpt
63/63 [==============================] - 1s 20ms/step - loss: 0.3484 - acc: 0.8730 - val_loss: 0.4506 - val_acc: 0.8311
Epoch 9/30
Epoch 00009: val_loss did not improve from 0.45064
63/63 [==============================] - 1s 20ms/step - loss: 0.2440 - acc: 0.9185 - val_loss: 0.4995 - val_acc: 0.8533
Epoch 10/30
Epoch 00010: val_loss improved from 0.45064 to 0.43606, saving model to bbc_checkpoint.ckpt
63/63 [==============================] - 1s 20ms/step - loss: 0.2450 - acc: 0.9200 - val_loss: 0.4361 - val_acc: 0.8844
Epoch 11/30
Epoch 00011: val_loss did not improve from 0.43606
63/63 [==============================] - 1s 20ms/step - loss: 0.1298 - acc: 0.9640 - val_loss: 0.5338 - val_acc: 0.8667
Epoch 12/30
Epoch 00012: val_loss did not improve from 0.43606
63/63 [==============================] - 1s 20ms/step - loss: 0.1040 - acc: 0.9715 - val_loss: 0.4474 - val_acc: 0.8844
Epoch 13/30
Epoch 00013: val_loss did not improve from 0.43606
63/63 [==============================] - 1s 20ms/step - loss: 0.1688 - acc: 0.9425 - val_loss: 0.7164 - val_acc: 0.8044
Epoch 14/30
Epoch 00014: val_loss did not improve from 0.43606
63/63 [==============================] - 1s 20ms/step - loss: 0.1816 - acc: 0.9475 - val_loss: 0.5031 - val_acc: 0.8667
Epoch 15/30
Epoch 00015: val_loss did not improve from 0.43606
63/63 [==============================] - 1s 20ms/step - loss: 0.0920 - acc: 0.9710 - val_loss: 0.5199 - val_acc: 0.8622
Epoch 16/30
Epoch 00016: val_loss did not improve from 0.43606
63/63 [==============================] - 1s 20ms/step - loss: 0.0442 - acc: 0.9890 - val_loss: 0.4765 - val_acc: 0.9022
Epoch 17/30
Epoch 00017: val_loss did not improve from 0.43606
63/63 [==============================] - 1s 20ms/step - loss: 0.0494 - acc: 0.9840 - val_loss: 0.5070 - val_acc: 0.8889
Epoch 18/30
Epoch 00018: val_loss did not improve from 0.43606
63/63 [==============================] - 1s 20ms/step - loss: 0.0444 - acc: 0.9855 - val_loss: 0.5073 - val_acc: 0.8844
Epoch 19/30
Epoch 00019: val_loss did not improve from 0.43606
63/63 [==============================] - 1s 20ms/step - loss: 0.0405 - acc: 0.9865 - val_loss: 0.7229 - val_acc: 0.8578
Epoch 20/30
Epoch 00020: val_loss did not improve from 0.43606
63/63 [==============================] - 1s 20ms/step - loss: 0.0636 - acc: 0.9795 - val_loss: 0.5101 - val_acc: 0.9022
Epoch 21/30
Epoch 00021: val_loss did not improve from 0.43606
63/63 [==============================] - 1s 20ms/step - loss: 0.0328 - acc: 0.9900 - val_loss: 0.6735 - val_acc: 0.8844
Epoch 22/30
Epoch 00022: val_loss did not improve from 0.43606
63/63 [==============================] - 1s 20ms/step - loss: 0.0217 - acc: 0.9935 - val_loss: 0.5494 - val_acc: 0.9067
Epoch 23/30
Epoch 00023: val_loss did not improve from 0.43606
63/63 [==============================] - 1s 20ms/step - loss: 0.0378 - acc: 0.9885 - val_loss: 0.5719 - val_acc: 0.9111
Epoch 24/30
Epoch 00024: val_loss did not improve from 0.43606
63/63 [==============================] - 1s 20ms/step - loss: 0.0804 - acc: 0.9730 - val_loss: 0.6355 - val_acc: 0.8844
Epoch 25/30
Epoch 00025: val_loss did not improve from 0.43606
63/63 [==============================] - 1s 20ms/step - loss: 0.0592 - acc: 0.9875 - val_loss: 0.5286 - val_acc: 0.9067
Epoch 26/30
Epoch 00026: val_loss did not improve from 0.43606
63/63 [==============================] - 1s 20ms/step - loss: 0.0162 - acc: 0.9950 - val_loss: 0.5403 - val_acc: 0.9022
Epoch 27/30
Epoch 00027: val_loss did not improve from 0.43606
63/63 [==============================] - 1s 20ms/step - loss: 0.0125 - acc: 0.9975 - val_loss: 0.5612 - val_acc: 0.9022
Epoch 28/30
Epoch 00028: val_loss did not improve from 0.43606
63/63 [==============================] - 1s 20ms/step - loss: 0.0073 - acc: 0.9990 - val_loss: 0.5918 - val_acc: 0.9022
Epoch 29/30
Epoch 00029: val_loss did not improve from 0.43606
63/63 [==============================] - 1s 19ms/step - loss: 0.0087 - acc: 0.9990 - val_loss: 0.6322 - val_acc: 0.8978
Epoch 30/30
Epoch 00030: val_loss did not improve from 0.43606
63/63 [==============================] - 1s 19ms/step - loss: 0.0070 - acc: 0.9985 - val_loss: 0.6204 - val_acc: 0.9022
After training completes, load the saved checkpoint.
# Load the checkpoint
model.load_weights(checkpoint_path)
Run the final evaluation with validation_padded and validation_labels.
# Evaluate the model
model.evaluate(validation_padded, validation_labels)
8/8 [==============================] - 0s 8ms/step - loss: 0.4361 - acc: 0.8844
[0.43605977296829224, 0.8844444155693054]
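As a final usage sketch (not part of the original notebook), you can run the same preprocessing on a new article and map the predicted label back to its category name. The variables new_article and category_names below are introduced here purely for illustration; category_names follows the label mapping defined earlier.
# Classify a new (hypothetical) article with the trained model
category_names = ['unused', 'sport', 'business', 'politics', 'tech', 'entertainment']

new_article = 'the team won the championship after a dramatic penalty shootout'
# Apply the same preprocessing as the training data: remove stopwords, tokenize, pad
cleaned = ' '.join(word for word in new_article.split() if word not in stopwords)
seq = tokenizer.texts_to_sequences([cleaned])
padded = pad_sequences(seq, padding=padding_type, maxlen=max_length, truncating=trunc_type)

pred = model.predict(padded)             # shape (1, 6): probability per label
label = int(np.argmax(pred, axis=1)[0])  # most likely label (1~5)
print(label, category_names[label])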
Visualize the training results
import matplotlib.pyplot as plt
fig, axes = plt.subplots(1, 2)
fig.set_size_inches(10, 4)
axes[0].plot(history.history['loss'], color='#5A98BF', alpha=0.5, linestyle=':', label='loss')
axes[0].plot(history.history['val_loss'], color='#5A98BF', linestyle='-', label='val_loss')
axes[0].set_xlabel('Epochs', fontsize=10)
axes[0].set_ylabel('Loss', fontsize=10)
axes[0].set_title('Losses')
axes[0].tick_params(axis='both', which='major', labelsize=8)
axes[0].tick_params(axis='both', which='minor', labelsize=6)
axes[0].legend()
axes[1].plot(history.history['acc'], color='#F2294E', alpha=0.3, linestyle=':', label='acc')
axes[1].plot(history.history['val_acc'], color='#F2294E', linestyle='-', label='val_acc')
axes[1].set_xlabel('Epochs')
axes[1].set_ylabel('Accuracy')
axes[1].set_title('Accuracy')
axes[1].tick_params(axis='both', which='major', labelsize=8)
axes[1].tick_params(axis='both', which='minor', labelsize=6)
axes[1].legend()
plt.show()