TensorFlow 2.0 - Detecting sarcasm in news data with word tokenization, Embedding, and LSTM layers
Apr 4, 2020

Let's use TensorFlow 2.0 to build a deep learning model that judges whether a news headline from a Kaggle dataset is sarcastic.

sarcastic (sarcasm)

  • American [sɑːrˈk-], British [sɑːˈkæstɪk]
  • Meaning: mocking, taunting

Source: Naver Dictionary

Overview

This is a classification problem: given a news headline (an English sentence), decide whether or not the article is sarcastic.

0. Import the required libraries

In [1]:
import json
import tensorflow as tf
import numpy as np
import urllib.request  # urlretrieve lives in the urllib.request submodule
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Dense, Embedding, LSTM, Bidirectional
from tensorflow.keras.models import Sequential

1. Load the sarcasm data

In [2]:
url = 'https://storage.googleapis.com/download.tensorflow.org/data/sarcasm.json'
urllib.request.urlretrieve(url, 'sarcasm.json')
Out[2]:
('sarcasm.json', <http.client.HTTPMessage at 0x7fbf3b535668>)

Load the sarcasm data with json.load().

In [3]:
with open('sarcasm.json', 'r') as f:
    data = json.load(f)
In [4]:
data[:5]
Out[4]:
[{'article_link': 'https://www.huffingtonpost.com/entry/versace-black-code_us_5861fbefe4b0de3a08f600d5',
  'headline': "former versace store clerk sues over secret 'black code' for minority shoppers",
  'is_sarcastic': 0},
 {'article_link': 'https://www.huffingtonpost.com/entry/roseanne-revival-review_us_5ab3a497e4b054d118e04365',
  'headline': "the 'roseanne' revival catches up to our thorny political mood, for better and worse",
  'is_sarcastic': 0},
 {'article_link': 'https://local.theonion.com/mom-starting-to-fear-son-s-web-series-closest-thing-she-1819576697',
  'headline': "mom starting to fear son's web series closest thing she will have to grandchild",
  'is_sarcastic': 1},
 {'article_link': 'https://politics.theonion.com/boehner-just-wants-wife-to-listen-not-come-up-with-alt-1819574302',
  'headline': 'boehner just wants wife to listen, not come up with alternative debt-reduction ideas',
  'is_sarcastic': 1},
 {'article_link': 'https://www.huffingtonpost.com/entry/jk-rowling-wishes-snape-happy-birthday_us_569117c4e4b0cad15e64fdcb',
  'headline': 'j.k. rowling wishes snape happy birthday in the most magical way',
  'is_sarcastic': 0}]

article_link holds the link to the news article, headline holds the article's headline, and is_sarcastic holds the label that marks whether the headline is sarcastic.

2. Define features and labels

In [5]:
sentences = []
labels = []
In [6]:
for d in data:
    sentences.append(d['headline'])
    labels.append(d['is_sarcastic'])

3. Split into train and validation datasets

In [7]:
# Define the ratio of the data to use for training.
train_ratio = 0.8

train_size = int(len(data) * train_ratio)
In [8]:
train_size, len(data)
Out[8]:
(21367, 26709)
In [9]:
# split the sentences
train_sentences = sentences[:train_size]
valid_sentences = sentences[train_size:]
# split the labels
train_labels = labels[:train_size]
valid_labels = labels[train_size:]
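
The split above takes the first 80% of the records in file order. If the file happened to be ordered in some way (an assumption, not something verified here), shuffling before splitting may give a more representative validation set. A minimal sketch, where `shuffled_split` is a hypothetical helper that is not part of the original notebook and the toy data is made up for illustration:

```python
import random

def shuffled_split(sentences, labels, train_ratio=0.8, seed=42):
    """Shuffle sentence/label pairs together, then split them by train_ratio."""
    pairs = list(zip(sentences, labels))
    random.Random(seed).shuffle(pairs)  # local RNG, leaves the global random state alone
    train_size = int(len(pairs) * train_ratio)
    train, valid = pairs[:train_size], pairs[train_size:]
    train_x = [s for s, _ in train]
    train_y = [y for _, y in train]
    valid_x = [s for s, _ in valid]
    valid_y = [y for _, y in valid]
    return train_x, train_y, valid_x, valid_y

# Toy data (hypothetical): the label is simply the parity of the headline number
toy_sentences = [f"headline {i}" for i in range(10)]
toy_labels = [i % 2 for i in range(10)]

train_x, train_y, valid_x, valid_y = shuffled_split(toy_sentences, toy_labels)
print(len(train_x), len(valid_x))
```

Shuffling the pairs together keeps each headline attached to its own label, which is the property that matters here.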

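4. Tokenization (Tokenize)

vocab_size means that tokenization keeps only roughly the 1,000 most frequent words; every other word is treated as out-of-vocabulary and replaced with the <OOV> token.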

In [10]:
vocab_size = 1000
In [11]:
token = Tokenizer(num_words=vocab_size, oov_token='<OOV>')
In [12]:
token.fit_on_texts(sentences)
In [13]:
word_index = token.word_index
In [14]:
word_index
Out[14]:
{'<OOV>': 1,
 'to': 2,
 'of': 3,
 'the': 4,
 'in': 5,
 'for': 6,
 'a': 7,
 'on': 8,
 'and': 9,
 'with': 10,
 ...
 ...
 ...
 'explains': 990,
 'table': 991,
 'energy': 992,
 'users': 993,
 'feeling': 994,
 'sales': 995,
 'colbert': 996,
 'apparently': 997,
 "let's": 998,
 'amazing': 999,
 'went': 1000,
 ...}

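We can see that a dict mapping each word to an index has been built. Note that word_index contains every word that was seen; the num_words limit is applied later, when texts are converted to sequences.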

In [15]:
word_index['party']
Out[15]:
149

5. Convert to sequences

In [16]:
train_sequences = token.texts_to_sequences(train_sentences)
valid_sequences = token.texts_to_sequences(valid_sentences)

The sentences, which are made of words, have been converted by the Tokenizer into numerical values that a machine can understand.

In [17]:
train_sentences[:5]
Out[17]:
["former versace store clerk sues over secret 'black code' for minority shoppers",
 "the 'roseanne' revival catches up to our thorny political mood, for better and worse",
 "mom starting to fear son's web series closest thing she will have to grandchild",
 'boehner just wants wife to listen, not come up with alternative debt-reduction ideas',
 'j.k. rowling wishes snape happy birthday in the most magical way']
In [18]:
train_sequences[:5]
Out[18]:
[[308, 1, 679, 1, 1, 48, 382, 1, 1, 6, 1, 1],
 [4, 1, 1, 1, 22, 2, 166, 1, 416, 1, 6, 258, 9, 1],
 [145, 838, 2, 907, 1, 1, 582, 1, 221, 143, 39, 46, 2, 1],
 [1, 36, 224, 400, 2, 1, 29, 319, 22, 10, 1, 1, 1, 968],
 [767, 719, 1, 908, 1, 623, 594, 5, 4, 95, 1, 92]]
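
The many 1s in the output above are words that fall outside the vocab_size limit and were mapped to <OOV>. The idea can be sketched in plain Python. This is a simplified illustration, not the Keras implementation: it only lowercases and splits on whitespace (the real Tokenizer also filters punctuation), and `build_word_index` and `texts_to_sequences` are hypothetical helper names.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Rank words by frequency; index 1 is reserved for the OOV token."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def texts_to_sequences(texts, word_index, num_words, oov_token="<OOV>"):
    """Map words to indices; unknown words and indices >= num_words become OOV."""
    oov = word_index[oov_token]
    sequences = []
    for text in texts:
        seq = []
        for word in text.lower().split():
            idx = word_index.get(word)
            seq.append(idx if idx is not None and idx < num_words else oov)
        sequences.append(seq)
    return sequences

corpus = ["the cat sat", "the cat ran", "the dog barked"]
word_index = build_word_index(corpus)
sequences = texts_to_sequences(["the dog sat on the cat"], word_index, num_words=4)

print(word_index["the"], word_index["cat"])  # 2 3
print(sequences)                             # [[2, 1, 1, 1, 2, 3]]
```

With num_words=4 only indices 1 to 3 survive, so "dog", "sat", and the unseen word "on" all collapse to 1.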

6. Match sentence lengths (pad_sequences)

For training, all inputs must have the same length.

Right now the sequences have uneven lengths.

pad_sequences makes the lengths equal by truncating sentences that are too long and padding sentences that are too short.

Padding means filling the missing positions with 0 or some other constant.

pad_sequences options

  • truncating: 'post' / 'pre' - when a sentence is longer than maxlen, cuts off the end / beginning.
  • padding: 'post' / 'pre' - when a sentence is shorter than maxlen, pads the end / beginning.
  • maxlen: defines the maximum sentence length.
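
The effect of these options can be illustrated with a small pure-Python sketch. `pad_and_truncate` is a hypothetical stand-in written for this illustration and only mimics the behavior described above; the notebook itself uses the real pad_sequences.

```python
def pad_and_truncate(sequences, maxlen, padding="pre", truncating="pre", value=0):
    """Cut sequences longer than maxlen and fill shorter ones with `value`."""
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # 'pre' drops items from the front, 'post' drops them from the back
            seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        # 'pre' puts the filler in front, 'post' puts it after the sequence
        padded.append(fill + seq if padding == "pre" else seq + fill)
    return padded

sequences = [[5, 6, 7], [1, 2, 3, 4, 5, 6]]
post = pad_and_truncate(sequences, maxlen=4, padding="post", truncating="post")
pre = pad_and_truncate(sequences, maxlen=4, padding="pre", truncating="pre")

print(post)  # [[5, 6, 7, 0], [1, 2, 3, 4]]
print(pre)   # [[0, 5, 6, 7], [3, 4, 5, 6]]
```
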
In [19]:
_truncating = 'post'
_padding = 'post'
_maxlen = 120
In [20]:
train_padded = pad_sequences(train_sequences, truncating=_truncating, padding=_padding, maxlen=_maxlen)
valid_padded = pad_sequences(valid_sequences, truncating=_truncating, padding=_padding, maxlen=_maxlen)

7. Convert labels to np.array

A plain list type is not accepted for the labels, so we convert labels to np.array.

In [21]:
train_labels = np.asarray(train_labels)
valid_labels = np.asarray(valid_labels)
In [22]:
train_labels, valid_labels
Out[22]:
(array([0, 0, 1, ..., 0, 1, 1]), array([1, 1, 1, ..., 0, 0, 0]))

8. Modeling

Because vocab_size = 1000, each word can be thought of as a point in a 1000-dimensional (one-hot) space.

The Embedding layer maps this down to 16 dimensions, which addresses data sparsity and lets the model train more efficiently.
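
To make the dimensionality reduction concrete: multiplying a 1000-dimensional one-hot vector by a (1000, 16) weight matrix simply selects one row of that matrix, which is effectively what an embedding lookup does. A small NumPy sketch, assuming a random matrix as a stand-in for the trained embedding weights:

```python
import numpy as np

vocab_size, embedding_dim = 1000, 16
rng = np.random.default_rng(0)
# Stand-in for the trainable embedding matrix: one 16-dim row per word index
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

word_id = 149  # the index of 'party' in word_index above
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0

via_matmul = one_hot @ embedding_matrix  # one-hot vector times the (1000, 16) matrix
via_lookup = embedding_matrix[word_id]   # direct row lookup

print(via_matmul.shape)                     # (16,)
print(np.allclose(via_matmul, via_lookup))  # True
```
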

In [23]:
embedding_dim = 16
In [24]:
model = Sequential([
        Embedding(vocab_size, embedding_dim, input_length=_maxlen),
        Bidirectional(LSTM(32)),
        Dense(24, activation='relu'),
        Dense(1, activation='sigmoid')
    ])
In [25]:
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, 120, 16)           16000     
_________________________________________________________________
bidirectional (Bidirectional (None, 64)                12544     
_________________________________________________________________
dense (Dense)                (None, 24)                1560      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 25        
=================================================================
Total params: 30,129
Trainable params: 30,129
Non-trainable params: 0
_________________________________________________________________
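
The parameter counts in the summary can be reproduced by hand. The arithmetic below is a sketch that uses the standard LSTM parameter formula (4 gates, each with input weights, recurrent weights, and a bias); the variable names are chosen for this illustration.

```python
vocab_size, embedding_dim, lstm_units, dense_units = 1000, 16, 32, 24

embedding_params = vocab_size * embedding_dim
# 4 gates, each with (input + recurrent + bias) weights per unit
lstm_params = 4 * ((embedding_dim + lstm_units + 1) * lstm_units)
bidirectional_params = 2 * lstm_params  # forward LSTM + backward LSTM
# The Dense layer receives the 64-dim concatenation of both directions
dense_params = (2 * lstm_units) * dense_units + dense_units
output_params = dense_units * 1 + 1

total_params = embedding_params + bidirectional_params + dense_params + output_params

print(embedding_params)      # 16000
print(bidirectional_params)  # 12544
print(dense_params)          # 1560
print(output_params)         # 25
print(total_params)          # 30129
```
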

9. Define a callback - to save the best validation weights

The weights are saved every time the validation performance improves. (We will load them later for prediction.)

In [26]:
checkpoint_path = 'best_performed_model.ckpt'
checkpoint = tf.keras.callbacks.ModelCheckpoint(checkpoint_path, 
                                                save_weights_only=True, 
                                                save_best_only=True, 
                                                monitor='val_loss',
                                                verbose=1)

We use the adam optimizer, and because the task is to predict 0 or 1, we use binary_crossentropy as the loss.
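
For intuition, binary cross-entropy averages -(y·log(p) + (1-y)·log(1-p)) over the samples, so confident wrong predictions are penalized heavily. A plain-Python sketch of the formula, where `eps` is a small clipping value assumed here to avoid log(0):

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # keep p away from exactly 0 or 1
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])

print(round(confident_right, 4))  # 0.1054
print(round(confident_wrong, 4))  # 2.3026
```
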

In [27]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
In [28]:
history = model.fit(train_padded, train_labels, 
                    validation_data=(valid_padded, valid_labels),
                    callbacks=[checkpoint],
                    epochs=20, 
                    verbose=2)
Train on 21367 samples, validate on 5342 samples
Epoch 1/20

Epoch 00001: val_loss improved from inf to 0.38502, saving model to best_performed_model.ckpt
21367/21367 - 14s - loss: 0.4568 - accuracy: 0.7666 - val_loss: 0.3850 - val_accuracy: 0.8231
Epoch 2/20

Epoch 00002: val_loss improved from 0.38502 to 0.38251, saving model to best_performed_model.ckpt
21367/21367 - 12s - loss: 0.3507 - accuracy: 0.8391 - val_loss: 0.3825 - val_accuracy: 0.8210
Epoch 3/20

Epoch 00003: val_loss improved from 0.38251 to 0.36455, saving model to best_performed_model.ckpt
21367/21367 - 12s - loss: 0.3296 - accuracy: 0.8521 - val_loss: 0.3645 - val_accuracy: 0.8287
Epoch 4/20

Epoch 00004: val_loss did not improve from 0.36455
21367/21367 - 12s - loss: 0.3169 - accuracy: 0.8585 - val_loss: 0.3683 - val_accuracy: 0.8362
Epoch 5/20

Epoch 00005: val_loss did not improve from 0.36455
21367/21367 - 12s - loss: 0.3081 - accuracy: 0.8603 - val_loss: 0.3691 - val_accuracy: 0.8347
Epoch 6/20

Epoch 00006: val_loss did not improve from 0.36455
21367/21367 - 12s - loss: 0.3016 - accuracy: 0.8678 - val_loss: 0.3703 - val_accuracy: 0.8355
Epoch 7/20

Epoch 00007: val_loss did not improve from 0.36455
21367/21367 - 12s - loss: 0.2964 - accuracy: 0.8699 - val_loss: 0.3706 - val_accuracy: 0.8328
Epoch 8/20

Epoch 00008: val_loss did not improve from 0.36455
21367/21367 - 12s - loss: 0.2893 - accuracy: 0.8736 - val_loss: 0.3983 - val_accuracy: 0.8291
Epoch 9/20

Epoch 00009: val_loss did not improve from 0.36455
21367/21367 - 12s - loss: 0.2819 - accuracy: 0.8770 - val_loss: 0.3820 - val_accuracy: 0.8353
Epoch 10/20

Epoch 00010: val_loss did not improve from 0.36455
21367/21367 - 12s - loss: 0.2745 - accuracy: 0.8796 - val_loss: 0.3855 - val_accuracy: 0.8353
Epoch 11/20

Epoch 00011: val_loss did not improve from 0.36455
21367/21367 - 12s - loss: 0.2653 - accuracy: 0.8859 - val_loss: 0.3906 - val_accuracy: 0.8338
Epoch 12/20

Epoch 00012: val_loss did not improve from 0.36455
21367/21367 - 12s - loss: 0.2573 - accuracy: 0.8883 - val_loss: 0.4042 - val_accuracy: 0.8345
Epoch 13/20

Epoch 00013: val_loss did not improve from 0.36455
21367/21367 - 12s - loss: 0.2510 - accuracy: 0.8930 - val_loss: 0.4072 - val_accuracy: 0.8276
Epoch 14/20

Epoch 00014: val_loss did not improve from 0.36455
21367/21367 - 12s - loss: 0.2429 - accuracy: 0.8959 - val_loss: 0.4236 - val_accuracy: 0.8311
Epoch 15/20

Epoch 00015: val_loss did not improve from 0.36455
21367/21367 - 12s - loss: 0.2362 - accuracy: 0.8981 - val_loss: 0.4175 - val_accuracy: 0.8300
Epoch 16/20

Epoch 00016: val_loss did not improve from 0.36455
21367/21367 - 12s - loss: 0.2287 - accuracy: 0.9049 - val_loss: 0.4293 - val_accuracy: 0.8291
Epoch 17/20

Epoch 00017: val_loss did not improve from 0.36455
21367/21367 - 12s - loss: 0.2214 - accuracy: 0.9064 - val_loss: 0.4573 - val_accuracy: 0.8287
Epoch 18/20

Epoch 00018: val_loss did not improve from 0.36455
21367/21367 - 12s - loss: 0.2138 - accuracy: 0.9103 - val_loss: 0.4570 - val_accuracy: 0.8274
Epoch 19/20

Epoch 00019: val_loss did not improve from 0.36455
21367/21367 - 12s - loss: 0.2082 - accuracy: 0.9140 - val_loss: 0.4661 - val_accuracy: 0.8313
Epoch 20/20

Epoch 00020: val_loss did not improve from 0.36455
21367/21367 - 12s - loss: 0.2007 - accuracy: 0.9170 - val_loss: 0.4926 - val_accuracy: 0.8255
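
The "improved" / "did not improve" messages follow a simple rule: with save_best_only=True, the weights are written only when val_loss drops below the best value seen so far. Replaying that rule on the val_loss values copied from the log above shows which epochs triggered a save (a sketch of the logic, not the callback's actual code):

```python
# val_loss per epoch, copied from the training log above
val_losses = [0.3850, 0.3825, 0.3645, 0.3683, 0.3691, 0.3703, 0.3706, 0.3983, 0.3820, 0.3855,
              0.3906, 0.4042, 0.4072, 0.4236, 0.4175, 0.4293, 0.4573, 0.4570, 0.4661, 0.4926]

best = float("inf")
saved_epochs = []
for epoch, val_loss in enumerate(val_losses, start=1):
    if val_loss < best:  # only a strict improvement triggers a save
        best = val_loss
        saved_epochs.append(epoch)

print(saved_epochs)  # [1, 2, 3]
print(best)          # 0.3645
```

So the checkpoint file ends up holding the weights from epoch 3.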

10. Load the best model's weights

In [29]:
model.load_weights(checkpoint_path)
Out[29]:
<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7fbf3b991780>

New data can now be processed with the same steps (tokenization, text-to-sequence conversion, pad_sequences) and then passed to the model we defined to make predictions.
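
Those steps could be wrapped in a small helper like the one below. This is a sketch under assumptions: `predict_sarcasm` is a hypothetical function, the 0.5 threshold is an arbitrary choice, and the tokenizer, model, and padding function are passed in as arguments so the block stands alone. The already-fitted tokenizer should be reused as-is, not fitted again on the new text.

```python
def predict_sarcasm(sentences, tokenizer, model, pad_fn, maxlen=120, threshold=0.5):
    """Apply the training-time preprocessing, then threshold the sigmoid output."""
    sequences = tokenizer.texts_to_sequences(sentences)
    padded = pad_fn(sequences, truncating='post', padding='post', maxlen=maxlen)
    probabilities = model.predict(padded)  # one sigmoid output per sentence, shape (n, 1)
    return [(sentence, float(p[0]), float(p[0]) >= threshold)
            for sentence, p in zip(sentences, probabilities)]

# Usage with the notebook's objects (commented out so this block runs on its own):
# predict_sarcasm(["some new headline"], token, model, pad_sequences, maxlen=_maxlen)
```

Each result is a (sentence, probability, is_sarcastic) tuple.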

11. Visualization

In [30]:
import matplotlib.pyplot as plt

%matplotlib inline

Loss

In [31]:
plt.figure(figsize=(12, 6))
plt.plot(np.arange(20)+1, history.history['loss'], label='Loss')
plt.plot(np.arange(20)+1, history.history['val_loss'], label='Validation Loss')
plt.title('losses over training', fontsize=20)

plt.xlabel('epochs', fontsize=15)
plt.ylabel('loss', fontsize=15)

plt.legend()
plt.show()

Accuracy

In [32]:
plt.figure(figsize=(12, 6))
plt.plot(np.arange(20)+1, history.history['accuracy'], label='Accuracy')
plt.plot(np.arange(20)+1, history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Accuracy over training', fontsize=20)

plt.xlabel('epochs', fontsize=15)
plt.ylabel('Accuracy', fontsize=15)

plt.legend()
plt.show()

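Summary

  • After epoch 3, validation loss trends upward while training loss keeps falling. (This suggests an overfitting problem.)
  • Validation accuracy stays around 82-83% and barely changes as the number of epochs grows.
  • There is plenty of room to improve the model, for example by stacking more Dense layers, using Conv1D, or stacking two LSTM layers.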