TensorFlow RNN Text Generation (Generating Shakespeare-Style Text)
Jun 9, 2020

This post is a clone of the official TensorFlow tutorial on text generation with a recurrent neural network. We train a character-level model on the Shakespeare dataset so that it can generate text in Shakespeare's style.

The data is arranged as a windowed dataset, and the model is built from an Embedding layer and an LSTM layer, followed by a Dense output layer.

This tutorial is a clone of the tutorial in the official TensorFlow documentation.

import tensorflow as tf

import numpy as np
import os
import time
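
%%javascript
// Classic Jupyter Notebook only; run this in its own cell.
// Lowers the threshold at which long cell output starts to scroll.
IPython.OutputArea.auto_scroll_threshold = 20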

Downloading the Shakespeare Dataset

Download the shakespeare.txt dataset from Google's dataset server.

path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')
text = open(path_to_file, 'rb').read().decode(encoding='utf-8')
print(text[:200])
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you
print(repr(text[:200]))
'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou are all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you'
# Total length of the text, in characters
len(text)
1115394

Collect the unique characters and print how many there are.

vocab = sorted(set(text))
vocab[:10]
['\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3']
len(vocab)
65

Text Preprocessing

Step 1. Building the character dictionary

Build a dictionary that maps each character to an index.

char2idx = {u: i for i, u in enumerate(vocab)}
char2idx
{'\n': 0,
 ' ': 1,
 '!': 2,
 '$': 3,
 '&': 4,
 "'": 5,
 ',': 6,
 '-': 7,
 '.': 8,
 '3': 9,
 ':': 10,
 ';': 11,
 '?': 12,
 'A': 13,
 'B': 14,
 'C': 15,
 'D': 16,
 'E': 17,
 'F': 18,
 'G': 19,
 'H': 20,
 'I': 21,
 'J': 22,
 'K': 23,
 'L': 24,
 'M': 25,
 'N': 26,
 'O': 27,
 'P': 28,
 'Q': 29,
 'R': 30,
 'S': 31,
 'T': 32,
 'U': 33,
 'V': 34,
 'W': 35,
 'X': 36,
 'Y': 37,
 'Z': 38,
 'a': 39,
 'b': 40,
 'c': 41,
 'd': 42,
 'e': 43,
 'f': 44,
 'g': 45,
 'h': 46,
 'i': 47,
 'j': 48,
 'k': 49,
 'l': 50,
 'm': 51,
 'n': 52,
 'o': 53,
 'p': 54,
 'q': 55,
 'r': 56,
 's': 57,
 't': 58,
 'u': 59,
 'v': 60,
 'w': 61,
 'x': 62,
 'y': 63,
 'z': 64}

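Build the reverse lookup from index to character. Because the indices are consecutive integers starting at 0, a NumPy array of the vocabulary is enough.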

idx2char = np.array(vocab)
idx2char[49]
'k'

Step 2. Converting the entire text to integers

text[:200]
'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou are all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you'
char2idx['i']
47
text_as_int = np.array([char2idx[c] for c in text])
len(text_as_int)
1115394
text_as_int[:10]
array([18, 47, 56, 57, 58,  1, 15, 47, 58, 47])

Check the converted values (first 5 characters).

# Original text
text[:5]
'First'
# Converted sequence
text_as_int[:5]
array([18, 47, 56, 57, 58])
# Look up each character in the dictionary
char2idx['F'], char2idx['i'], char2idx['r'], char2idx['s'], char2idx['t']
(18, 47, 56, 57, 58)
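
The conversion is reversible: looking up each integer in the index-to-character table recovers the original text. The following is a minimal sketch of that round trip on a short sample string. The helper names `encode` and `decode` and the `toy_` variables are assumptions introduced for illustration, and a plain Python list stands in for the NumPy array.

```python
# Toy round trip between characters and integer indices.
# `encode`, `decode`, and the `toy_` names are hypothetical, not part of the tutorial.
sample = "First Citizen"

toy_vocab = sorted(set(sample))                           # unique characters, sorted
toy_char2idx = {ch: i for i, ch in enumerate(toy_vocab)}  # character -> index
toy_idx2char = toy_vocab                                  # index -> character

def encode(s):
    """Convert a string into a list of integer indices."""
    return [toy_char2idx[ch] for ch in s]

def decode(ids):
    """Convert a list of integer indices back into a string."""
    return "".join(toy_idx2char[i] for i in ids)

ids = encode(sample)
restored = decode(ids)
print(ids)
print(restored)
```

The same idea applies to the full text: joining `idx2char[text_as_int[:5]]` should give back the first five characters.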

Step 3. Creating the X, Y datasets

# Length of each input sequence, in characters
window_size = 100
shuffle_buffer = 10000
batch_size = 64

Create the windowed dataset.

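def windowed_dataset(series, window_size, shuffle_buffer, batch_size):
    # Add a trailing axis so that every element has shape (1,)
    series = tf.expand_dims(series, -1)
    ds = tf.data.Dataset.from_tensor_slices(series)
    # Slide a window of window_size + 1 characters, one character at a time
    ds = ds.window(window_size + 1, shift=1, drop_remainder=True)
    # Each window is a nested dataset; flatten it into a single tensor
    ds = ds.flat_map(lambda x: x.batch(window_size + 1))
    ds = ds.shuffle(shuffle_buffer)
    # Input: all but the last character. Target: the same window shifted by one.
    ds = ds.map(lambda x: (x[:-1], x[1:]))
    return ds.batch(batch_size).prefetch(1)

train_data = windowed_dataset(np.array(text_as_int), window_size, shuffle_buffer, batch_size)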
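
To see what the window and the input/target split produce, here is a rough pure-Python sketch of the same idea on a short string. The function name `make_windows` is an assumption made for illustration, and the sketch ignores shuffling, batching, and the trailing axis added by `tf.expand_dims`.

```python
def make_windows(seq, window_size):
    """Return (input, target) pairs built from windows of window_size + 1 items."""
    pairs = []
    # Stop at the last start index whose window still fits completely,
    # which mirrors drop_remainder=True.
    for start in range(len(seq) - window_size):
        window = seq[start:start + window_size + 1]
        # The target is the input shifted one position to the right.
        pairs.append((window[:-1], window[1:]))
    return pairs

pairs = make_windows("First Citizen", window_size=4)
for x, y in pairs[:3]:
    print(repr(x), "->", repr(y))
print(len(pairs))
```

By the same rule, the full text of 1,115,394 characters with `window_size = 100` gives 1,115,294 windows.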
# Size of the character vocabulary
vocab_size = len(vocab)
vocab_size
65
# Embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units = 1024
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
                              batch_input_shape=[batch_size, None]),
    tf.keras.layers.LSTM(rnn_units,
                        return_sequences=True,
                        stateful=True,
                        recurrent_initializer='glorot_uniform'),
    tf.keras.layers.Dense(vocab_size)
])
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (64, None, 256)           16640     
_________________________________________________________________
lstm (LSTM)                  (64, None, 1024)          5246976   
_________________________________________________________________
dense (Dense)                (64, None, 65)            66625     
=================================================================
Total params: 5,330,241
Trainable params: 5,330,241
Non-trainable params: 0
_________________________________________________________________
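
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for these layer types (an LSTM has four gates, each with input weights, recurrent weights, and a bias); the function names are assumptions chosen for illustration.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding vector per vocabulary entry
    return vocab_size * embedding_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, output_dim):
    # Weight matrix plus one bias per output
    return input_dim * output_dim + output_dim

emb = embedding_params(65, 256)
lstm = lstm_params(256, 1024)
dense = dense_params(1024, 65)
total = emb + lstm + dense
print(emb, lstm, dense, total)
```

The four values match the `Param #` column and the `Total params` line of the summary.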

Create the checkpoint callback.

# Path where the checkpoint will be saved
checkpoint_path = './models/my_checkpt.ckpt'

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_path,
    save_weights_only=True,
    save_best_only=True,
    monitor='loss',
    verbose=1,
)

Define the loss function. The final Dense layer has no activation and returns raw logits, so from_logits=True is passed.

def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)
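
For a single position, sparse categorical cross-entropy on logits is the negative log of the softmax probability assigned to the true label. The following pure-Python sketch shows that calculation; the function name `sparse_ce_from_logits` is an assumption for illustration and is not the Keras implementation.

```python
import math

def sparse_ce_from_logits(label, logits):
    """Cross-entropy for one integer label and one vector of raw logits."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    # Equals -log(softmax(logits)[label])
    return log_sum_exp - logits[label]

# 65 equally likely characters: the loss is ln(65)
uniform = sparse_ce_from_logits(0, [0.0] * 65)
# A large logit on the true label gives a loss close to 0
confident = sparse_ce_from_logits(2, [0.0, 0.0, 5.0])
print(uniform, confident)
```

A model that spreads its probability evenly over all 65 characters scores ln 65, about 4.17, which is a useful baseline when reading the loss values during training.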
model.compile(optimizer='adam', loss=loss, metrics=['acc'])
model.fit(train_data, 
          epochs=10, 
          steps_per_epoch=1720, 
          callbacks=[checkpoint_callback])
Epoch 1/10
1719/1720 [============================>.] - ETA: 0s - loss: 0.7094 - acc: 0.8217
Epoch 00001: loss improved from inf to 0.70912, saving model to ./models/my_checkpt.ckpt
1720/1720 [==============================] - 51s 30ms/step - loss: 0.7091 - acc: 0.8217
Epoch 2/10
1719/1720 [============================>.] - ETA: 0s - loss: 0.3121 - acc: 0.9299
Epoch 00002: loss improved from 0.70912 to 0.31212, saving model to ./models/my_checkpt.ckpt
1720/1720 [==============================] - 51s 29ms/step - loss: 0.3121 - acc: 0.9299
Epoch 3/10
1719/1720 [============================>.] - ETA: 0s - loss: 0.2816 - acc: 0.9363
Epoch 00003: loss improved from 0.31212 to 0.28167, saving model to ./models/my_checkpt.ckpt
1720/1720 [==============================] - 51s 29ms/step - loss: 0.2817 - acc: 0.9363
Epoch 4/10
1719/1720 [============================>.] - ETA: 0s - loss: 0.2805 - acc: 0.9365
Epoch 00004: loss improved from 0.28167 to 0.28046, saving model to ./models/my_checkpt.ckpt
1720/1720 [==============================] - 51s 30ms/step - loss: 0.2805 - acc: 0.9365
Epoch 5/10
1719/1720 [============================>.] - ETA: 0s - loss: 0.2882 - acc: 0.9353
Epoch 00005: loss did not improve from 0.28046
1720/1720 [==============================] - 51s 29ms/step - loss: 0.2883 - acc: 0.9353
Epoch 6/10
1719/1720 [============================>.] - ETA: 0s - loss: 0.2803 - acc: 0.9371
Epoch 00006: loss improved from 0.28046 to 0.28026, saving model to ./models/my_checkpt.ckpt
1720/1720 [==============================] - 51s 29ms/step - loss: 0.2803 - acc: 0.9371
Epoch 7/10
1719/1720 [============================>.] - ETA: 0s - loss: 0.2924 - acc: 0.9348
Epoch 00007: loss did not improve from 0.28026
1720/1720 [==============================] - 50s 29ms/step - loss: 0.2924 - acc: 0.9348
Epoch 8/10
1719/1720 [============================>.] - ETA: 0s - loss: 0.2993 - acc: 0.9336
Epoch 00008: loss did not improve from 0.28026
1720/1720 [==============================] - 51s 30ms/step - loss: 0.2993 - acc: 0.9336
Epoch 9/10
1719/1720 [============================>.] - ETA: 0s - loss: 0.2971 - acc: 0.9342
Epoch 00009: loss did not improve from 0.28026
1720/1720 [==============================] - 51s 30ms/step - loss: 0.2970 - acc: 0.9342
Epoch 10/10
1719/1720 [============================>.] - ETA: 0s - loss: 0.3014 - acc: 0.9332
Epoch 00010: loss did not improve from 0.28026
1720/1720 [==============================] - 51s 30ms/step - loss: 0.3014 - acc: 0.9332
<tensorflow.python.keras.callbacks.History at 0x7fd01070b780>
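
With `save_best_only=True` and `monitor='loss'`, a checkpoint is written only when the loss drops below the best value seen so far. The sketch below replays that rule on the end-of-epoch losses from the training log above; the function name `epochs_that_save` is an assumption for illustration.

```python
def epochs_that_save(losses):
    """Return the 1-based epochs at which a save-best-only rule writes a checkpoint."""
    best = float("inf")
    saved = []
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:  # strictly lower than the best value so far
            best = loss
            saved.append(epoch)
    return saved

# End-of-epoch losses copied from the training log above
logged_losses = [0.70912, 0.31212, 0.28167, 0.28046, 0.2883,
                 0.28026, 0.2924, 0.2993, 0.2970, 0.3014]
saved = epochs_that_save(logged_losses)
print(saved)
```

This agrees with the "loss improved" lines in the log: the last save happens at epoch 6, so the checkpoint file holds the epoch 6 weights rather than those from the final epoch.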

Redefining the model for prediction

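Change batch_size to 1. The training model was built with a fixed batch size of 64, so the same architecture is rebuilt with a batch size of 1 and the trained weights are loaded from the checkpoint.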

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
                              batch_input_shape=[1, None]),
    tf.keras.layers.LSTM(rnn_units,
                        return_sequences=True,
                        stateful=True,
                        recurrent_initializer='glorot_uniform'),
    tf.keras.layers.Dense(vocab_size)
])
model.load_weights(checkpoint_path)
<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7fcfc419ad68>
model.build(tf.TensorShape([1, None]))
model.summary()
Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_2 (Embedding)      (1, None, 256)            16640     
_________________________________________________________________
lstm_2 (LSTM)                (1, None, 1024)           5246976   
_________________________________________________________________
dense_2 (Dense)              (1, None, 65)             66625     
=================================================================
Total params: 5,330,241
Trainable params: 5,330,241
Non-trainable params: 0
_________________________________________________________________

Use the generate_text function to predict characters one after another.

def generate_text(model, start_string):
    # Evaluation step (generate text with the trained model)

    # Number of characters to generate
    num_generate = 1000

    # Convert the start string to numbers (vectorize)
    input_eval = [char2idx[s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)

    # Empty list to collect the generated characters
    text_generated = []

    # A lower temperature gives more predictable text.
    # A higher temperature gives more surprising text.
    # Experiment to find the best setting.
    temperature = 1.0

    # The batch size here is 1
    model.reset_states()
    for i in range(num_generate):
        predictions = model(input_eval)
        # Remove the batch dimension
        predictions = tf.squeeze(predictions, 0)

        # Sample the next character from the categorical distribution given by the logits
        predictions = predictions / temperature
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

        # Pass the predicted character to the model as the next input,
        # along with the previous hidden state
        input_eval = tf.expand_dims([predicted_id], 0)

        text_generated.append(idx2char[predicted_id])

    return (start_string + ''.join(text_generated))
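
Dividing the logits by the temperature before sampling changes how sharply the probability concentrates on the most likely character. The following is a small pure-Python sketch of that effect on three made-up logits; the function names and the logit values are assumptions for illustration, and `random.Random.choices` stands in for `tf.random.categorical`.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after scaling by 1 / temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(logits, temperature, rng):
    """Draw one index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

example_logits = [2.0, 1.0, 0.1]  # made-up values
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(example_logits, t)
    print(t, [round(p, 3) for p in probs])

rng = random.Random(0)
picked = sample_index(example_logits, 1.0, rng)
print(picked)
```

At a temperature below 1 the largest logit takes a bigger share of the probability, and above 1 the distribution flattens, which is why low temperatures give more predictable text.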

Printing the final result

print(generate_text(model, start_string=u"ROMEO: "))
ROMEO: what news be true,
Poor queen and this is he than for.

CLARENCE:
Feak man confeigned friend,
And their true sovereign, whom they must obey?
Nay, whom they shall obey, and love thee too,
Unless;
So servitor.

CLARENCE:
Belike the e once agree,
Is Clarence, Edward's brother, reception brief and Clarence commore he that hope I have of heavenly bliss,
That I am sour words him from thence the Thracian fatal steeds,
So we, well cover'd with tears,
And fault, you should have foull him as me of all kindness at my hand
That your estate requires and mine can yield.

WARWICK:
Henry now lives in Scotland at his ease,
Where comes the king.

Scotland him ere's Clarence, welcome unto Warwick;
And welcome, Somerset: I hold it cowardice
To rest mistrustful where a noble heart
Hath pawn'd an open hand in sign of love;
Else might I think that Clarence, Edwargis it flay:
My queen in person will Well deserves it;
And here, to pledge my vow, I give my hand.

KING LEWIS XI:
Why stay we now? My crown is cal 

