Novel Text Generation: Training an RNN of LSTM Cells on Anna Karenina

Posted 2018-12-19

1. Goal:

  • Train a Recurrent Neural Network (RNN) of Long Short-Term Memory (LSTM) cells on a corpus of text in order to generate new text on a character-by-character basis.
  • What makes RNNs interesting:
    • Able to find patterns in ordered sequences / time series (a dimension of time / memory)
    • Each timestep is dependent on what came before it
    • Much more flexible with inputs
    • Even if data is not a sequence, can learn to treat it as such
  • Concept:
    • Data is fed in ordered sequence (sliding window over the entire dataset)
    • The loss being minimized is the cross-entropy between the predicted output at each timestep and the actual next character, i.e. predict the 2nd character given the 1st character as input.
      • Note this means the target data is just the training data shifted by 1 timestep, so no labeled data is needed per se (see the short sketch after this list)
    • Memory comes into play when predicting the 3rd character given the first 2 prior ordered characters
    • Beyond prediction, the network can also be used for generation. A sampling hyperparameter can be tuned to produce more diverse or more conservative results
  • This is a replication (with a few of my own expansions) of the Udacity AIND Intro to RNN project on GitHub.
  • Training corpus is Anna Karenina full text from Project Gutenberg.
  • There are great resources for learning about RNNs - building a visual intuition of what happens at each level of the network was central to finally understanding what's going on under the hood
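  • A minimal sketch of the shifted-target idea on a toy string (my own illustration, not the notebook's data pipeline):

text = "happy families"
inputs = text[:-1]     # 'happy familie'
targets = text[1:]     # 'appy families'

# At every position the target is simply the next character of the input
for x_char, y_char in zip(inputs[:5], targets[:5]):
    print('input: {!r} -> target: {!r}'.format(x_char, y_char))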
In [1]:
import time
from collections import namedtuple

import keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers.embeddings import Embedding
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
# from tqdm import tqdm_notebook

%matplotlib inline
plt.style.use('fivethirtyeight')
Using TensorFlow backend.

2. Load and Preprocess Data

2.1 Keras Tokenizer

  • Because this is simple character-level tokenization, the high-level Keras text preprocessing utilities are sufficient: https://keras.io/preprocessing/text/
  • In reality it would be better to tokenize by word, at which point a lot more text preprocessing becomes possible (word stemming, stop-word removal, part-of-speech tagging, etc.), for which NLTK is much better suited: http://www.nltk.org/ (a minimal word-level sketch follows this list)
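  • A minimal word-level sketch with NLTK (not used in this notebook; assumes nltk is installed and the 'punkt' and 'stopwords' corpora have been downloaded via nltk.download()):

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

sentence = ("Happy families are all alike; every unhappy family is "
            "unhappy in its own way.")
tokens = nltk.word_tokenize(sentence)                  # word-level tokens
stemmer = PorterStemmer()
# Keep alphabetic tokens, drop stop words, and stem what's left
cleaned = [stemmer.stem(t.lower()) for t in tokens
           if t.isalpha() and t.lower() not in stopwords.words('english')]
print(cleaned)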
In [2]:
class TokenizedData:
    """Data class to convert file to sequences.
    
    Parameters
    ----------
    filename : string
        Relative path and filename to text
        
    Attributes
    ----------
    training_text : str
        Full text of training data
    t : Tokenizer
        Tokenizer that's already fitted on training_text
    index_to_char : dict
        Dict mapping index to character
    encoded_text : array, int
        Training text converted to sequence
    """
    def __init__(self, filename):
        # Read data
        with open(filename) as f:
            self.training_text = f.read()
        
        # Tokenizer
        self.t = Tokenizer(filters='', lower=False, char_level=True)
        self.t.fit_on_texts(self.training_text)
        self.index_to_char = dict(map(reversed, self.t.word_index.items())) 
        self.encoded_text = np.squeeze(self.t.texts_to_sequences(self.training_text))
    
    def sequence_to_str(self, sequence):
        """Map a list of sequences back to text string.
        
        Parameters
        ----------
        sequence : list, int
            Sequence to convert
        
        Returns
        -------
        mapped_text : str
            Converted sequence to text
        """
        mapped_text = ''.join([self.index_to_char[c] for c in sequence])
        return mapped_text
    
    def get_batches(self, batch_size, timesteps):
        """Yield data in batches.

        Parameters
        ---------
        batch_size : int
            Number of input sequences per batch
        timesteps : int
            Number of time steps per sequence (also width of sequence)

        Yields
        ------
        Batches of size batch_size X timesteps at a time.
        """
        # Drop trailing characters that are not enough to fill a complete
        # batch of batch_size * timesteps characters
        text = self.encoded_text
        chars_leftover = len(text) % (batch_size * timesteps)
        if chars_leftover > 0:
            text = text[:-chars_leftover]

        text = text.reshape(batch_size, -1)

        for cursor in range(0, text.shape[1], timesteps):
            x = text[:batch_size, cursor:cursor+timesteps]
            # Since y is x shifted over by 1, the last batch will need to be 
            # padded by a column of 0's
            y_padded = np.zeros(x.shape)
            y = text[:batch_size, cursor+1:cursor+timesteps+1]
            if (y_padded.shape == y.shape):
                yield x, y
            else:
                y_padded[:, :-1] = y
                yield x, y_padded

tokenized_data = TokenizedData('anna.txt')
In [3]:
# Set of all characters in the text
tokenized_data.t.word_index
Out[3]:
{' ': 1,
 'e': 2,
 't': 3,
 'a': 4,
 'o': 5,
 'n': 6,
 'h': 7,
 'i': 8,
 's': 9,
 'r': 10,
 'd': 11,
 'l': 12,
 '\n': 13,
 'u': 14,
 'w': 15,
 'c': 16,
 'm': 17,
 'g': 18,
 'y': 19,
 ',': 20,
 'f': 21,
 'p': 22,
 'b': 23,
 '.': 24,
 'v': 25,
 'k': 26,
 '"': 27,
 "'": 28,
 'I': 29,
 'A': 30,
 'x': 31,
 '-': 32,
 'T': 33,
 'S': 34,
 '?': 35,
 'L': 36,
 'H': 37,
 'W': 38,
 '!': 39,
 ';': 40,
 'B': 41,
 'V': 42,
 'j': 43,
 'q': 44,
 'K': 45,
 'Y': 46,
 'z': 47,
 'M': 48,
 'O': 49,
 'D': 50,
 'N': 51,
 'P': 52,
 'C': 53,
 '_': 54,
 'G': 55,
 'F': 56,
 'E': 57,
 ':': 58,
 'R': 59,
 '(': 60,
 ')': 61,
 '1': 62,
 '2': 63,
 'J': 64,
 'U': 65,
 '3': 66,
 '*': 67,
 '0': 68,
 '5': 69,
 '8': 70,
 '4': 71,
 '6': 72,
 '9': 73,
 '7': 74,
 '/': 75,
 'Q': 76,
 'Z': 77,
 'X': 78,
 '@': 79,
 '$': 80,
 '`': 81,
 '&': 82,
 '%': 83}
In [4]:
# Counts of each character in the text
tokenized_data.t.word_counts
Out[4]:
OrderedDict([('C', 796),
             ('h', 104874),
             ('a', 119810),
             ('p', 23288),
             ('t', 139018),
             ('e', 186592),
             ('r', 80402),
             (' ', 321702),
             ('1', 179),
             ('\n', 40263),
             ('H', 2077),
             ('y', 31223),
             ('f', 30986),
             ('m', 33518),
             ('i', 103979),
             ('l', 58913),
             ('s', 95717),
             ('k', 14285),
             (';', 1684),
             ('v', 18625),
             ('u', 40052),
             ('n', 110374),
             ('o', 114197),
             ('w', 35484),
             ('.', 19895),
             ('E', 491),
             ('g', 33033),
             ('c', 33922),
             ('O', 971),
             ('b', 19908),
             ("'", 6721),
             ('T', 2948),
             ('d', 68060),
             ('F', 494),
             (',', 31140),
             ('q', 1399),
             ('-', 3364),
             ('j', 1416),
             ('P', 876),
             ('S', 2901),
             ('A', 5303),
             ('"', 14012),
             ('Y', 1133),
             ('?', 2362),
             ('N', 918),
             ('!', 1717),
             ('D', 934),
             ('_', 706),
             ('I', 6254),
             ('x', 3422),
             (':', 439),
             ('M', 1015),
             ('W', 1817),
             ('(', 219),
             (')', 219),
             ('B', 1596),
             ('2', 107),
             ('R', 361),
             ('z', 1032),
             ('G', 613),
             ('L', 2168),
             ('3', 68),
             ('K', 1254),
             ('4', 35),
             ('V', 1439),
             ('5', 38),
             ('Z', 12),
             ('6', 33),
             ('7', 31),
             ('8', 37),
             ('U', 77),
             ('9', 33),
             ('0', 42),
             ('J', 89),
             ('Q', 22),
             ('`', 1),
             ('X', 3),
             ('*', 48),
             ('/', 31),
             ('&', 1),
             ('%', 1),
             ('@', 2),
             ('$', 2)])
In [5]:
# First 100 characters of text
tokenized_data.training_text[:100]
Out[5]:
'Chapter 1\n\n\nHappy families are all alike; every unhappy family is unhappy in its own\nway.\n\nEverythin'
In [6]:
# Encoded sequence
sequence = tokenized_data.encoded_text[:100]
sequence
Out[6]:
array([53,  7,  4, 22,  3,  2, 10,  1, 62, 13, 13, 13, 37,  4, 22, 22, 19,
        1, 21,  4, 17,  8, 12,  8,  2,  9,  1,  4, 10,  2,  1,  4, 12, 12,
        1,  4, 12,  8, 26,  2, 40,  1,  2, 25,  2, 10, 19,  1, 14,  6,  7,
        4, 22, 22, 19,  1, 21,  4, 17,  8, 12, 19,  1,  8,  9,  1, 14,  6,
        7,  4, 22, 22, 19,  1,  8,  6,  1,  8,  3,  9,  1,  5, 15,  6, 13,
       15,  4, 19, 24, 13, 13, 57, 25,  2, 10, 19,  3,  7,  8,  6])
In [7]:
tokenized_data.sequence_to_str(sequence)
Out[7]:
'Chapter 1\n\n\nHappy families are all alike; every unhappy family is unhappy in its own\nway.\n\nEverythin'

2.2 Generate Batches

In [8]:
# Sample array manipulation to get a feel of what's going on when we get the 
# batches
a = np.arange(20)
a = a.reshape(5,-1)
print(a)
for cursor in range(0, 4, 2):
    print(a[:5, cursor:cursor+2])
    print(a[:5, cursor+1:cursor+3])
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]
 [16 17 18 19]]
[[ 0  1]
 [ 4  5]
 [ 8  9]
 [12 13]
 [16 17]]
[[ 1  2]
 [ 5  6]
 [ 9 10]
 [13 14]
 [17 18]]
[[ 2  3]
 [ 6  7]
 [10 11]
 [14 15]
 [18 19]]
[[ 3]
 [ 7]
 [11]
 [15]
 [19]]
In [9]:
batches = tokenized_data.get_batches(10, 50)
x, y = next(batches)
In [10]:
x[:10, :10]
Out[10]:
array([[53,  7,  4, 22,  3,  2, 10,  1, 62, 13],
       [ 1,  4, 17,  1,  6,  5,  3,  1, 18,  5],
       [25,  8,  6, 24, 13, 13, 27, 46,  2,  9],
       [ 6,  1, 11, 14, 10,  8,  6, 18,  1,  7],
       [ 1,  8,  3,  1,  8,  9, 20,  1,  9,  8],
       [ 1, 29,  3,  1, 15,  4,  9, 13,  5,  6],
       [ 7,  2,  6,  1, 16,  5, 17,  2,  1, 21],
       [40,  1, 23, 14,  3,  1,  6,  5, 15,  1],
       [ 3,  1,  8,  9,  6, 28,  3, 24,  1, 33],
       [ 1,  9,  4,  8, 11,  1,  3,  5,  1,  7]])
In [11]:
y[:10, :10]
Out[11]:
array([[ 7,  4, 22,  3,  2, 10,  1, 62, 13, 13],
       [ 4, 17,  1,  6,  5,  3,  1, 18,  5,  8],
       [ 8,  6, 24, 13, 13, 27, 46,  2,  9, 20],
       [ 1, 11, 14, 10,  8,  6, 18,  1,  7,  8],
       [ 8,  3,  1,  8,  9, 20,  1,  9,  8, 10],
       [29,  3,  1, 15,  4,  9, 13,  5,  6, 12],
       [ 2,  6,  1, 16,  5, 17,  2,  1, 21,  5],
       [ 1, 23, 14,  3,  1,  6,  5, 15,  1,  9],
       [ 1,  8,  9,  6, 28,  3, 24,  1, 33,  7],
       [ 9,  4,  8, 11,  1,  3,  5,  1,  7,  2]])

3. Model

  • Part of the struggle in learning was having a mental model of the following:
    • RNN vs input/output layer
    • 1 LSTM stack vs the unrolled RNN of LSTM cells, and how the states / outputs connect across timesteps
    • How states / outputs are calculated within 1 LSTM cell using the forget, input (ignore), and output (read) gates
    • How the implementation of batches affects the matrix calculations
  • These distinctions are important to keep in mind for the following subsections (a rough numpy sketch of a single LSTM step follows this list)
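  • A rough numpy sketch of one LSTM cell step (standard LSTM equations without peepholes or batching; the variable names and toy sizes are mine, not the TF implementation used below):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM timestep. W: (4*hidden, input+hidden), b: (4*hidden,)."""
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b
    f = sigmoid(z[:hidden])             # forget gate: what to erase from memory
    i = sigmoid(z[hidden:2*hidden])     # input (ignore) gate: what to write
    g = np.tanh(z[2*hidden:3*hidden])   # candidate values to write
    o = sigmoid(z[3*hidden:])           # output (read) gate: what to expose
    c_t = f * c_prev + i * g            # new cell state ('long-term memory')
    h_t = o * np.tanh(c_t)              # new output ('short-term memory')
    return h_t, c_t

# Toy sizes: 83 one-hot input classes, 4 hidden units
rng = np.random.RandomState(0)
x_t = np.eye(83)[5]                     # one-hot encoded character
h_prev, c_prev = np.zeros(4), np.zeros(4)
W, b = rng.randn(16, 87) * 0.1, np.zeros(16)
h_t, c_t = lstm_step(x_t, h_prev, c_prev, W, b)
print(h_t.shape, c_t.shape)             # (4,) (4,)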

3.1 Inputs to TF Graph

In [12]:
def model_inputs(batch_size, timesteps):
    """
    Build model inputs to the TF graph. Note that target is just the input 
    sequence shifted over one timestep.
    
    Parameters
    ----------
    batch_size : int
        Batch size
    timesteps : int
        Number of steps per sequence
    
    Returns
    -------
    x : placeholder tensor, int
        Training data input
        shape (None, timesteps)
    y : placeholder tensor, int
        Target data input (training data shifted over by 1 step)
        shape (None, timesteps)
    keep_prob : float
        Keep probability regularization within lstm cell
    """
    x = tf.placeholder(tf.int32, shape=(None, timesteps), name='input_sequence')
    y = tf.placeholder(tf.int32, shape=(None, timesteps), name='target_sequence')
    keep_prob = tf.placeholder(tf.float32, shape=(), name='keep_prob')
    
    return x, y, keep_prob

3.2 LSTM Stack

In [13]:
def model_lstm_stack(lstm_cell_size, num_layers, batch_size, keep_prob):
    """Build a stack of peephole lstm cells for 1 timestep.
    
    Parameters
    ----------
    lstm_cell_size : int
        Number of hidden neurons in each lstm cell
    num_layers : int
        Number of stacked lstm layers
    batch_size : int
        Number of sequences per batch
    keep_prob : placeholder scalar tensor
        Keep probability regularization within each lstm cell (note dropout
        is applied to the cell outputs, not the inputs or state)
    
    Returns
    -------
    lstm_stack : MultiRNNCell
        RNN stack composed sequentially of num_layers lstm cells
    init_state : tuple of LSTMStateTuple
        Zero-initialized state of the stack, one (c, h) pair per layer
        shape (batch_size, lstm_cell_size)
    """
    def lstm_cell(lstm_cell_size, keep_prob):
        cell = tf.nn.rnn_cell.LSTMCell(lstm_cell_size, use_peepholes=True)
        cell = tf.nn.rnn_cell.DropoutWrapper(cell, output_keep_prob=keep_prob)
        return cell
    
    lstm_stack = tf.nn.rnn_cell.MultiRNNCell(
                        [lstm_cell(lstm_cell_size, keep_prob) for _ in range(num_layers)])
    init_state = lstm_stack.zero_state(batch_size, dtype=tf.float32)
    
    return lstm_stack, init_state

3.3 Output Layer

  • The RNN outputs at all timesteps train the same dense output-layer weights and biases (see the sketch below)
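  • A small numpy analogue of that reshape (toy shapes, not the model's tensors): flattening (batch_size, timesteps, lstm_cell_size) to (batch_size * timesteps, lstm_cell_size) lets one dense layer score every timestep of every sequence with the same weights.

import numpy as np

batch_size, timesteps, lstm_cell_size, num_classes = 2, 3, 4, 5
rng = np.random.RandomState(0)
rnn_outputs = rng.randn(batch_size, timesteps, lstm_cell_size)

# Rows are ordered batch 0 t0..t2, then batch 1 t0..t2
flat = rnn_outputs.reshape(-1, lstm_cell_size)     # shape (6, 4)
dense_w = rng.randn(lstm_cell_size, num_classes)
dense_b = np.zeros(num_classes)
logits = flat @ dense_w + dense_b                  # shape (6, 5): one row per timestep
print(logits.shape)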
In [14]:
def model_output(rnn_outputs, in_size, out_size):
    """Manually build softmax prediction layer using 1 set of weights for all 
    timestep outputs from RNN.
    
    Parameters
    ----------
    rnn_outputs :
        shape (batch_size, timestep, lstm_cell_size)
    in_size : int
        Input size into softmax layer - number of hidden neurons in each lstm
        cell
    out_size : int
        Output size for softmax layer - number of prediction classes

    Returns
    -------
    logits : tensor
        Logit outputs
        shape ((batch_size * timestep), out_size)
    predictions : tensor
        Softmax logits
        shape ((batch_size * timestep), out_size)
    """
    # rnn_outputs from dynamic_rnn has shape 
    # (batch_size, timestep, lstm_cell_size). Convert it so that there is 
    # only 1 timestep per row. Rows are ordered by going through all the 
    # timesteps in 1 sequence, then the next sequence in the batch.
    # Resulting shape ((batch_size * timestep), lstm_cell_size)
    
    # The concat is effectively a pass-through for the single tensor that 
    # dynamic_rnn returns; it matters when rnn_outputs is a per-timestep list
    input_seq = tf.concat(rnn_outputs, axis=1)
    # Reshape so there is only 1 timestep input per row
    input_step = tf.reshape(input_seq, [-1, in_size])
    
    # Variable scope to avoid name collision with softmax within LSTM cells
    with tf.variable_scope('output_layer'):                       
        dense_w = tf.Variable(tf.truncated_normal((in_size, out_size), 
                                                  dtype=tf.float32))
        dense_b = tf.Variable(tf.zeros((out_size)))
    
    logits = tf.add(tf.matmul(input_step, dense_w), 
                    dense_b, name='logits')
    predictions = tf.nn.softmax(logits, name='predictions')
    
    return logits, predictions

3.4 Loss

  • Calculate the cross-entropy loss for the batch of timesteps using tf.nn.softmax_cross_entropy_with_logits_v2 (a tiny numpy check follows)
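  • A tiny numpy check of what that loss computes (my own toy numbers): with one-hot labels, the cross-entropy reduces to -log of the probability assigned to the true next character.

import numpy as np

logits = np.array([2.0, 1.0, 0.1])
probs = np.exp(logits) / np.sum(np.exp(logits))    # softmax
y_one_hot = np.array([1.0, 0.0, 0.0])              # true next character is class 0
loss = -np.sum(y_one_hot * np.log(probs))
print(loss)                                        # == -log(probs[0]) ~= 0.417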
In [15]:
def model_loss(logits, y_one_hot):
    """Calculate losses.
    
    Parameters
    ----------
    logits : tensor
        Output layer logits from softmax
        shape ((batch_size, timestep), num_classes)
    y_one_hot : tensor
        Target labels of next token
        shape (batch_size, timestep, num_classes)
    
    Returns
    -------
    loss : scalar tensor
        Cross entropy loss for the batch of timesteps
    """
    # Reshape y_one_hot to ((batch_size * timestep), num_classes) to match logits
    y_reshaped = tf.reshape(y_one_hot, logits.get_shape())
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(labels=y_reshaped, 
                                                      logits=logits), name='loss')
    return loss

3.5 Optimizer

  • LSTM cells are designed to deal with vanishing gradients (i.e. gradients that become vanishingly small as the number of timesteps grows).
  • Gradient clipping to a threshold is also employed to deal with exploding gradients.
    • https://arxiv.org/pdf/1211.5063.pdf (Pascanu 2013)
    • This requires a lower-level optimizer setup that breaks the optimization into 2 distinct steps: computing and clipping the gradients with tf.clip_by_global_norm, then applying them with apply_gradients (see the sketch after this list)
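  • A plain-numpy sketch of what tf.clip_by_global_norm does (my own illustration of the Pascanu et al. rescaling, not TF internals):

import numpy as np

def clip_by_global_norm(grads, clip_norm):
    """Rescale all gradients by clip_norm / global_norm when the global
    norm (norm of all gradients concatenated) exceeds clip_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if global_norm > clip_norm:
        grads = [g * clip_norm / global_norm for g in grads]
    return grads, global_norm

grads = [np.array([3.0, 4.0]), np.array([12.0])]   # global norm = 13
clipped, norm = clip_by_global_norm(grads, clip_norm=5.0)
print(norm)                              # 13.0
print([g.tolist() for g in clipped])     # each gradient scaled by 5/13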
In [16]:
def model_optimizer(loss, learning_rate, grad_clip):
    """Low level building of optimizer using gradient clipping to help with 
    gradient overflow. LSTM solve underflow but gradient clipping is needed 
    for overflow. Gradient clipping clips values of multiple tensors by the 
    ratio of sum of their norms. 
    
    Parameters
    ----------
    loss : scalar tensor
    learning_rate : float
    grad_clip : float
        Global-norm threshold above which gradients are scaled down
    
    Returns
    -------
    optimizer_op : Optimizer Operation
        The Optimizer Operation that applies the specific gradients 
    
    """
    # Optimizer.minimize() = compute gradients + apply gradients; split into
    # 2 explicit steps here so the gradients can be clipped in between.
    # Compute gradients of trainable vars and clip them if too big
    tvars = tf.trainable_variables()
    grads, _ = tf.clip_by_global_norm(tf.gradients(loss, tvars), grad_clip)
    
    # Use AdamOptimizer and apply gradient
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
    optimizer_op = optimizer.apply_gradients(zip(grads, tvars))
    
    return optimizer_op

3.6 CharRNN Class

  • Because 'memory' is maintained in an RNN, in contrast to other deep learning models, we need to track the initial state (i.e. the state of the RNN before timestep 1 of the entire training corpus) and be able to reset it to 0 for every new epoch. Otherwise the RNN would retain memory from the end of the training corpus (i.e. have a memory of the future).
  • Note the distinction between state and output. State is the memory of the RNN; the output at a timestep is based on the input and the memory. Thus the state is the 'long-term memory' while the output is the 'short-term memory' of the LSTM.
  • tf.nn.dynamic_rnn unrolls the lstm stack across timesteps (a toy-shape sketch of outputs vs state follows this list)
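  • A toy-shape sketch of the outputs-vs-state distinction returned by tf.nn.dynamic_rnn (a throwaway graph with made-up sizes, separate from the CharRNN graph built below):

import tensorflow as tf

tf.reset_default_graph()
batch_size, timesteps, num_layers, hidden = 2, 5, 2, 8
inputs = tf.placeholder(tf.float32, (batch_size, timesteps, 3))
cells = tf.nn.rnn_cell.MultiRNNCell(
    [tf.nn.rnn_cell.LSTMCell(hidden) for _ in range(num_layers)])
init_state = cells.zero_state(batch_size, tf.float32)
outputs, final_state = tf.nn.dynamic_rnn(cells, inputs, initial_state=init_state)

print(outputs.shape)           # (2, 5, 8): one output per timestep ('short-term memory')
print(len(final_state))        # 2: one LSTMStateTuple (c, h) per layer
print(final_state[0].c.shape)  # (2, 8): cell state after the last timestep ('long-term memory')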
In [17]:
class CharRNN:
    """A character based RNN using LSTM cells and gradient clipping.
    
    Parameters
    ----------
    num_classes : int
        Number of prediction classes
    batch_size : int
        Number in each batch [default: 64]
    timesteps : int
        Number of timesteps within each sequence [default: 50]
    lstm_cell_size : int
        Number of neurons within each lstm cell [default: 128]
    num_layers : int
        Number of stacked lstm layers [default: 2]
    learning_rate : float
        Optimizer learning rate [default: 0.001]
    grad_clip : float
        Gradient clipping ratio [default: 5.]
    sampling : bool
        1 timestep calculation at a time if True [default: False]
        
    Attributes
    ----------
    x : ndarray, int
        Training data inputs
    y : ndarray, int
        Target data
    batch_size : int
        Number in each batch
    timesteps : int
        Number of timesteps within each sequence
    init_state : tuple of LSTMStateTuple
        Initial state of the RNN stack - in the context of the 'unrolled' RNN
        this is the beginning of the first timestep, i.e. the state before
        anything has been fed through the network
        shape (batch_size, lstm_cell_size) 
    final_state : tuple of LSTMStateTuple
        Final state of the RNN stack - in the context of the 'unrolled' RNN 
        this is the state at the end of the last timestep, i.e. the state
        after the sequence has been fed through the network
        shape (batch_size, lstm_cell_size) 
    logits : tensor
        Logit outputs 
        shape ((batch_size * timestep), out_size)
    predictions : tensor
        Softmax logits
        shape ((batch_size * timestep), out_size)
    batch_loss : float
        loss for the batch
    optimizer_op : Optimizer Operation
        The Optimizer Operation that applies the specific gradients 
    """
    def __init__(self, num_classes, batch_size=64, timesteps=50, lstm_cell_size=128,
                 num_layers=2, learning_rate=0.001, grad_clip=5., 
                 sampling=False):
        if sampling:
            self.batch_size, self.timesteps = 1, 1
        else:
            self.batch_size, self.timesteps = batch_size, timesteps
        
        tf.reset_default_graph()
        
        # Build inputs to TF graph
        self.x, self.y, self.keep_prob = model_inputs(self.batch_size, self.timesteps)
        x_one_hot = tf.one_hot(self.x, num_classes)
        y_one_hot = tf.one_hot(self.y, num_classes)
        
        # Build 1 lstm stack
        lstm_stack, self.init_state = model_lstm_stack(lstm_cell_size, 
                                                       num_layers, 
                                                       self.batch_size, 
                                                       self.keep_prob)
        # Build RNN by unrolling the lstm stack to timesteps
        rnn_outputs, self.final_state = tf.nn.dynamic_rnn(cell=lstm_stack,
                                                      inputs=x_one_hot,
                                                      initial_state=self.init_state)
        # Apply dense layer to RNN outputs to get logits and softmax prediction
        self.logits, self.predictions = model_output(rnn_outputs, 
                                                     lstm_cell_size, 
                                                     num_classes)
        # Calculate losses and optimizer for batch
        self.batch_loss = model_loss(self.logits, y_one_hot)
        self.optimizer_op = model_optimizer(self.batch_loss, 
                                            learning_rate, 
                                            grad_clip)

3.7 Training Loop

  • For each batch, run the loss, final state, and optimization ops
  • Make sure to pass the final_state of each batch in as the init_state of the next batch
In [18]:
def train(model, tokenized_data, epochs, keep_prob, batch_size, timesteps, 
          print_every_n, save_every_n):
    """Training loop.
    
    Parameters
    ----------
    model : CharRNN
        CharRNN object initialized with model parameters
    tokenized_data : TokenizedData
        Tokenized data object
    epochs : int
        Number of epochs to train
    keep_prob : float
        Keep probability of layers in each lstm cell
    batch_size : int
        Batch size 
    timesteps : int
        Number of timesteps for each sequence
    print_every_n : int
        Print every n batches
    save_every_n : int
        Save every n batches
    
    Returns
    -------
    epoch_losses : list, float
        List of epoch losses
    
    """
    saver = tf.train.Saver(max_to_keep=20, save_relative_paths=True)
    epoch_losses = []
    
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        
        for e in range(epochs):
            # Initialize state of RNN to 0 at start of epoch
            new_state = sess.run(model.init_state)
            epoch_t1 = time.time()
            epoch_loss = 0
            
            # Progress bar
#             total_batches = int(len(tokenized_data.encoded_text) / (batch_size * timesteps))
#             pbar = tqdm_notebook(total=total_batches)
            
            for x, y in tokenized_data.get_batches(batch_size, timesteps):
#                 pbar.update(1)
                batch_t1 = time.time()
                
                feed = {model.x: x,
                        model.y: y, 
                        model.keep_prob: keep_prob,
                        model.init_state: new_state}
                batch_loss, new_state, optimizer_op = sess.run([model.batch_loss, 
                                                                model.final_state, 
                                                                model.optimizer_op], 
                                                               feed_dict=feed)
                batch_t2 = time.time()
                epoch_loss += batch_loss
            epoch_t2 = time.time()
            epoch_losses.append(epoch_loss)
            
            
            # Print output
            if (e % print_every_n == 0):
                print('Epoch {}/{}'.format(e+1, epochs))
                print('Epoch loss: {:.4f}'.format(epoch_loss))
                print('Epoch time taken: {:.3f}'.format(epoch_t2 - epoch_t1))
                print('Last batch loss: {:.4f}'.format(batch_loss))
                print('Last batch time taken: {:.3f}'.format(batch_t2 - batch_t1))

            # Save model weights to disk
            if (e % save_every_n == 0):
                saver.save(sess, 'checkpoints/e{}.ckpt'.format(e))
            
        # Save model weight for the very last iteration
        saver.save(sess, 'checkpoints/e{}.ckpt'.format(e))
    
    return epoch_losses

4. Main Loop

4.1 Hyperparameters

  • Some tips on hyperparameters: https://github.com/karpathy/char-rnn#tips-and-tricks
  • lstm cell size - number of hidden units
  • number of hidden layers - 2 or 3
  • timesteps - governs how far the gradient propagates back in time, i.e. the length of the patterns / relationships the model can pick up within 1 sequence
  • Note I didn't do cross validation here, but the approach would be the same: split the data so that 95% is used for training and 5% for validation / test (a minimal split sketch follows this list).
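  • A minimal sketch of the split mentioned above (not run in this notebook): hold out the last 5% of the encoded text for validation.

split_frac = 0.95
split_idx = int(len(tokenized_data.encoded_text) * split_frac)
train_text = tokenized_data.encoded_text[:split_idx]
val_text = tokenized_data.encoded_text[split_idx:]
print(len(train_text), len(val_text))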
In [19]:
# GLOBAL VARIABLES
# Model hyperparameters
LSTM_CELL_SIZE = 512
NUM_LAYERS = 2
LEARNING_RATE = 0.001
GRAD_CLIP = 5

# Training hyperparameters
EPOCHS = 25
KEEP_PROB = 0.5
BATCH_SIZE = 100
TIMESTEPS = 100
PRINT_EVERY_N = 1
SAVE_EVERY_N = 1
In [20]:
def main():
    model = CharRNN(len(tokenized_data.t.word_index), 
                    batch_size=BATCH_SIZE, 
                    timesteps=TIMESTEPS,
                    lstm_cell_size=LSTM_CELL_SIZE,
                    num_layers=NUM_LAYERS,
                    learning_rate=LEARNING_RATE,
                    grad_clip=GRAD_CLIP)
    epoch_losses = train(model, 
                         tokenized_data,
                         epochs=EPOCHS, 
                         keep_prob=KEEP_PROB, 
                         batch_size=BATCH_SIZE,
                         timesteps=TIMESTEPS,
                         print_every_n=PRINT_EVERY_N, 
                         save_every_n=SAVE_EVERY_N)
    
    plt.figure(figsize=(6,6))
    plt.plot(epoch_losses)
    plt.xlabel('Epochs')
    plt.ylabel('Epoch losses')

main()
Epoch 1/25
Epoch loss: 538.2426
Epoch time taken: 60.179
Last batch loss: 2.3385
Last batch time taken: 0.303
Epoch 2/25
Epoch loss: 423.2642
Epoch time taken: 59.991
Last batch loss: 2.0537
Last batch time taken: 0.301
Epoch 3/25
Epoch loss: 377.1644
Epoch time taken: 59.983
Last batch loss: 1.8777
Last batch time taken: 0.297
Epoch 4/25
Epoch loss: 348.0678
Epoch time taken: 60.075
Last batch loss: 1.7771
Last batch time taken: 0.307
Epoch 5/25
Epoch loss: 326.4816
Epoch time taken: 59.961
Last batch loss: 1.6803
Last batch time taken: 0.304
Epoch 6/25
Epoch loss: 310.3031
Epoch time taken: 59.904
Last batch loss: 1.6227
Last batch time taken: 0.305
Epoch 7/25
Epoch loss: 297.5667
Epoch time taken: 60.228
Last batch loss: 1.5557
Last batch time taken: 0.305
Epoch 8/25
Epoch loss: 287.5213
Epoch time taken: 60.257
Last batch loss: 1.5227
Last batch time taken: 0.302
Epoch 9/25
Epoch loss: 277.3047
Epoch time taken: 60.095
Last batch loss: 1.4846
Last batch time taken: 0.297
Epoch 10/25
Epoch loss: 271.0766
Epoch time taken: 60.299
Last batch loss: 1.4610
Last batch time taken: 0.305
Epoch 11/25
Epoch loss: 263.7930
Epoch time taken: 60.388
Last batch loss: 1.4219
Last batch time taken: 0.302
Epoch 12/25
Epoch loss: 258.3275
Epoch time taken: 61.221
Last batch loss: 1.3926
Last batch time taken: 0.306
Epoch 13/25
Epoch loss: 253.9536
Epoch time taken: 61.370
Last batch loss: 1.3828
Last batch time taken: 0.304
Epoch 14/25
Epoch loss: 250.0217
Epoch time taken: 61.558
Last batch loss: 1.3729
Last batch time taken: 0.336
Epoch 15/25
Epoch loss: 246.6374
Epoch time taken: 60.847
Last batch loss: 1.3408
Last batch time taken: 0.304
Epoch 16/25
Epoch loss: 243.6650
Epoch time taken: 61.318
Last batch loss: 1.3302
Last batch time taken: 0.306
Epoch 17/25
Epoch loss: 240.9595
Epoch time taken: 60.983
Last batch loss: 1.3247
Last batch time taken: 0.306
Epoch 18/25
Epoch loss: 238.3090
Epoch time taken: 61.152
Last batch loss: 1.3060
Last batch time taken: 0.307
Epoch 19/25
Epoch loss: 236.3545
Epoch time taken: 60.511
Last batch loss: 1.2986
Last batch time taken: 0.299
Epoch 20/25
Epoch loss: 233.8637
Epoch time taken: 60.366
Last batch loss: 1.2872
Last batch time taken: 0.304
Epoch 21/25
Epoch loss: 231.9072
Epoch time taken: 60.312
Last batch loss: 1.2798
Last batch time taken: 0.305
Epoch 22/25
Epoch loss: 230.0432
Epoch time taken: 60.494
Last batch loss: 1.2702
Last batch time taken: 0.303
Epoch 23/25
Epoch loss: 228.2421
Epoch time taken: 60.489
Last batch loss: 1.2690
Last batch time taken: 0.309
Epoch 24/25
Epoch loss: 226.5291
Epoch time taken: 60.308
Last batch loss: 1.2467
Last batch time taken: 0.299
Epoch 25/25
Epoch loss: 225.4623
Epoch time taken: 60.597
Last batch loss: 1.2338
Last batch time taken: 0.305

5. Predicting Characters and Generating Novel Text

  • Now that the RNN is trained, we can use it to predict the next character given a string of previous characters.
  • Note that if we feed the predicted character back in as input for the next timestep, we can generate new text of whatever length we specify!
  • A temperature-like hyperparameter dials between more conservative and more diverse (but higher-error) predictions; here that role is played by top_n, the number of top candidate characters sampled from (a sketch of true softmax-temperature sampling follows this list)
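  • A sketch of softmax-temperature sampling for comparison (my own illustration; the code below uses top_n sampling instead):

import numpy as np

def sample_with_temperature(prediction, temperature=1.0):
    """Low temperature -> conservative (near-greedy); high -> more diverse."""
    prediction = np.squeeze(prediction).astype(np.float64)
    logits = np.log(prediction + 1e-10) / temperature
    probs = np.exp(logits - np.max(logits))
    probs = probs / np.sum(probs)
    return np.random.choice(len(probs), p=probs)

toy_prediction = np.array([0.6, 0.3, 0.1])
print(sample_with_temperature(toy_prediction, temperature=0.5))  # usually class 0
print(sample_with_temperature(toy_prediction, temperature=2.0))  # more varied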
In [21]:
def pick_char(prediction, top_n):
    """Sample from the top_n characters probabilistically.
    
    Parameters
    ----------
    prediction : array, float
        Prediction probabilities for each char class
    top_n : int
        Number of top candidates to sample from
    
    Returns
    -------
    prediction_index : int
        Index of the character selected
    """
    prediction = np.squeeze(prediction)
    # Set all the classes outside top_n to 0 probability
    prediction[np.argsort(prediction)[:-top_n]] = 0
    # Normalize sum probability to 1
    prediction = prediction / np.sum(prediction)
    # Randomly select but based on probabilities of likelihood
    prediction_index = np.random.choice(len(prediction), 1, p=prediction)[0]
    
    return prediction_index
In [22]:
def infer(tokenized_data, checkpoint, model, n_samples, text_seed, top_n):
    """Generate new text based on text seed and checkpoint weights.
    
    Parameters
    ----------
    tokenized_data : TokenizedData
        Pre-processed data
    checkpoint : str
        Path to the checkpoint file to restore
    model : CharRNN
        RNN model
    n_samples : int
        Number of characters to generate
    text_seed : str
        Initial string to prime the RNN state
    top_n : int
        Number of top candidate characters to sample from

    Returns
    -------
    generated_text : str
        Predicted text string
    """
    generated_text = text_seed
    generated_seq = np.squeeze(tokenized_data.t.texts_to_sequences(text_seed))
    
    saver = tf.train.Saver()
    
    with tf.Session() as sess:
        saver.restore(sess, checkpoint)
        new_state = sess.run(model.init_state)
        
        # Prime the RNN state with initial text seed
        for i, _ in enumerate(generated_text[:-1]):
            x = np.zeros((1, 1))
            x[0,0] = generated_seq[i]
            feed = {model.x: x,
                    model.keep_prob: 1.,
                    model.init_state: new_state}
            _, new_state = sess.run([model.predictions, 
                                     model.final_state],
                                    feed_dict=feed)

        # Start generating predictions after RNN state has been primed
        for _ in range(n_samples):
            x = np.zeros((1, 1))
            x[0,0] = generated_seq[-1]
            feed = {model.x: x,
                    model.keep_prob: 1.,
                    model.init_state: new_state}
            prediction, new_state = sess.run([model.predictions, 
                                              model.final_state],
                                             feed_dict=feed)
            # Pick character and append
            predicted_char_index = pick_char(prediction, top_n)
            predicted_char = tokenized_data.sequence_to_str([predicted_char_index])[0]
            generated_text = generated_text + str(predicted_char)
            generated_seq = np.append(generated_seq, predicted_char_index)
        
        return generated_text
In [23]:
def generate_text(tokenized_data, n_samples, text_seed, top_n, checkpoint=None):
    """Use trained model to generate new text based on an initial string.
    
    Parameters
    ----------
    tokenized_data : TokenizedData
        Pre-processed data
    n_samples : int
        Number of characters to generate
    text_seed : str
        Initial string to prime the RNN state
    top_n : int
        Number of top candidate characters to sample from
    checkpoint : str
        Path to the checkpoint file; the latest checkpoint is used if None 
        [default: None]
    """

    infer_model = CharRNN(len(tokenized_data.t.word_index), 
                          lstm_cell_size=LSTM_CELL_SIZE,
                          num_layers=NUM_LAYERS,
                          sampling=True)
    
    if checkpoint is None:
        checkpoint = tf.train.latest_checkpoint('checkpoints')
    
    print(infer(tokenized_data, checkpoint, infer_model, n_samples, text_seed, top_n))

5.1 Final Model - Pick from top 5 most likely characters

  • With the same trained model and text string seed, there still seems to be a fair amount of variation due to probabilistic selection of the top 5 characters.
In [24]:
generate_text(tokenized_data, n_samples=1000, text_seed='She ', top_n=5)
INFO:tensorflow:Restoring parameters from checkpoints/e24.ckpt
She would not have
coming in and so that they had seat to him then, between Anna
and he was already, she should show that his brother was satisfied with
a stale of three man. He had brought him about the state of his head.

"The preter story a story of the while to my heart, as I can't give my since
the seads of a misence about the children were not telling him of the
most instant what all man, why. It has are stopped, both is that I should
speak off; but I should be sent and so well, and what's if you have been
assidumed it and already at once it will be attached," said Vronsky, "I
have announced for them," said the sacrate and sharing suffering.

"I'll go and shall be some sort feeling at the carpecial, by her clearness
and ashem of that here we should shy live fin that will astitious over to to
her husband. Have your hands with the presents of material action?'


Chapter 4 


Peter Surdons, the meaning of his words and all the strices he was
their, simply so all the things that to say, 
In [25]:
generate_text(tokenized_data, n_samples=1000, text_seed='She ', top_n=5)
INFO:tensorflow:Restoring parameters from checkpoints/e24.ckpt
She had never
been an iresting important above the sacrament the strange that
churged by the party, which they had not taken to stait to have so standing
and trinking all and sorred to ask her thinking."

"What? Then I don't know her to grasp the little old position."

"Why, was it that?" asked Kitty, thinking he was tea of her heart; and
he shaked her hair, and they all said: "Well, train then you suppose he's
to be teaching it, trancaition for those people. I suppose," he asked
words that they had stretched his way at the correct of the brilliant
worked, when he could never told her he could not see that he would be it
asked and thought of another couldn't say wanted to allow. The country
was not asked. And she stopped with harmly could be standing at their
character. All she would not come about that this, and this they all
waided, take to me that he wanted to say a sort; there was that she
meant towards the carriage.

His first distinctly answered he was the silence, and the plung had,

5.2 Final Model - Pick from top 1 most likely character

  • Not enough randomness - with top_n=1 the sampling is effectively greedy (always the single most likely character), so generation is deterministic and quickly falls into a repeating loop
In [26]:
generate_text(tokenized_data, n_samples=1000, text_seed='She ', top_n=1)
INFO:tensorflow:Restoring parameters from checkpoints/e24.ckpt
She was a strange that had been all the same thing to her husband and her
husband and her husband and her husband and her husband was a strange that
had been said that he was a strange thing that he had been at the same
time that he was a strange thing that he had been and he was a strange
thing that he had not seen her husband, and the same thing was the same
thing to her husband and her husband and her husband and her husband's
hands and the same thing was the same thing that he had been at the
same time that he was a strange thing that he had been and the same
thing to her husband and her husband and her husband and her husband's
hands and the same thing was the same thing that he had been at the
same time that he was a strange thing that he had been and the same
thing to her husband and her husband and her husband and her husband's
hands and the same thing was the same thing that he had been at the
same time that he was a strange thing that he had been and the same
thing to her husband

5.3 Final Model - Pick from top 2 most likely characters

  • Top 2 seems better, but there are still lots of repeated words and phrases
In [27]:
generate_text(tokenized_data, n_samples=1000, text_seed='She ', top_n=2)
INFO:tensorflow:Restoring parameters from checkpoints/e24.ckpt
She was a long while and to see him, and he had no serving, and to the
same thing was to be attracting herself that the children had no seen
from their son, and the province and the same serving and heart that
he was all that he had been to be at a stand of anything, but to see his
heart with his wife, which he had not told him to the strange that he was
staying and starting at him and the same sort of his son when he had
said that he had been angrily to her thoughts, which he had not told her
her face, and her hand with his heart and the same, and there was not a stand
to the state of the same strain of an intimacy, and her sisting of
his hands, with the state of the strange on the state, with a chief strain
of her soul with a smile, and straight at the stands and the conversation
that he had been and that the children had not been and his soul any
of the consequence of the station was a sense of his heart, as he had
said to herself, and he was so standing at the same time there was a
sen

5.4 Epoch 5, 10, 15, 20 Models - Pick from top 5 most likely characters

  • Taking a look at how the model learns over the epochs, sampling from the checkpoints saved at epochs 5, 10, 15, and 20
In [28]:
generate_text(tokenized_data, n_samples=1000, text_seed='She ', top_n=5, 
                 checkpoint='./checkpoints/e5.ckpt')
INFO:tensorflow:Restoring parameters from ./checkpoints/e5.ckpt
She that she had not annow had not seemed had been to been the
sarring. He had
both some
to did not call at the masters, bush though the counts of the same in a croan
tall his helpess, and the many as a cread interriagely to him thried in what said and thoughing the provent would be too and that it was have and all this were and
staking her hought to hour only take in shorl to anyway.

The sermed as to be to the contress of milent, but his his ofters and himself the potrer and the
bealond time of the plass of whom
his begard anyand woman to say a sitter into the departs of with has beginning. He desared off the
mors shooks, husband
that showing the plincess with so so treing, and was that the somether was through an him, and her husbouss with
surdly anywhicels of the mildran, and how thing her has tall the dore. Bet with the process and and his face, and always
he were to happored to to be to home to
angar astation. He did netr that the
cold, to happen though she was say all
her. She that 
In [29]:
generate_text(tokenized_data, n_samples=1000, text_seed='She ', top_n=5, 
                 checkpoint='./checkpoints/e10.ckpt')
INFO:tensorflow:Restoring parameters from ./checkpoints/e10.ckpt
She had not been
the sone, with the play of the soried somity to she had been all the
truck and with the morning. "Alexey Alexandrovitch, I should seaking
a standing free this about it all?.."

"No, I'm an idea, and should be the possible what has teartion it
wishing him in."

She had seet that, but had been dress of things. He had been saw the
same to to seemente to her so and the princess of the sacrast of
the position.

"If it! All, the sens of attering, as they house," he said in what he saw
what he had said to this same a strung, but shaking once the that which
she was not something when which had not talked it all of his hands to him
thoughts the creater and with the proness, and with the manshart and
helppended with the same sound to him, had been stonly were tolling
about her to him, and with the secon all this with almost watch androom,
he was the fairs, that seemed of a sort of whether the man had and was
consincing the call of all her seat in the coller waith at
the conversation
In [30]:
generate_text(tokenized_data, n_samples=1000, text_seed='She ', top_n=5, 
                 checkpoint='./checkpoints/e15.ckpt')
INFO:tensorflow:Restoring parameters from ./checkpoints/e15.ckpt
She was and thanking of was that way they had bared thooget, and those
pretending and their sunders were trying on him, and when he had seen
some some of an inclight of still outside this and would be to go
at the clooks at her healthy. "I am get to the fartherion," he said, with
a smile. "And I don't want to say that that's not seeing a some
seen on this about to say a love what have been answer. But it must
be so a single telegram."

There was a soul of the meaning to the mother, though that seemed in
the doorway in the clancest with the plain with hards weet as he was
continually from his hand, without her, his face, and as all she was a same.
Taking that herself, she had succored to be delivered, taken on this
was her sounds. She was so telling the princess to see, and there was
nothing and any misery, that they were thinking of all to the
driving room to take of the country to take that the cretty of
tears was a gentleman, as his bather were his face she was and how he
had been coming
In [31]:
generate_text(tokenized_data, n_samples=1000, text_seed='She ', top_n=5, 
                 checkpoint='./checkpoints/e20.ckpt')
INFO:tensorflow:Restoring parameters from ./checkpoints/e20.ckpt
She would
be stating on the painting, and staying a mere, and so the tone of
the way of his birit, and was serious fearful. He did not care to get
away. And there was no striking of himself will have need of the
same to show the marshal were never been in his soul, and a conversation.
They set to be coming up for a love of what had so married in them,
and she was as it was about his face and the careful force. They were
saying of all the country and the money in her force to her.

He had not conscious of the same thing. Two change to the provises of
his complete winds of the memory of that service--which had not called
in the present of the treater. But, at the morning, and his
stream, well sat down, an attempt of her forcestable
position, a desire all the same and that in a letter the peasants was
tried to gave it, and the soulder of their characteristic to be
thought, but was that he had been saying at the man who sent at the
side of the side. The sacress had been saying, but washed off 