In [None]:
import numpy as np
import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# RNNs and LSTMs for Character-Level Language Models

In this tutorial, we reproduce Andrej Karpathy's character-level language model, which is described in his blog post: http://karpathy.github.io/2015/05/21/rnn-effectiveness/. The tutorial is largely based on that post, but has been updated to use PyTorch where applicable.

The objective of this character-level language model is to predict the next character given a sequence of previously observed characters. In this tutorial, we explore the ability of Recurrent Neural Networks (RNNs) and LSTMs to perform this task. Each character is converted into a one-hot vector. We can then view the softmax output of each RNN cell as a probability distribution over the possible next characters. From Karpathy's blog post, we show a visualization of the task:

![title](http://karpathy.github.io/assets/rnn/charseq.jpeg)
The output of each RNN cell is the probability distribution for the next character. We can then sample the next character using this distribution. At evaluation time, we can feed each sampled character into the RNN as an input, allowing us to generate a sequence of text.
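
To make the sampling step concrete, here is a minimal sketch in NumPy. The four-character vocabulary matches the figure, but the scores in `toy_logits` are made up for illustration and do not come from a trained model.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of scores."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical vocabulary and made-up output scores for one time step.
toy_vocab = ['h', 'e', 'l', 'o']
toy_logits = np.array([1.0, 2.2, -3.0, 4.1])

# Turn the scores into a probability distribution over the vocabulary.
probs = softmax(toy_logits)

# Draw the next character according to that distribution.
rng = np.random.default_rng(0)
next_index = rng.choice(len(toy_vocab), p=probs)
next_char = toy_vocab[next_index]
print(probs, next_char)
```

Sampling, rather than always taking the most likely character, is what lets the generated text vary from run to run.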

## Data Preprocessing
In this notebook, we will be training on a dataset of Shakespearean dialogue. First, let's download the data and inspect a sample passage.

In [None]:
import urllib.request
url = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
filename = 'shakespeare_data.txt'
urllib.request.urlretrieve(url, filename)


# Use a context manager so the file is closed even if reading fails
with open(filename, 'r') as data_file:
    raw_data = data_file.read()

print(raw_data[:200])

We will represent characters using a one-hot encoding. Each character in the vocabulary is first mapped to an index, and we then define a function to map this index to a one-hot vector and vice versa.
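
As a small illustration of the round trip between a character, its index, and its one-hot vector, consider the sketch below. The three-character vocabulary and the helper names `to_one_hot` and `from_one_hot` are hypothetical and used only for this example.

```python
import numpy as np

# Hypothetical three-character vocabulary, for illustration only.
toy_vocab = ['a', 'b', 'c']
toy_char_to_index = {char: index for index, char in enumerate(toy_vocab)}

def to_one_hot(index, length):
    """Return a vector of zeros with a single 1 at position `index`."""
    vec = np.zeros(length)
    vec[index] = 1
    return vec

def from_one_hot(vec):
    """Recover the index of the hot entry."""
    return int(np.argmax(vec))

encoded = to_one_hot(toy_char_to_index['b'], len(toy_vocab))
decoded = toy_vocab[from_one_hot(encoded)]
print(encoded, decoded)  # [0. 1. 0.] b
```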

In [None]:
data_length = len(raw_data)
# Sort the vocabulary so the character-to-index mapping is the same on every run
vocab = sorted(set(raw_data))
vocab_size = len(vocab)

char_to_index = { char:index for (index,char) in enumerate(vocab) }
index_to_char = { index:char for (index,char) in enumerate(vocab) }

print("The vocabulary contains {}".format(vocab))
print("------------------------------")
print("TOTAL NUM CHARACTERS = {}".format(data_length))
print("NUM UNIQUE CHARACTERS = {}".format(vocab_size))
print('char_to_index {}'.format(char_to_index))

This tutorial uses a simplistic method to extract sequences from the dataset: we chunk the data into evenly sized sub-sequences and discard whatever is left over. This is not good practice in reality, and may hurt performance!
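
The toy sketch below shows what chunking looks like on a short string. The helper `toy_chunks` is hypothetical. It assumes each chunk carries one extra character so that the targets can be the inputs shifted by one position, and it only keeps chunks that have all `seq_len + 1` characters.

```python
def toy_chunks(text, seq_len):
    """Split text into chunks of seq_len + 1 characters, stepping by seq_len."""
    n_chunks = (len(text) - 1) // seq_len
    return [text[i * seq_len : i * seq_len + seq_len + 1] for i in range(n_chunks)]

text = "hello world"
for chunk in toy_chunks(text, 4):
    # Inputs are all but the last character; targets are shifted by one.
    inputs, targets = chunk[:-1], chunk[1:]
    print(repr(inputs), '->', repr(targets))

# 'hell' -> 'ello'
# 'o wo' -> ' wor'
```

The trailing `"ld"` does not fill a whole chunk, so it is dropped.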

In [None]:
from torch.utils.data import Dataset, DataLoader

def create_one_hot(ind, length):
    """Convert index into one-hot vector, where the index is set to hot."""
    vec = np.zeros(length)
    vec[ind] = 1
    return vec

def chunk_data(raw_data, seq_len):
    """Splits raw data into evenly sized chunks."""
    chunks = []

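    # Each chunk needs seq_len + 1 characters (inputs plus shifted targets),
    # so only keep chunks that fit entirely within the data.
    for i in range((len(raw_data) - 1) // seq_len):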
        start = i * seq_len
        end = start + seq_len + 1
        chunk = raw_data[start:end]

        chunks.append(chunk)
    return chunks

def convert_dataset(dataset, char_to_index, vocab_size):
    """Convert dataset of character sequences into index and one hot data."""
    ind_dataset = []
    one_hot_dataset = []

    for seq in dataset:
        ind_seq = [char_to_index[c] for c in seq]
        one_hot_seq = [create_one_hot(ind, vocab_size) for ind in ind_seq]

        ind_dataset.append(ind_seq)
        one_hot_dataset.append(one_hot_seq)

    return np.array(ind_dataset), np.array(one_hot_dataset)

class ShakespeareDataset(Dataset):
    def __init__(self, inds, one_hot):
        self.inds = inds
        self.one_hot = one_hot

    def __len__(self):
        return self.one_hot.size(0)

    def __getitem__(self, idx):
        # Note that we offset the data here, so the target for each character
        # is the next character in the sequence.
        input_onehot = self.one_hot[idx, :-1, :]
        target_ind = self.inds[idx, 1:]

        return input_onehot, target_ind

CHUNK_LEN = 25

data_chunks = chunk_data(raw_data, CHUNK_LEN)
train_ind, train_oh = convert_dataset(data_chunks, char_to_index, vocab_size)

# Send data to the selected device (GPU if available, otherwise CPU)
train_ind_tt = torch.tensor(train_ind, dtype=torch.long, device=device)
train_oh_tt = torch.tensor(train_oh, dtype=torch.float32, device=device)

train_set = ShakespeareDataset(train_ind_tt, train_oh_tt)

# Recurrent Neural Networks

Below we define our own RNN implementation. Recall that the RNN performs the following operations:


$$
\begin{aligned}
h_t &= W_{ih} x_t + W_{hh} a_{t-1} + b_{ih} + b_{hh} \\
a_t &= \tanh(h_t) \\
o_t &= \text{softmax}(W_{ho} a_t + b_{ho})
\end{aligned}
$$
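
A single step of these operations can be sketched in NumPy as follows. The sizes and the randomly initialised weights are arbitrary stand-ins, and the sketch assumes the state carried between steps is the tanh-activated vector, which is what `MyRNNCell` feeds back in the implementation below.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, hidden_size, output_dim = 5, 3, 5  # toy sizes, chosen arbitrarily

# Randomly initialised parameters (stand-ins for trained weights).
W_ih = rng.normal(size=(hidden_size, obs_dim))
W_hh = rng.normal(size=(hidden_size, hidden_size))
b_ih = np.zeros(hidden_size)
b_hh = np.zeros(hidden_size)
W_ho = rng.normal(size=(output_dim, hidden_size))
b_ho = np.zeros(output_dim)

def rnn_step(x_t, a_prev):
    """Apply one RNN step to input x_t and previous activated state a_prev."""
    h_t = W_ih @ x_t + W_hh @ a_prev + b_ih + b_hh  # pre-activation
    a_t = np.tanh(h_t)                              # state passed to the next step
    scores = W_ho @ a_t + b_ho
    scores = scores - scores.max()                  # for numerical stability
    o_t = np.exp(scores) / np.exp(scores).sum()     # softmax
    return o_t, a_t

# One-hot input for the character with index 2, starting from a zero state.
x_t = np.zeros(obs_dim)
x_t[2] = 1.0
a_prev = np.zeros(hidden_size)

o_t, a_t = rnn_step(x_t, a_prev)
print(o_t, o_t.sum())
```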
 
 
You may find the following resources helpful for understanding how RNNs and LSTMs work:

* [The Unreasonable Effectiveness of RNNs (Andrej Karpathy)](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)
* [Recurrent Neural Networks Tutorial (Wild ML)](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/)
* [Understanding LSTM Networks (Chris Olah)](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
 
 
## Implementation

In [None]:
class MyRNNCell(nn.Module):
    def __init__(self, obs_dim, hidden_size, output_dim):
        """Initialize RNN Cell."""
        super().__init__()
        self.hidden_size = hidden_size
        
        # Merge input / hidden weights into single module.
        self.i2h = nn.Linear(obs_dim + hidden_size, hidden_size)
        self.h2o = nn.Linear(hidden_size, output_dim)

        self.tanh = nn.Tanh()
        self.softmax = nn.LogSoftmax(dim=1)
    
    def forward(self, data, hidden):
        """Compute forward pass for this RNN cell."""
        combined = torch.cat((data, hidden), 1)

        hidden = self.i2h(combined)
        hidden = self.tanh(hidden)

        output = self.h2o(hidden)
        output = self.softmax(output)

        return output, hidden
    
class MyRNN(nn.Module):
    def __init__(self, obs_dim, hidden_size, output_dim):
        """Initialize RNN."""
        super().__init__()
        self.hidden_size = hidden_size
        self.output_dim = output_dim

        self.rnn_cell = MyRNNCell(obs_dim, hidden_size, output_dim)

    def forward(self, x):
        """Compute forward pass on sequence x.
        
        Input sequence x has shape (B x L x D), where:
        B is batch size, L is sequence length, and D is the number of features.
        """
        batch_size, seq_len, n_feat = x.size()
        
        # Stores outputs of RNN cell. This is a gotcha: make sure Tensors
        # created in a model live on the same device as the input Tensors.
        output_arr = torch.zeros((batch_size, seq_len, self.output_dim),
                                 device=x.device)
        hidden_arr = torch.zeros((batch_size, seq_len, self.hidden_size),
                                 device=x.device)

        hidden = self.init_hidden(batch_size, x.device)

        for i in range(seq_len):
            # For each iteration, compute RNN on input for current position
            output, hidden = self.rnn_cell(x[:, i, :], hidden)

            output_arr[:, i, :] = output
            hidden_arr[:, i, :] = hidden

        return output_arr, hidden_arr

    def init_hidden(self, batch_size, device):
        """Initialize RNN hidden state.
        
        Some people advocate for using random noise instead of zeros, or 
        training for the initial state. Personally, I don't know if it matters!
        """
        return torch.zeros(batch_size, self.hidden_size, device=device)

## Training

In [None]:
def generate_seq(model, init_char_one_hot, length):
    """Generate sequence using autoregressive scheme.
    
    This is a little messy, but the core concept is to:

      1. Get the distribution of next characters from the RNN
      2. Use this distribution to sample the next character
      3. Feed the sampled character into the RNN, and repeat    
    """
    # The model resets its hidden state on every call, so we keep the whole
    # sequence generated so far and feed all of it back in at each step.
    seq = init_char_one_hot
    output = index_to_char[torch.argmax(seq.squeeze()).item()]

    with torch.no_grad():
        for i in range(length):
            out, _ = model(seq)

            # The model outputs log-probabilities, so exponentiate the last
            # time step to get a distribution we can sample from.
            p = np.exp(out[0, -1, :].cpu().numpy().astype(np.float64))
            out_ind = np.random.choice(vocab_size, p=p / p.sum())

            output += index_to_char[out_ind]

            # Append the sampled character to the input sequence
            next_char = torch.zeros(1, 1, vocab_size, device=seq.device)
            next_char[0, 0, out_ind] = 1
            seq = torch.cat((seq, next_char), dim=1)

    return output

In [None]:
def train_loop(model, optimizer, train_loader, n_epochs, test_char=None):
    for epoch in range(n_epochs):
        avg_loss = []
        for input_seq, target_ind in train_loader:
            optimizer.zero_grad()

            output, _ = model(input_seq)
            loss = nn.NLLLoss()(output.transpose(1, 2), target_ind)

            loss.backward()
            
            # =================================================================
            # This is how to do gradient clipping in PyTorch
            torch.nn.utils.clip_grad_norm_(model.parameters(), 5)
            # =================================================================
            
            optimizer.step()
            avg_loss.append(loss.item())

        print('Epoch {} : Avg Train Loss {}'.format(epoch, np.mean(avg_loss)))

        # Generate sequence
        if test_char is not None:
            gen_seq = generate_seq(model, test_char, 100)
            print("Generated Sequence:\n {}".format(gen_seq))
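
To see what the gradient clipping call in `train_loop` does, here is a small standalone sketch. The linear model and the artificially scaled loss are hypothetical and exist only to produce large gradients.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_model = nn.Linear(10, 1)

# Scale the loss up so that the gradients are deliberately large.
x = torch.randn(8, 10)
loss = (1000 * toy_model(x)).pow(2).mean()
loss.backward()

max_norm = 5.0
# clip_grad_norm_ rescales the gradients in place and returns the total
# gradient norm measured before clipping.
norm_before = torch.nn.utils.clip_grad_norm_(toy_model.parameters(), max_norm)

# Recompute the total norm over all parameters after clipping.
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in toy_model.parameters()))
print(float(norm_before), float(norm_after))
```

All gradients are scaled by the same factor, so clipping changes the size of the update but not its direction.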

In [None]:
HIDDEN_SIZE = 100
N_EPOCH = 100
LR = 0.01
BATCH_SIZE = 64
SAMP_CHAR = 'a'

train_loader = DataLoader(train_set, batch_size=BATCH_SIZE, shuffle=True)

model = MyRNN(vocab_size, HIDDEN_SIZE, vocab_size).to(device)
optim = torch.optim.Adam(model.parameters(), lr=LR)

test_char = create_one_hot(char_to_index[SAMP_CHAR], vocab_size)
test_char_tt = torch.Tensor(test_char).view(1, 1, -1).float().to(device)

train_loop(model, optim, train_loader, N_EPOCH, test_char_tt)

# LSTMs

Long short-term memory (LSTM) units contain a cell state, which allows long-term dependencies to propagate through the RNN. The LSTM is represented by:

$$
        \begin{array}{ll} \\
            i_t = \sigma(W_{ii} x_t + b_{ii} + W_{hi} h_{t-1} + b_{hi}) \\
            f_t = \sigma(W_{if} x_t + b_{if} + W_{hf} h_{t-1} + b_{hf}) \\
            g_t = \tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{t-1} + b_{hg}) \\
            o_t = \sigma(W_{io} x_t + b_{io} + W_{ho} h_{t-1} + b_{ho}) \\
            c_t = f_t \odot c_{t-1} + i_t \odot g_t \\
            h_t = o_t \odot \tanh(c_t) \\
        \end{array}
$$
        
Due to the assignment this year, we won't be showing a detailed implementation here. Instead, we use the built-in PyTorch LSTM:  
https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html
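
Before wrapping `nn.LSTM` in a model, it can help to check the shapes it returns. The sketch below uses arbitrary toy sizes and an all-zero input. With `batch_first=True` the input and output are laid out as (batch, sequence, features), while the returned hidden and cell states are laid out as (layers, batch, hidden).

```python
import torch
import torch.nn as nn

# Toy sizes, chosen arbitrarily for this shape check.
batch_size, seq_len, n_feat = 4, 25, 65
hidden_size, num_layers = 16, 2

lstm = nn.LSTM(n_feat, hidden_size, num_layers=num_layers, batch_first=True)
x = torch.zeros(batch_size, seq_len, n_feat)

# Without an explicit initial state, the hidden and cell states start at zero.
output, (h_n, c_n) = lstm(x)
print(output.shape)  # torch.Size([4, 25, 16])
print(h_n.shape)     # torch.Size([2, 4, 16])
print(c_n.shape)     # torch.Size([2, 4, 16])

# Passing the returned state back in continues from where the sequence stopped.
step_out, (h_n, c_n) = lstm(x[:, :1, :], (h_n, c_n))
print(step_out.shape)  # torch.Size([4, 1, 16])
```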

In [None]:
class MyLSTM(nn.Module):
    """Wraps the built-in PyTorch LSTM with a linear output layer."""

    def __init__(self, obs_dim, hid_size, num_layers):
        """Initialize MyLSTM."""
        super().__init__()
        # Using built-in PyTorch LSTM, see source code for implementation.
        self.lstm = nn.LSTM(obs_dim, hid_size, num_layers=num_layers, 
                            batch_first=True)

        self.out = nn.Linear(hid_size, obs_dim)
        self.act = nn.LogSoftmax(dim=2)

    def forward(self, x):
        lstm_out = self.lstm(x)[0]
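        # Dividing the logits by 0.5 acts as a fixed softmax temperature.
        # Note that it is applied during training as well as sampling.
        out = self.act(self.out(lstm_out) / 0.5)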
        return out, None
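
The division by `0.5` in `forward` rescales the scores before the softmax, which is commonly described as a temperature. The sketch below, using made-up scores, shows the general effect: a temperature below 1 sharpens the distribution, and a temperature above 1 flattens it.

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.0])  # made-up scores for three characters

# Same scores, three different temperatures.
p_base = torch.softmax(logits / 1.0, dim=0)
p_sharp = torch.softmax(logits / 0.5, dim=0)
p_flat = torch.softmax(logits / 2.0, dim=0)

for name, probs in [('T=1.0', p_base), ('T=0.5', p_sharp), ('T=2.0', p_flat)]:
    print(name, probs)
```

With the lower temperature, more probability mass goes to the highest-scoring character, so samples are more conservative.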


In [None]:
HIDDEN_SIZE = 512
NUM_LAYERS = 3
N_EPOCH = 10000

LR = 0.001
BATCH_SIZE = 32
SAMP_CHAR = 'a'

train_loader = DataLoader(train_set, batch_size=BATCH_SIZE, shuffle=True)

model = MyLSTM(vocab_size, HIDDEN_SIZE, NUM_LAYERS).to(device)
optim = torch.optim.Adam(model.parameters(), lr=LR)

test_char = create_one_hot(char_to_index[SAMP_CHAR], vocab_size)
test_char_tt = torch.Tensor(test_char).view(1, 1, -1).float().to(device)

train_loop(model, optim, train_loader, N_EPOCH, test_char_tt)

# Wrap-Up

Apparently it takes several hours to train this model, which we obviously don't have!

Refer to the website for results: http://karpathy.github.io/2015/05/21/rnn-effectiveness/. 

A 100,000 character sample output of the trained LSTM model can be found at: https://cs.stanford.edu/people/karpathy/char-rnn/shakespear.txt. It seems very good! We include a snippet of the output from the website below:

>PANDARUS:
Alas, I think he shall be come approached and the day
When little srain would be attain'd into being never fed,
And who is but a chain and subjects of his death,
I should not sleep.

>Second Senator:
They are away this miseries, produced upon my soul,
Breaking and strongly should be buried, when I perish
The earth and thoughts of many states.

>DUKE VINCENTIO:
Well, your wit is in the care of side and that.

>Second Lord:
They would be ruled after this chamber, and
my fair nues begun out of the fact, to be conveyed,
Whose noble souls I'll have the heart of the wars.

>Clown:
Come, sir, I will make did behold your worship.

>VIOLA:
I'll drink it.