RNNs in Tensorflow - A Practical Guide and Undocumented Features

In a previous tutorial series I went over some of the theory behind Recurrent Neural Networks (RNNs) and the implementation of a simple RNN from scratch. That’s a useful exercise, but in practice we use libraries like Tensorflow with high-level primitives for dealing with RNNs.

With such libraries, using an RNN should be as easy as calling a function, right? Unfortunately that’s not quite the case. In this post I want to go over some of the best practices for working with RNNs in Tensorflow, especially the functionality that isn’t well documented on the official site.

The post comes with a Github repository that contains Jupyter notebooks with minimal examples for:

Preprocessing Data: Use tf.SequenceExample #

Check out the tf.SequenceExample Jupyter Notebook here!

RNNs are used for sequential data that has inputs and/or outputs at multiple time steps. Tensorflow comes with a protocol buffer definition to deal with such data: tf.SequenceExample.

You can load data directly from your Python/Numpy arrays, but it’s probably in your best interest to use tf.SequenceExample instead. This data structure consists of a “context” for non-sequential features and “feature_lists” for sequential features. It’s somewhat verbose (it blew up my latest dataset by 10x), but it comes with a few benefits that are worth it:

Easy distributed training. Split up data into multiple TFRecord files, each containing many SequenceExamples, and use Tensorflow’s built-in support for distributed training.
Reusability. Other people can re-use your model by bringing their own data into tf.SequenceExample format. No model code changes required.
Use of Tensorflow data loading pipelines functions like tf.parse_single_sequence_example. Libraries like tf.learn also come with convenience function that expect you to feed data in protocol buffer format.
Separation of data preprocessing and model code. Using tf.SequenceExample forces you to separate your data preprocessing and Tensorflow model code. This is good practice, as your model shouldn’t make any assumptions about the input data it gets.

In practice, you write a little script that converts your data into tf.SequenceExample format and then writes one or more TFRecord files. These TFRecord files are parsed by Tensorflow to become the input to your model:

Convert your data into tf.SequenceExample format
Write one or more TFRecord files with the serialized data
Use tf.TFRecordReader to read examples from the file
Parse each example using tf.parse_single_sequence_example (Not in the official docs yet)

sequences = [[1, 2, 3], [4, 5, 1], [1, 2]]
label_sequences = [[0, 1, 0], [1, 0, 0], [1, 1]]

def make_example(sequence, labels):
    # The object we return
    ex = tf.train.SequenceExample()
    # A non-sequential feature of our example
    sequence_length = len(sequence)
    ex.context.feature["length"].int64_list.value.append(sequence_length)
    # Feature lists for the two sequential features of our example
    fl_tokens = ex.feature_lists.feature_list["tokens"]
    fl_labels = ex.feature_lists.feature_list["labels"]
    for token, label in zip(sequence, labels):
        fl_tokens.feature.add().int64_list.value.append(token)
        fl_labels.feature.add().int64_list.value.append(label)
    return ex

# Write all examples into a TFRecords file
with tempfile.NamedTemporaryFile() as fp:
    writer = tf.python_io.TFRecordWriter(fp.name)
    for sequence, label_sequence in zip(sequences, label_sequences):
        ex = make_example(sequence, label_sequence)
        writer.write(ex.SerializeToString())
    writer.close()

And, to parse an example:

# A single serialized example
# (You can read this from a file using TFRecordReader)
ex = make_example([1, 2, 3], [0, 1, 0]).SerializeToString()

# Define how to parse the example
context_features = {
    "length": tf.FixedLenFeature([], dtype=tf.int64)
}
sequence_features = {
    "tokens": tf.FixedLenSequenceFeature([], dtype=tf.int64),
    "labels": tf.FixedLenSequenceFeature([], dtype=tf.int64)
}

# Parse the example
context_parsed, sequence_parsed = tf.parse_single_sequence_example(
    serialized=ex,
    context_features=context_features,
    sequence_features=sequence_features
)

Batching and Padding Data #

Check out the Jupyer Notebook on Batching and Padding here!

Tensorflow’s RNN functions expect a tensor of shape [B, T, ...] as input, where B is the batch size and T is the length in time of each input (e.g. the number of words in a sentence). The last dimensions depend on your data. Do you see a problem with this? Typically, not all sequences in a single batch are of the same length T, but in order to feed them into the RNN they must be. Usually that’s done by padding them: Appending 0‘s to examples to make them equal in length.

Now imagine that one of your sequences is of length 1000, but the average length is 20. If you pad all your examples to length 1000 that would be a huge waste of space (and computation time)! That’s where batch padding comes in. If you create batches of size 32, you only need to pad examples within the batch to the same length (the maximum length of examples in that batch). That way, a really long example will only affect a single batch, not all of your data.

That all sounds pretty messy to deal with. Luckily, Tensorflow has built-in support for batch padding. If you set dynamic_pad=True when calling tf.train.batch the returned batch will be automatically padded with 0s. Handy! A lower-level option is to use tf.PaddingFIFOQueue.

Side note: Be careful with 0’s in your vocabulary/classes #

If you have a classification problem and your input tensors contain class IDs (0, 1, 2, …) then you need to be careful with padding. Because you are padding tensos with 0’s you may not be able to tell the difference between 0-padding and “class 0”. Whether or not this can be a problem depends on what your model does, but if you want to be safe it’s a good idea to not use “class 0” and instead start with “class 1”. (An example of where this may become a problem is in masking the loss function. More on that later).

# [0, 1, 2, 3, 4 ,...]
x = tf.range(1, 10, name="x")

# A queue that outputs 0,1,2,3,...
range_q = tf.train.range_input_producer(limit=5, shuffle=False)
slice_end = range_q.dequeue()

# Slice x to variable length, i.e. [0], [0, 1], [0, 1, 2], ....
y = tf.slice(x, [0], [slice_end], name="y")

# Batch the variable length tensor with dynamic padding
batched_data = tf.train.batch(
    tensors=[y],
    batch_size=5,
    dynamic_pad=True,
    name="y_batch"
)

# Run the graph
# tf.contrib.learn takes care of starting the queues for us
res = tf.contrib.learn.run_n({"y": batched_data}, n=1, feed_dict=None)

# Print the result
print("Batch shape: {}".format(res[0]["y"].shape))
print(res[0]["y"])

This gives:

Batch shape: (5, 4)
[0 0 0 0]
[1 0 0 0]
[1 2 0 0]
[1 2 3 0]
[1 2 3 4]]

And, the same with PaddingFIFOQueue:

# ... Same as above

# Creating a new queue
padding_q = tf.PaddingFIFOQueue(
    capacity=10,
    dtypes=tf.int32,
    shapes=[[None]])

# Enqueue the examples
enqueue_op = padding_q.enqueue([y])

# Add the queue runner to the graph
qr = tf.train.QueueRunner(padding_q, [enqueue_op])
tf.train.add_queue_runner(qr)

# Dequeue padded data
batched_data = padding_q.dequeue_many(5)

# ... Same as above

rnn vs. dynamic_rnn #

Check out the dynamic_rnn Jupyter Notebook here!

You may have noticed that Tensorflow contains two different functions for RNNs: tf.nn.rnn and tf.nn.dynamic_rnn. Which one to use?

Internally, tf.nn.rnn creates an unrolled graph for a fixed RNN length. That means, if you call tf.nn.rnn with inputs having 200 time steps you are creating a static graph with 200 RNN steps. First, graph creation is slow. Second, you’re unable to pass in longer sequences (> 200) than you’ve originally specified.

tf.nn.dynamic_rnn solves this. It uses a tf.While loop to dynamically construct the graph when it is executed. That means graph creation is faster and you can feed batches of variable size. What about performance? You may think the static rnn is faster than its dynamic counterpart because it pre-builds the graph. In my experience that’s not the case.

In short, just use tf.nn.dynamic_rnn. There is no benefit to tf.nn.rnn and I wouldn’t be surprised if it was deprecated in the future.

Passing sequence_length to your RNN #

Check out the Jupyter Notebook on Bidirectional RNNs here!

When using any of Tensorflow’s rnn functions with padded inputs it is important to pass the sequence_length parameter. In my opinion this parameter should be required, not optional. sequence_length serves two purposes: 1. Save computational time and 2. Ensure Correctness.

Let’s say you have a batch of two examples, one is of length 13, and the other of length 20. Each one is a vector of 128 numbers. The length 13 example is 0-padded to length 20. Then your RNN input tensor is of shape [2, 20, 128]. The dynamic_rnn function returns a tuple of (outputs, state), where outputs is a tensor of size [2, 20, ...] with the last dimension being the RNN output at each time step. state is the last state for each example, and it’s a tensor of size [2, ...] where the last dimension also depends on what kind of RNN cell you’re using.

So, here’s the problem: Once your reach time step 13, your first example in the batch is already “done” and you don’t want to perform any additional calculation on it. The second example isn’t and must go through the RNN until step 20. By passing sequence_length=[13,20] you tell Tensorflow to stop calculations for example 1 at step 13 and simply copy the state from time step 13 to the end. The output will be set to 0 for all time steps past 13. You’ve just saved some computational cost. But more importantly, if you didn’t pass sequence_length you would get incorrect results! Without passing sequence_length, Tensorflow will continue calculating the state until T=20 instead of simply copying the state from T=13. This means you would calculate the state using the padded elements, which is not what you want.

# Create input data
X = np.random.randn(2, 10, 8)

# The second example is of length 6
X[1,6:] = 0
X_lengths = [10, 6]

cell = tf.nn.rnn_cell.LSTMCell(num_units=64, state_is_tuple=True)

outputs, last_states = tf.nn.dynamic_rnn(
    cell=cell,
    dtype=tf.float64,
    sequence_length=X_lengths,
    inputs=X)

result = tf.contrib.learn.run_n(
    {"outputs": outputs, "last_states": last_states},
    n=1,
    feed_dict=None)

assert result[0]["outputs"].shape == (2, 10, 64)

# Outputs for the second example past past length 6 should be 0
assert (result[0]["outputs"][1,7,:] == np.zeros(cell.output_size)).all()

Bidirectional RNNs #

When using a standard RNN to make predictions we are only taking the “past” into account. For certain tasks this makes sense (e.g. predicting the next word), but for some tasks it would be useful to take both the past and the future into account. Think of a tagging task, like part-of-speech tagging, where we want to assign a tag to each word in a sentence. Here we already know the full sequence of words, and for each word we want to take not only the words to the left (past) but also the words to the right (future) into account when making a prediction. Bidirectional RNNs do exactly that. A bidirectional RNN is a combination of two RNNs – one runs forward from “left to right” and one runs backward from “right to left”. These are commonly used for tagging tasks, or when we want to embed a sequence into a fixed-length vector (beyond the scope of this post).

Just like for standard RNNs, Tensorflow has static and dynamic versions of the bidirectional RNN. As of the time of this writing, the bidirectional_dynamic_rnn is still undocumented, but it’s preferred over the static bidirectional_rnn.

The key differences of the bidirectional RNN functions are that they take a separate cell argument for both the forward and backward RNN, and that they return separate outputs and states for both the forward and backward RNN:

X = np.random.randn(2, 10, 8)

X[1,6,:] = 0
X_lengths = [10, 6]

cell = tf.nn.rnn_cell.LSTMCell(num_units=64, state_is_tuple=True)

outputs, states  = tf.nn.bidirectional_dynamic_rnn(
    cell_fw=cell,
    cell_bw=cell,
    dtype=tf.float64,
    sequence_length=X_lengths,
    inputs=X)

output_fw, output_bw = outputs
states_fw, states_bw = states

RNN Cells, Wrappers and Multi-Layer RNNs #

Check out the Jupyter Notebook on RNN Cells here!

All Tensorflow RNN functions take a cell argument. LSTMs and GRUs are the most commonly used cells, but there are many others, and not all of them are documented. Currently, the best way to get a sense of what cells are available is to look at at rnn_cell.py and contrib/rnn_cell.

As of the time of this writing, the basic RNN cells and wrappers are:

BasicRNNCell – A vanilla RNN cell.
GRUCell – A Gated Recurrent Unit cell.
BasicLSTMCell – An LSTM cell based on Recurrent Neural Network Regularization. No peephole connection or cell clipping.
LSTMCell – A more complex LSTM cell that allows for optional peephole connections and cell clipping.
MultiRNNCell – A wrapper to combine multiple cells into a multi-layer cell.
DropoutWrapper – A wrapper to add dropout to input and/or output connections of a cell.

and the contributed RNN cells and wrappers:

CoupledInputForgetGateLSTMCell – An extended LSTMCell that has coupled input and forget gates based on LSTM: A Search Space Odyssey.
TimeFreqLSTMCell – Time-Frequency LSTM cell based on Modeling Time-Frequency Patterns with LSTM vs. Convolutional Architectures for LVCSR Tasks
GridLSTMCell – The cell from Grid Long Short-Term Memory.
AttentionCellWrapper – Adds attention to an existing RNN cell, based on Long Short-Term Memory-Networks for Machine Reading.
LSTMBlockCell – A faster version of the basic LSTM cell (Note: this one is in lstm_ops.py)

Using these wrappers and cells is simple, e.g.

cell = tf.nn.rnn_cell.LSTMCell(num_units=64, state_is_tuple=True)
cell = tf.nn.rnn_cell.DropoutWrapper(cell=cell, output_keep_prob=0.5)
cell = tf.nn.rnn_cell.MultiRNNCell(cells=[cell] * 4, state_is_tuple=True)

Calculating sequence loss on padded examples #

Check out the Jupyter Notebook on Calculating Loss here!

For sequence prediction tasks we often want to make a prediction at each time step. For example, in Language Modeling we try to predict the next word for each word in a sentence. If all of your sequences are of the same length you can use Tensorflow’s sequence_loss and sequence_loss_by_example functions (undocumented) to calculate the standard cross-entropy loss.

However, as of the time of this writing sequence_loss does not support variable-length sequences (like the ones you get from a dynamic_rnn). Naively calculating the loss at each time step doesn’t work because that would take into account the padded positions. The solution is to create a weight matrix that “masks out” the losses at padded positions.

Here you can see why 0-padding can be a problem when you also have a “0-class”. If that’s the case you cannotuse tf.sign(tf.to_float(y)) to create a mask because that would mask out the “0-class” as well. You can still create a mask using the sequence length information, it just becomes more complicated.

# Batch size
B = 4
# (Maximum) number of time steps in this batch
T = 8
RNN_DIM = 128
NUM_CLASSES = 10

# The *acutal* length of the examples
example_len = [1, 2, 3, 8]

# The classes of the examples at each step (between 1 and 9, 0 means padding)
y = np.random.randint(1, 10, [B, T])
for i, length in enumerate(example_len):
    y[i, length:] = 0

# The RNN outputs
rnn_outputs = tf.convert_to_tensor(np.random.randn(B, T, RNN_DIM), dtype=tf.float32)

# Output layer weights
W = tf.get_variable(
    name="W",
    initializer=tf.random_normal_initializer(),
    shape=[RNN_DIM, NUM_CLASSES])

# Calculate logits and probs
# Reshape so we can calculate them all at once
rnn_outputs_flat = tf.reshape(rnn_outputs, [-1, RNN_DIM])
logits_flat = tf.batch_matmul(rnn_outputs_flat, W)
probs_flat = tf.nn.softmax(logits_flat)

# Calculate the losses
y_flat =  tf.reshape(y, [-1])
losses = tf.nn.sparse_softmax_cross_entropy_with_logits(logits_flat, y_flat)

# Mask the losses
mask = tf.sign(tf.to_float(y_flat))
masked_losses = mask * losses

# Bring back to [B, T] shape
masked_losses = tf.reshape(masked_losses,  tf.shape(y))

# Calculate mean loss
mean_loss_by_example = tf.reduce_sum(masked_losses, reduction_indices=1) / example_len
mean_loss = tf.reduce_mean(mean_loss_by_example)