Deep NLP: Sequential Models with RNNs


When we use NLP to deal with textual data, one key point we must understand is that the data always comes in the form of sequences, and the order of the data matters. For any given sentence, if the order of the words is changed, the meaning of the sentence doesn’t stay the same. Hence we can say that the information in a sentence is stored both in the words themselves and in the order of the words. Any type of data in which the sequential order matters is called sequential data.

Traditional neural networks cannot handle sequential data well. When we build a neural network for a particular task, we need to fix its input size at the beginning, but in sequential data the size can vary: a sentence can contain 5 words or 20 words, so we cannot configure a fixed-size network to deal with this kind of data effectively. Even in the ideal scenario where every sentence has the same number of words, a network with a fixed input size is not designed to pay attention to the order of the words. The model will learn from the semantic information of the individual words in the sentence, but it will fail to learn from the order in which those words appear.

To feed textual data into neural networks, we must first convert it into numerical form, i.e. into vectors. These can be either one-hot encoded vectors or word vectors; I have explained both in the previous Deep NLP article over here. Our textual data thus turns into a sequence of vectors, which is exactly the format we need.

Converting a word sequence into a sequence of word vectors
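As a concrete illustration of that conversion, here is a minimal sketch using one-hot vectors (the toy vocabulary and sentence below are made up purely for the example):

```python
import numpy as np

# Hypothetical toy vocabulary; in practice it would be built from the whole corpus.
vocab = ["deep", "learning", "is", "hard", "but", "fun"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a one-hot vector of length len(vocab) for the given word."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

sentence = "deep learning is fun".split()
vector_sequence = [one_hot(word) for word in sentence]  # a sequence of vectors
print(np.stack(vector_sequence).shape)  # (4, 6): 4 words, each a 6-dimensional vector
```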
Note: To fully understand this article, it is recommended to have a fundamental understanding of deep learning concepts like loss, backpropagation and weights. I have written about all of these concepts in my publication here. Please go through them to have a better understanding of this article.

Recurrent Neural Networks

To deal with sequential data, specialized neural networks called sequential models are used. One of the most basic sequential models is the Recurrent Neural Network (RNN). An RNN considers the information of each element of the sequence as well as the information of all the elements that came before it. To do this, an RNN has two inputs: one for the current word being fed in, and one for the accumulated information of all the previous words in the sentence. It also has two outputs. The first is the primary output, containing the prediction we are training the RNN for; we usually ignore this output until all the words have been fed into the RNN. The second output represents the accumulated information of all the words that have been input into the RNN so far. This is what the basic representation of an RNN looks like:

Simple representation of RNN’s inputs and outputs

So the entire process of feeding a sequence into an RNN is as follows:

  1. Feed the first word into the RNN’s first input. Since this is the first word, there is no previous information, so we feed zeros into the second input.
  2. At this stage, the RNN outputs the accumulated information of that word along with a primary output; we usually ignore the primary output until all the words of the sentence have been fed into the RNN.
  3. Then we feed the second word into the RNN’s first input and the accumulated information into its second input.
  4. The RNN then outputs the accumulated information again; this time it contains information about both the first and the second word.
  5. We continue this process until all the words have been fed into the RNN, and then we check the primary output to get the prediction for the sentence. The accumulated information keeps getting updated as the RNN processes each word (a small code sketch of this loop is shown right after this list).
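Putting the steps above into code, the process is just a loop that carries the accumulated information forward. In this minimal sketch, rnn_step is only a stand-in placeholder; what actually happens inside it is broken down in the “Inner Functioning of an RNN” section below:

```python
import numpy as np

def rnn_step(word_vector, previous_state):
    """Stand-in for one pass through the RNN; the real computation
    (weight matrices plus a tanh activation) is shown in a later section."""
    new_state = previous_state      # placeholder: a real RNN would update the state here
    primary_output = np.zeros(1)    # placeholder primary output
    return primary_output, new_state

# A toy sequence of 4 word vectors, each 6-dimensional (e.g. one-hot vectors).
vector_sequence = [np.zeros(6) for _ in range(4)]

state = np.zeros(8)                           # step 1: no previous information yet
for word_vector in vector_sequence:           # steps 2-4: state accumulates word by word
    output, state = rnn_step(word_vector, state)

prediction = output                           # step 5: only the final primary output is used
```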

An animated explanation is provided here:

There is only one RNN, with a single set of weights, processing all the words in the sentence, but for easier visualization we can unroll the entire process as follows:

Unrolled representation of RNN
It is very important to understand that there is only one RNN with a single set of weights in this scenario; we display four copies only for easier visualization of what’s going on.

State of an RNN

The RNN uses the accumulated information passed between steps as a kind of “memory”, so that it remembers all the words it has processed before. By the final step, the RNN has processed all the words in the sentence in exactly the required order. The accumulated information that is passed between steps is also called the “state” of the RNN.

The state is essentially just a vector of numbers which is passed from the output of the previous step to the input of the next step. As the RNN gets trained, it learns to modify the state so that it can derive relevant information about the words it has processed and the order in which they were processed.

The state (accumulated information) essentially gives the RNN “memory capabilities”, so that it knows all the words it has processed so far as well as the order in which they were processed.

Inner Functioning of an RNN

Let’s have a deeper look at the inner working of an RNN:

Inner functioning of an RNN

Let’s break down the entire process of what goes on inside an RNN:

First, the current word (the nth word) is multiplied by W1, a weight matrix in the RNN. The previous state (the (n−1)th state) is multiplied by W2, another weight matrix. The two resulting vectors are added together, and the sum is passed through an activation function to add non-linearity to the model; for RNNs we usually use the tanh activation function. The output of the activation function is the new (nth) state, which becomes the second input of the RNN at the next word. The new state is also multiplied by W3, a third weight matrix, to produce the primary output of the RNN. As mentioned earlier, this primary output is usually disregarded until all the words have been processed by the RNN.
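The same step can be written out in code. Below is a minimal NumPy sketch of a single RNN step; the input, state and output sizes are made-up values for illustration, and bias terms are omitted to keep it close to the description above:

```python
import numpy as np

input_size, state_size, output_size = 6, 8, 1

# Randomly initialised weight matrices (in practice these are learned during training).
W1 = np.random.randn(state_size, input_size) * 0.1   # applied to the current word
W2 = np.random.randn(state_size, state_size) * 0.1   # applied to the previous state
W3 = np.random.randn(output_size, state_size) * 0.1  # produces the primary output

def rnn_step(word_vector, previous_state):
    """One RNN step: combine the current word with the previous state."""
    new_state = np.tanh(W1 @ word_vector + W2 @ previous_state)
    primary_output = W3 @ new_state       # usually ignored until the last word
    return primary_output, new_state

# Example: one step with a dummy word vector and a zero initial state.
output, state = rnn_step(np.zeros(input_size), np.zeros(state_size))
print(output.shape, state.shape)          # (1,) (8,)
```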

Intuitive Explanation of an RNN

Let’s develop a more intuitive understanding of what’s going on here. As we feed a sequence of words into the RNN, the state gets updated for each word that is input. As a result, the state essentially becomes a representation of all the words which have been processed so far. And since the state gets updated in a sequential manner, it will also contain information about the order of the words as well as the words themselves.

To better understand this, let’s take the example sentence “Deep Learning is hard but fun” and consider the state at each step as the RNN processes it. When “Deep” is fed into the RNN, the state contains a representation of just the word “Deep”. Next, when we feed in “Learning”, the RNN updates the state to contain a representation of “Deep + Learning”. As the RNN continues to receive words from the sequence, the final state comes to contain a representation of “Deep + Learning + is + hard + but + fun”. If we rearrange the sentence to “Learning Deep is hard but fun”, the final state will represent “Learning + Deep + is + hard + but + fun”, which is different from the previous one even though the words are the same.

The final state of the RNN contains both the semantic information of the words in the sentence and sequential information about the order of the words.

Training RNNs

Now that we have seen how an RNN predicts once we feed it a sequence of words, let’s look into how the RNN trains itself to give meaningful predictions after learning from some labelled text data.

Let’s see how to train an RNN for sentiment analysis. In sentiment analysis, we are given a sentence and we need to predict the sentiment polarity of the sentence. The output will be either 1 (positive polarity) or 0 (negative polarity). This is essentially a text classification task. The data will be in the following format:

The dataset is a list of sentences and their corresponding sentiment scores. We need to train an RNN to output the corresponding sentiment score when we feed it a sentence.
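As a purely illustrative example (the sentences and labels below are made up, not taken from any real dataset), the data would look something like this in code:

```python
# Hypothetical (sentence, sentiment) pairs; 1 = positive polarity, 0 = negative polarity.
training_data = [
    ("deep learning is hard but fun", 1),
    ("this movie was confusing and boring", 0),
]
```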

In the case of sequential models, the model is used multiple times instead of once, as in traditional neural networks. For a 5-word sentence, an RNN passes values from input to output a total of 5 times. Because of this, backpropagation for a single training sample happens over all the steps the RNN has taken rather than over a single pass; this is also called “backpropagation through time” (BPTT).

Before the training process begins, we randomly initialize all 3 weight matrices in the RNN (W1, W2 and W3). Then we start feeding the data into the RNN. For a single training sample, we feed the entire sentence into the RNN and then compare its primary output with the target output for that sentence to calculate the loss. Since this is a binary classification problem, we use the binary cross-entropy loss function. Then we backpropagate the loss through all the steps of the RNN, from the last word to the first. Let’s visualize this:

Visual animation of backpropagation through time
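To make the training loop concrete, here is a minimal sketch using PyTorch. The choice of framework, the dimensions, the word indices and the single training sample are all made-up assumptions for illustration; the description above does not depend on any particular library:

```python
import torch
import torch.nn as nn

vocab_size, embed_size, state_size = 100, 16, 32   # made-up sizes for illustration

embedding = nn.Embedding(vocab_size, embed_size)        # word index -> word vector
rnn = nn.RNN(embed_size, state_size, batch_first=True)  # recurrent step (tanh inside)
classifier = nn.Linear(state_size, 1)                   # state -> primary output (like W3)
loss_fn = nn.BCEWithLogitsLoss()                        # binary cross-entropy loss

params = list(embedding.parameters()) + list(rnn.parameters()) + list(classifier.parameters())
optimizer = torch.optim.SGD(params, lr=0.01)

# One hypothetical training sample: word indices of a 6-word sentence, label 1 (positive).
word_indices = torch.tensor([[3, 17, 4, 52, 9, 28]])
label = torch.tensor([[1.0]])

for epoch in range(10):
    optimizer.zero_grad()
    vectors = embedding(word_indices)     # the sentence as a sequence of word vectors
    _, final_state = rnn(vectors)         # feed the whole sentence; keep the final state
    logit = classifier(final_state[-1])   # primary output after the last word
    loss = loss_fn(logit, label)          # compare with the target sentiment
    loss.backward()                       # backpropagation through time
    optimizer.step()                      # update the weights (and the embeddings)
```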

This explains the training process for a single training sample in the dataset. As we continue training the RNN with all the data in the dataset for multiple iterations, it will eventually learn to use the semantic information in each word as well as the sequential information in the way the words are ordered, and thus successfully predict the sentiment of a sentence. Now that we have seen how an RNN works, let’s discuss some of the shortcomings of this model.

Shortcomings of RNNs

RNNs are better than traditional neural networks at processing sequential information, but they still have certain limitations when it comes to processing sequential data. The main shortcomings of RNNs are:

  1. Vanishing Gradient Problem
  2. Long Term Dependencies

Vanishing Gradient Problem

RNNs perform moderately well with short to medium-sized sequences. However, when we perform BPTT over very long sequences, the gradients have to pass through many weight-matrix multiplications before reaching the first few steps of the RNN. As the number of multiplications increases, the gradients become very small, causing vanishing gradients. This is a common problem in deep learning when a neural network has too many layers; it occurs here because an RNN unrolled over a very long sequence is effectively equivalent to a neural network with many layers.

Hence we cannot use RNNs effectively on extremely long sequences, because the gradient contributions from the earliest steps become negligible and the weight matrices stop receiving meaningful updates from that part of the sequence.
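A rough numerical illustration of this effect is shown below. This is only a toy sketch, not an exact gradient computation: we simply multiply a vector repeatedly by the transpose of a small recurrent weight matrix, the way BPTT multiplies gradients step after step (the tanh derivative factors, which are at most 1, would only shrink the values further):

```python
import numpy as np

np.random.seed(0)
state_size = 8
W2 = np.random.randn(state_size, state_size) * 0.1   # recurrent weight matrix (small values)

gradient = np.ones(state_size)                        # pretend gradient at the last step
for step in range(1, 51):
    gradient = W2.T @ gradient                        # one more step back in time
    if step % 10 == 0:
        print(f"after {step} steps back, gradient norm = {np.linalg.norm(gradient):.2e}")
```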

Long Term Dependencies

In sentences, the semantic links between certain words can be very far apart. Consider the sentence: “Deep Learning is an upcoming field in today’s technology but it is very hard”. In this sentence, “hard” refers to “Deep Learning”, but the two are very far apart. Since the state of an RNN is updated at every step, it is very hard for it to capture semantic patterns between words that are far apart: by the time the RNN reaches the word “hard”, the influence of “Deep Learning” on the state is much weaker than that of the more recent words it has processed. Even if you give the RNN lots of training examples with long-term dependencies, the architecture itself limits its ability to capture them.

In the world of Deep NLP, researchers have come up with better sequential models, called LSTMs and GRUs, which solve these problems effectively. These are more advanced sequential models, and we will learn about them in future articles. Thanks for reading!