Recurrent Neural Networks
April 15, 2020 — 18 min
Recurrent neural networks are widely used deep learning models applied to sequence data: time series forecasting, speech recognition, sentiment classification, machine translation, named entity recognition, etc.
The use of feedforward neural networks on sequence data raises two major problems:
 Inputs and outputs can have different lengths in different examples
 MLPs do not share features learned across different positions of the data sample
In this article we will discover the mathematics behind the success of RNNs, as well as some special types of cells such as LSTMs and GRUs. We will finally dig into encoder-decoder architectures combined with attention mechanisms.
Table of contents
 Notation
 RNN model
 Different types of RNNs
 Advanced types of cells
 Encoder & Decoder architecture
 Attention mechanisms
1. Notation
As an illustration, we will consider the task of Named Entity Recognition, which consists of locating and identifying named entities such as proper names:
We denote:
 $x^{(i)}_t$: the $t^{th}$ element of the input sequence $x^{(i)}$
 $x^{(i)}$: the $i^{th}$ input sequence
 $N_x^{(i)}$: the length of $x^{(i)}$
 $y^{(i)}_t$: the $t^{th}$ element of the output sequence $y^{(i)}$
 $y^{(i)}$: the $i^{th}$ output sequence
 $N_y^{(i)}$: the length of $y^{(i)}$
When dealing with non-numerical data, text for instance, it is very important to encode it into numerical vectors: this operation is called embedding. One of the most famous ways to encode text is BERT, which was developed by Google.
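As a toy illustration of the idea (the vocabulary, dimensions and random vectors below are made up for the example; a real embedding such as BERT is learned from data), an embedding is simply a lookup from tokens to dense vectors:

```python
import numpy as np

# Toy embedding lookup: each word in a small vocabulary is mapped to a
# dense vector. Real systems learn these vectors from data.
rng = np.random.default_rng(0)
vocab = {"the": 0, "cat": 1, "sat": 2}
embed_dim = 4
E = rng.normal(size=(len(vocab), embed_dim))  # embedding matrix

def embed(sentence):
    """Turn a list of words into a (seq_len, embed_dim) array."""
    return E[[vocab[w] for w in sentence]]

x = embed(["the", "cat", "sat"])
print(x.shape)  # (3, 4)
```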
2. RNN model
RNNs are a special case of neural networks where the parameters of the model, as well as the operations performed, are the same throughout the architecture. The network performs the same task for each element of a sequence, and its output depends on both the current input and the previous state of the memory.
The graph below shows a recurrent network with a single hidden memory layer:
Equations
The variables in the architecture are:
 $x_t$: the input at time t
 $h_t$: the state of the memory at time t
 $y_t$: the output at time t
where:
$h_t=\phi(U.x_t+W.h_{t-1})$ and $y_t = \psi(V.h_t)$
$h_{-1}$ is randomly initialized; $\phi$ and $\psi$ are nonlinear functions; $U$, $V$, and $W$ are the parameters of the linear transformations preceding the nonlinear activations.
It is important to note that these parameters are the same throughout the architecture.
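The recurrence above can be sketched in a few lines of NumPy (the dimensions and random weights are illustrative assumptions; here $\phi=\tanh$ and $\psi$ is taken as the identity):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_h, n_out = 3, 5, 2
U = rng.normal(scale=0.1, size=(n_h, n_in))   # input -> hidden
W = rng.normal(scale=0.1, size=(n_h, n_h))    # hidden -> hidden (shared over time)
V = rng.normal(scale=0.1, size=(n_out, n_h))  # hidden -> output

def rnn_forward(xs, h):
    """Run the recurrence h_t = tanh(U x_t + W h_{t-1}), y_t = V h_t."""
    ys = []
    for x in xs:
        h = np.tanh(U @ x + W @ h)  # phi = tanh
        ys.append(V @ h)            # psi = identity here
    return np.array(ys), h

xs = rng.normal(size=(7, n_in))     # a sequence of 7 inputs
ys, h_last = rnn_forward(xs, np.zeros(n_h))
print(ys.shape)  # (7, 2)
```

Note that the same $U$, $W$, $V$ are reused at every time step, which is exactly the parameter sharing described above.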
Applications
Recurrent neural networks have significantly improved sequential models, in particular:
 NLP tasks: language modeling and text generation
 Machine translation
 Speech recognition
We summarize the above applications in the following table:
| Application | Description | Input | Output | Example |
|---|---|---|---|---|
| NLP | Text modeling and generation | $x_t$ is the $t^{th}$ embedded word of a given text | $(y_t)$ can be the answer to the question $(x_t)$ | Chatbots, text classification, … |
| Machine translation | Requires reading the entire sentence before translating it | $x_t$ is the $t^{th}$ embedded word of the source text | $(y_t)$ is the translation of the source text $(x_t)$ | Google Translate, DeepL, … |
| Speech recognition | Recognize the content of a given recording | $x_t$ is the value of the audio file's spectrogram at instant $t$ | $(y_t)$ is the text or the command extracted from $(x_t)$ | Google Home, Alexa, … |
Learning algorithm
As in classical neural networks, learning in the case of recurrent networks is done by optimizing a cost function with respect to $U$, $V$ and $W$. In other words, we aim to find the parameters that yield the best prediction/approximation $\hat{y}_i$ of the real value $y_i$, starting from the input $x_i$.
For this, we define an objective function, called the loss function and denoted $J$, which quantifies the distance between the real and the predicted values over the whole training set.
We minimize J following two major steps:
 Forward propagation: we propagate the data through the network, either entirely or in batches, and compute the loss on this batch, which is nothing but the sum of the errors committed at the predicted outputs of the different rows.
 Backpropagation through time: we compute the gradients of the cost function with respect to the different parameters, then apply a descent algorithm to update them. It is called BPTT since the gradient at each output depends both on the elements of the same instant and on the state of the memory at the previous instant.
We iterate this process a number of times called the number of epochs.
After defining the architecture, the learning algorithm is written as follows:
 Initialization of the model parameters, a step equivalent to injecting noise into the model.
For k = 1, 2, …, N (N being the number of epochs):
Perform forward propagation:
 $\forall i$, compute the predicted value of $(x_t^{(i)})$ through the neural network: $(\hat{y}_t^{(i)})$
 Evaluate the function: $J(\theta)=\frac{1}{m}\sum_{i=1}^m\sum_{t=1}^{N_y^{(i)}} \mathcal{L}(\hat{y}_t^{(i)}, y_t^{(i)})$ where $m$ is the size of the training set, $\theta$ the model parameters and $\mathcal{L}$ the cost${}^{(*)}$ function
Perform backpropagation through time:
 Apply a descent method to update the parameters: $\theta := G(\theta)$
${}^{(*)}$ The cost function $\mathcal{L}$ evaluates the distance between the real and the predicted values at a single point.
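The cost $J(\theta)$ above is a plain double sum over examples and time steps. A minimal sketch, assuming a squared-error loss $\mathcal{L}$ and dummy predictions (both are illustrative assumptions):

```python
import numpy as np

# Hypothetical example of evaluating J(theta): average over m sequences of the
# per-step losses L(y_hat_t, y_t); here L is the squared error at one step.
def step_loss(y_hat, y):
    return float(np.sum((y_hat - y) ** 2))

def total_cost(preds, targets):
    """J = (1/m) * sum_i sum_t L(y_hat_t^(i), y_t^(i))."""
    m = len(preds)
    return sum(step_loss(p_t, y_t)
               for pred, target in zip(preds, targets)
               for p_t, y_t in zip(pred, target)) / m

preds   = [np.zeros((4, 2)), np.zeros((6, 2))]  # two sequences of predictions
targets = [np.ones((4, 2)),  np.ones((6, 2))]   # matching labels
print(total_cost(preds, targets))  # (4*2 + 6*2) / 2 = 10.0
```

Note that sequences of different lengths $N_y^{(i)}$ pose no problem for the cost itself; the inner sum simply runs over each sequence's own length.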
Forward propagation
Let us consider the prediction of the output of a single sequence, denoted $(x^{(j)}_t)$, through the neural network. At each instant $t$, we compute:
$h^{(j)}_t=\phi(U.x_t^{(j)}+W.h_{t-1}^{(j)})\text{ and } y_t^{(j)} = \psi(V.h_t^{(j)})$
until reaching the end of the sequence. Again, the parameters $U$, $W$ and $V$ remain the same all along the neural network. When dealing with an $m$-row dataset, repeating these operations separately for each row is very costly. Hence, we truncate the dataset so that all sequences share the same timeline, i.e. $\forall i,\ N_x^{(i)}=N_x$ and $N_y^{(i)}=N_y$, and we can use linear algebra to parallelize the computation as follows:
$H_t=\phi(U.X_t+W.H_{t-1})\text{ and } Y_t = \psi(V.H_t)$
where
$H_t=\begin{bmatrix} h^{(j)}_t \end{bmatrix}_{j\in [1,m]} \text{ and } Y_t=\begin{bmatrix} y_t^{(j)} \end{bmatrix}_{j\in [1,m]}$
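The parallelized recurrence can be sketched by stacking the $m$ examples column-wise, so one matrix product per step handles the whole batch (shapes and random weights below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n_in, n_h, n_out, T = 8, 3, 5, 2, 10  # batch of m equal-length sequences
U = rng.normal(scale=0.1, size=(n_h, n_in))
W = rng.normal(scale=0.1, size=(n_h, n_h))
V = rng.normal(scale=0.1, size=(n_out, n_h))

X = rng.normal(size=(T, n_in, m))  # X[t] stacks the m inputs x_t^(j) column-wise
H = np.zeros((n_h, m))             # H_{-1}
Ys = []
for t in range(T):
    H = np.tanh(U @ X[t] + W @ H)  # one matrix product handles all m examples
    Ys.append(V @ H)
Y = np.stack(Ys)
print(Y.shape)  # (10, 2, 8)
```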
Backpropagation Through Time
Backpropagation is the second step of the learning: it consists of injecting the error committed in the prediction (forward) phase back into the network and updating its parameters to perform better at the next iteration; hence the optimization of the function $J$, usually through a descent method.
For simplicity, we will consider that $h_t=U.x_t+W.h_{t-1}$ and $y_t = V.h_t$
For a given sequence $(x_t)$, we perform the forward propagation and compute the following cost:
$L(U,W,V)=\sum_{t=1}^{N_y} \mathcal{L}(\hat{y}_t, y_t)$
and compute the gradient of L defined as follows:
$\nabla L(U,W,V)=\begin{bmatrix} \frac{\partial L}{\partial U} & \frac{\partial L}{\partial W} & \frac{\partial L}{\partial V} \end{bmatrix}^T$
$\frac{\partial L}{\partial V}=\sum_{t=1}^{N_y} \frac{\partial \mathcal{L}(\hat{y}_t, y_t)}{\partial V}=\sum_{t=1}^{N_y}\frac{d\mathcal{L}(\hat{y}_t, y_t)}{d\hat{y}_t}\frac{d\hat{y}_t}{dV}$
$\rightarrow \frac{\partial L}{\partial V}=\sum_{t=1}^{N_y}\frac{d\mathcal{L}(\hat{y}_t, y_t)}{d\hat{y}_t}.h_t$
and
$\frac{\partial L}{\partial W}=\sum_{t=1}^{N_y} \frac{\partial \mathcal{L}(\hat{y}_t, y_t)}{\partial W}=\sum_{t=1}^{N_y}\frac{d\mathcal{L}(\hat{y}_t, y_t)}{d\hat{y}_t}.\frac{d\hat{y}_t}{dh_t}.\frac{dh_t}{dW}=\sum_{t=1}^{N_y}\frac{d\mathcal{L}(\hat{y}_t, y_t)}{d\hat{y}_t}.V.\frac{dh_t}{dW}$
$\frac{\partial L}{\partial U}=\sum_{t=1}^{N_y} \frac{\partial \mathcal{L}(\hat{y}_t, y_t)}{\partial U}=\sum_{t=1}^{N_y}\frac{d\mathcal{L}(\hat{y}_t, y_t)}{d\hat{y}_t}.\frac{d\hat{y}_t}{dh_t}.\frac{dh_t}{dU}=\sum_{t=1}^{N_y}\frac{d\mathcal{L}(\hat{y}_t, y_t)}{d\hat{y}_t}.V.\frac{dh_t}{dU}$
We know that:
$h_{t+1}=U.x_{t+1}+W.h_{t}\rightarrow \frac{\partial h_{t+1}}{\partial h_t}=W^T\rightarrow \frac{\partial h_{N}}{\partial h_t}=(W^T)^{N-t}$
Then,
$\frac{dh_t}{dW}=\sum_{j=1}^t(W^T)^{t-j}h_j \quad\text{and}\quad \frac{dh_t}{dU}=\sum_{j=1}^t(W^T)^{t-j}x_j$
Finally:
$\frac{\partial L}{\partial U}=\sum_{t=1}^{N_y}\frac{d\mathcal{L}(\hat{y}_t, y_t)}{d\hat{y}_t}.V.\sum_{j=1}^t(W^T)^{t-j}x_j$
$\frac{\partial L}{\partial W}=\sum_{t=1}^{N_y}\frac{d\mathcal{L}(\hat{y}_t, y_t)}{d\hat{y}_t}.V.\sum_{j=1}^t(W^T)^{t-j}h_j$
$\frac{\partial L}{\partial V}=\sum_{t=1}^{N_y}\frac{d\mathcal{L}(\hat{y}_t, y_t)}{d\hat{y}_t}.h_t$
We can now apply a descent method as detailed in my previous article.
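Putting forward propagation and BPTT together, here is a minimal end-to-end sketch (with $\phi=\tanh$, $\psi$ the identity, and a squared-error loss; the toy data, sizes, and learning rate are arbitrary assumptions) showing the loss decreasing over epochs:

```python
import numpy as np

rng = np.random.default_rng(2)
n_in, n_h, n_out, T = 2, 8, 1, 12
U = rng.normal(scale=0.3, size=(n_h, n_in))
W = rng.normal(scale=0.3, size=(n_h, n_h))
V = rng.normal(scale=0.3, size=(n_out, n_h))

xs = rng.normal(size=(T, n_in))
ts = np.sin(np.arange(T))[:, None] * 0.5  # toy targets

def forward(U, W, V):
    h = np.zeros(n_h)
    hs, ys = [h], []
    for x in xs:
        h = np.tanh(U @ x + W @ h)
        hs.append(h)
        ys.append(V @ h)
    loss = 0.5 * sum(np.sum((y - t) ** 2) for y, t in zip(ys, ts))
    return loss, hs, ys

def bptt(U, W, V):
    loss, hs, ys = forward(U, W, V)
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    dh_next = np.zeros(n_h)
    for t in reversed(range(T)):
        dy = ys[t] - ts[t]               # dL/dy_t for the squared error
        dV += np.outer(dy, hs[t + 1])    # hs[t+1] is h_t
        dh = V.T @ dy + dh_next          # gradient from y_t and from h_{t+1}
        da = (1.0 - hs[t + 1] ** 2) * dh # back through tanh
        dU += np.outer(da, xs[t])
        dW += np.outer(da, hs[t])        # hs[t] is h_{t-1}
        dh_next = W.T @ da               # this carries the (W^T)^(t-j) terms
    return loss, dU, dW, dV

lr, losses = 0.02, []
for epoch in range(100):
    loss, dU, dW, dV = bptt(U, W, V)
    U -= lr * dU; W -= lr * dW; V -= lr * dV
    losses.append(loss)
print(losses[0] > losses[-1])  # True: the loss decreases
```

The `dh_next = W.T @ da` line is where the repeated multiplication by $W^T$ appears in code, which is the source of the memory problem discussed next.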
Memory problem
There are several fields where we are interested in predicting the evolution of a time series based on its history: music, finance, emotions, etc.
The basic recurrent networks described above, called 'vanilla' RNNs, suffer from a weak memory, unable to take several elements of the past into account when predicting the future.
With this in mind, various extensions of RNNs have been designed to extend the internal memory: bidirectional neural networks, LSTM cells, attention mechanisms, etc. Memory enlargement can be crucial in certain fields, such as finance, where one seeks to memorize as much history as possible in order to predict a financial series.
The learning phase of an RNN may also suffer from vanishing or exploding gradient problems, since the gradient of the cost function includes the term $(W^T)^{t-j}$, which affects its memorizing capacity.
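The role of the term $(W^T)^{t-j}$ can be checked numerically: rescaling a random $W$ so that its spectral radius is below or above 1 makes the powers vanish or explode as the time gap $t-j$ grows (the matrix size and scales below are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.normal(size=(16, 16))
rho = np.max(np.abs(np.linalg.eigvals(W)))  # spectral radius of W
W_small = 0.5 * W / rho  # spectral radius 0.5 -> gradients vanish
W_big   = 1.5 * W / rho  # spectral radius 1.5 -> gradients explode

for gap in (1, 10, 50):  # the time gap t - j in (W^T)^(t-j)
    n_small = np.linalg.norm(np.linalg.matrix_power(W_small.T, gap), 2)
    n_big   = np.linalg.norm(np.linalg.matrix_power(W_big.T, gap), 2)
    print(gap, n_small, n_big)
```

At a gap of 50, the first norm is numerically zero while the second exceeds $1.5^{50} \approx 6\times10^8$: long-range gradient contributions are either lost or blow up.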
3. Different types of RNNs
There are several extensions of the classic or 'vanilla' recurrent neural network; these extensions have been designed to increase the memory capacity of the network as well as its feature-extraction capacity.
The illustration below summarizes the different extensions:
There exist other types of RNNs that have a specifically designed hidden layer which we will discuss in the next chapter.
4. Advanced types of cells
GRU
GRU (Gated Recurrent Unit) cells allow the recurrent network to save more historical information for a better prediction. They introduce an update gate, which determines the quantity of information to keep from the past, as well as a reset gate, which sets the quantity of information to forget.
The graph below schematizes the GRU cell:
Equations
We define the equations in the GRU cell as follows:
 Update gate: $z_t=\phi(W^{zx}x_{t}+W^{zh}h_{t-1})$
 Reset gate: $r_t=\phi(W^{rx}x_{t}+W^{rh}h_{t-1})$
 Current memory content: $c_t=\tanh(W^{cx}x_{t}+W^{ch}(r_t.h_{t-1}))$
 Memory state: $h_t=(1-z_t).h_{t-1}+z_t.c_t$
$\phi$ is a nonlinear activation function and the parameters $W^{.}$ are learned by the model.
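The four equations above translate directly into a single GRU step. A minimal NumPy sketch ($\phi$ taken as the sigmoid; dimensions and random weights are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
n_in, n_h = 3, 5
Wzx, Wzh = rng.normal(scale=0.1, size=(n_h, n_in)), rng.normal(scale=0.1, size=(n_h, n_h))
Wrx, Wrh = rng.normal(scale=0.1, size=(n_h, n_in)), rng.normal(scale=0.1, size=(n_h, n_h))
Wcx, Wch = rng.normal(scale=0.1, size=(n_h, n_in)), rng.normal(scale=0.1, size=(n_h, n_h))

def gru_step(x, h_prev):
    z = sigmoid(Wzx @ x + Wzh @ h_prev)        # update gate: how much to renew
    r = sigmoid(Wrx @ x + Wrh @ h_prev)        # reset gate: how much to forget
    c = np.tanh(Wcx @ x + Wch @ (r * h_prev))  # candidate memory content
    return (1 - z) * h_prev + z * c            # new memory state

h = np.zeros(n_h)
for x in rng.normal(size=(6, n_in)):
    h = gru_step(x, h)
print(h.shape)  # (5,)
```

When $z_t \approx 0$ the old state passes through almost unchanged, which is how the GRU preserves information over long gaps.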
LSTM
LSTMs (Long Short-Term Memory) were also introduced to overcome the problem of short memory; they carry roughly four times more parameters than vanilla RNNs. This model uses the notion of gates, and has three:
 Input gate $i$: controls the flow of incoming information.
 Forget gate $f$: controls the amount of information kept from the previous memory state.
 Output gate $o$: controls the flow of outgoing information.
The graph below shows the operation of the LSTM cell:
When the input and output gates are closed, the activation is trapped in the memory cell.
Equations
We define the equations in the LSTM cell as follows:
 Input gate: $i_t=\phi(W^{ix}x_t+W^{ih}h_{t-1})$
 Forget gate: $f_t=\phi(W^{fx}x_t+W^{fh}h_{t-1})$
 Output gate: $o_t=\phi(W^{ox}x_t+W^{oh}h_{t-1})$
 Input node: $g_t=\tanh(W^{gx}x_t+W^{gh}h_{t-1})$
 Hidden node: $s_t=s_{t-1}.f_t+g_t.i_t$
 Memory state: $h_t=\tanh(s_t).o_t$
$\phi$ is a nonlinear activation function and the parameters $W^{.}$ are learned by the model.
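The six LSTM equations also map directly to code. A minimal sketch of one cell step ($\phi$ taken as the sigmoid; dimensions and random weights are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(5)
n_in, n_h = 3, 5
# one input->hidden and one hidden->hidden matrix per gate/node
params = {name: (rng.normal(scale=0.1, size=(n_h, n_in)),
                 rng.normal(scale=0.1, size=(n_h, n_h)))
          for name in ("i", "f", "o", "g")}

def lstm_step(x, s_prev, h_prev):
    def branch(name, fn):
        Wx, Wh = params[name]
        return fn(Wx @ x + Wh @ h_prev)
    i = branch("i", sigmoid)   # input gate
    f = branch("f", sigmoid)   # forget gate
    o = branch("o", sigmoid)   # output gate
    g = branch("g", np.tanh)   # input node
    s = s_prev * f + g * i     # hidden (cell) node
    h = np.tanh(s) * o         # memory state
    return s, h

s, h = np.zeros(n_h), np.zeros(n_h)
for x in rng.normal(size=(6, n_in)):
    s, h = lstm_step(x, s, h)
print(h.shape)  # (5,)
```

The additive update $s_t = s_{t-1}.f_t + g_t.i_t$ is the key design choice: unlike the repeated matrix products of a vanilla RNN, it lets gradients flow back through time with less attenuation.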
Pros & Cons
We can summarize the advantages and disadvantages of LSTM cells in 4 main points:

Advantages
 They are able to model long-term sequence dependencies.
 They are more robust to the short-memory problem than 'vanilla' RNNs, since the definition of the internal memory changes from $h_t=f(U.x_t+W.h_{t-1})$ to $h_t=\tanh(s_{t-1}.f_t+g_t.i_t).o_t$

Disadvantages
 They increase the computational complexity compared to vanilla RNNs, introducing more parameters to learn.
 The required memory is higher than that of 'vanilla' RNNs due to the presence of several memory cells.
5. Encoder & Decoder architecture
It is a sequential model consisting of two main parts:
 Encoder: the first part of the model processes the sequence and returns at the end an encoding vector of the whole series, called the context vector, which summarizes the information of the different inputs.
 Decoder: the context vector is then taken as the input of the decoder in order to make the predictions.
The diagram below illustrates the architecture of the model :
The encoder can be considered as a dimensionality reduction tool; as a matter of fact, the context vector $e_n$ is nothing else than the encoding of the input vectors $(in_0, in_1, ..., in_n)$, and the sum of the sizes of these vectors is much larger than that of $e_n$, hence the notion of dimensionality reduction.
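The two parts can be sketched as follows (a toy decoder with no input feeding and arbitrary dimensions; real architectures are richer): the encoder compresses the whole input sequence into one context vector, which the decoder then unrolls into predictions:

```python
import numpy as np

rng = np.random.default_rng(6)
n_in, n_h, n_out, T_in, T_out = 3, 5, 3, 7, 4
Ue = rng.normal(scale=0.1, size=(n_h, n_in))   # encoder input weights
We = rng.normal(scale=0.1, size=(n_h, n_h))    # encoder recurrent weights
Wd = rng.normal(scale=0.1, size=(n_h, n_h))    # decoder recurrent weights
Vd = rng.normal(scale=0.1, size=(n_out, n_h))  # decoder output weights

def encode(xs):
    h = np.zeros(n_h)
    for x in xs:
        h = np.tanh(Ue @ x + We @ h)
    return h  # the context vector: one fixed-size summary of the sequence

def decode(context, steps):
    h, outs = context, []
    for _ in range(steps):
        h = np.tanh(Wd @ h)  # minimal decoder: recurrence on the state only
        outs.append(Vd @ h)
    return np.array(outs)

context = encode(rng.normal(size=(T_in, n_in)))
out = decode(context, T_out)
print(context.shape, out.shape)  # (5,) (4, 3)
```

The dimensionality reduction is visible in the shapes: 7 input vectors of size 3 (21 numbers) are summarized by a single context vector of size 5.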
6. Attention mechanisms
Attention mechanisms were introduced to address the problem of memory limitation, and mainly answer the following two questions:
 What weight (importance) $\alpha_j$ is given to each output $e_j$ of the encoder?
 How can we overcome the limited memory of the encoder in order to be able to ‘remember’ more of the encoding process?
The mechanism inserts itself between the encoder and the decoder, and helps the decoder select, at each step $out_i$ of the decoding process, the encoded inputs that matter most, as follows:
Mathematical formalism
Keeping the same notation as before, we set $\alpha_{i,j}$ as the attention given by the output $i$, noted $out_i$, to the vector $e_j$. The attention is computed via a neural network which takes as inputs the vectors $(e_0, e_1, ..., e_n)$ and the previous memory state $h_{i-1}$; it is given by:
$\alpha_{i,j}=\frac{\exp(b_{i,j})}{\sum_{k=0}^n\exp(b_{i,k})}$
where $b_{i,j}=a(e_j, h_{i-1})=v.\tanh(w_1e_j+w_2h_{i-1})$, $a$ is the neural network, and $v$, $w_1$ and $w_2$ are learned by the architecture.
Finally, the input of the decoder is the weighted linear combination of the encoder’s outputs:
$c_i=\sum_{j=0}^n\alpha_{i,j}e_j$
Considering the definition of $\alpha_{i,j}$, we have:
$\forall i,~\sum_{j=0}^n\alpha_{i,j}=1$
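The two formulas above can be sketched directly (the dimensions and random parameters $v$, $w_1$, $w_2$ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 6, 4                      # n+1 encoder outputs e_0..e_n, each of size d
E = rng.normal(size=(n + 1, d))  # encoder outputs e_j, stacked row-wise
h_prev = rng.normal(size=d)      # previous decoder memory state h_{i-1}
w1 = rng.normal(scale=0.5, size=(d, d))
w2 = rng.normal(scale=0.5, size=(d, d))
v  = rng.normal(scale=0.5, size=d)

# scores b_{i,j} = v . tanh(w1 e_j + w2 h_{i-1}), then a softmax over j
b = np.array([v @ np.tanh(w1 @ e + w2 @ h_prev) for e in E])
alpha = np.exp(b) / np.exp(b).sum()  # attention weights, sum to 1
c = alpha @ E                        # context c_i = sum_j alpha_{i,j} e_j
print(round(float(alpha.sum()), 6))  # 1.0
```

The softmax guarantees the property $\sum_j \alpha_{i,j} = 1$, so $c_i$ is a weighted average of the encoder outputs rather than an arbitrary combination.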
Application: Translation Machine
The use of an attention mechanism makes it possible to visualize and interpret
what the model is doing internally, especially at the time of prediction.
For example, by plotting a ‘heatmap’ of the attention matrix of a translation system, we can see the words in the first language on which the model focuses to translate each word into the second language:
As illustrated above, when translating a word into English, the system focuses in particular on the corresponding French word.
Overlay of LSTM & Attention Mechanism
It is relevant to combine the two methods to improve internal memory, since the first one allows more elements of the past to be taken into account and the second one chooses to pay careful attention to them at the time of prediction.
The output $c_t$ of the attention mechanism is the new input of the LSTM cell, so the system of equations becomes as follows :
 Input gate: $i_t=\phi(W^{ic}c_t+W^{ih}h_{t-1})$
 Forget gate: $f_t=\phi(W^{fc}c_t+W^{fh}h_{t-1})$
 Output gate: $o_t=\phi(W^{oc}c_t+W^{oh}h_{t-1})$
 Input node: $g_t=\tanh(W^{gc}c_t+W^{gh}h_{t-1})$
 Hidden output: $s_t=s_{t-1}.f_t+g_t.i_t$
 Prediction output: $h_t=\tanh(s_t).o_t$
$\phi$ is a nonlinear activation function and the parameters $W^{.}$ are learned by the model.