# Recurrent Neural Networks

So far we have only discussed **static** datasets:

- **Static** datasets do not change over time.
- **Dynamic** datasets change over time and require special handling.

To handle **dynamic datasets**, two main approaches exist:

- **Memoryless models**: take only the current input into account, with no memory or retention of past inputs or outputs.
	- Autoregressive Models
	- Feedforward Neural Networks
- **Models with memory**: Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are a type of neural network designed for processing sequential data, such as:

- time series
- natural language
- audio

Their "recurrent" component allows the same operations to be applied consistently across the input sequence. An RNN has a "context" network which acts as a memory; in principle, this recurrent part of the network acts as an infinite memory. This is achieved through hidden states that effectively "remember" past inputs.

How is this implemented in a neural network? A hidden layer contains recurrent units (neurons) that recall past inputs, which is essential for sequential data processing. **The hidden states keep evolving with each input in the sequence**: they get updated to reflect the information accumulated from all past inputs up to that point.

![](images/e28a04dca244d4509fcb23a9f26f0964.png)

## BPTT

Backpropagation Through Time (BPTT) is an adaptation of the standard backpropagation algorithm designed for sequential data.

In usual neural networks, backpropagation calculates the gradient of the loss function with respect to each weight, determining how a change in that weight influences the total error, and adjusts the weights to minimize this error. But in the unique structure of RNNs, not just the current input but also previous ones influence the output: the temporal dependencies introduced this way complicate the training process.

> "When you have a loop, you can always unroll its iterations"

**BPTT** unrolls the RNN over the time steps of the sequence. Each time step is treated as a distinct layer in a deep network, transforming the RNN into a standard deep neural network in which each layer represents a time step of the original RNN.

Here's a step-by-step overview of BPTT:

1. **Forward Pass**: Just like in standard backpropagation, BPTT begins with a forward pass through the network. For an RNN, this means processing the entire sequence from the first time step to the last, updating the hidden states based on the recurrent connections at each step.
2. **Compute Loss**
3. **Backward Pass**: Next, the gradients of this loss are propagated backward through the network. Unlike standard backpropagation, this involves going backward through each time step, accumulating gradients as we move back in time. **Since the same weights are used at each time step, the gradients from all time steps are summed up.**\*
4. **Update Weights**

\* To spell out "the same weights are used at each time step, the gradients from all time steps are summed up": the weights are shared across all time steps, so the loss gradients computed at each time step are summed, and this **cumulative gradient** is then used to update the weights (see the sketch below).

There is, however, a potential issue with this **cumulative gradient**: if the sequence is very long, the gradients can become very small (vanish) or very large (explode). This problem is particularly pronounced in RNNs unrolled over many time steps.
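Below is a minimal numerical sketch of BPTT on a vanilla RNN, assuming a tanh hidden state, a linear readout at the last time step, and a squared-error loss. All names (`Wxh`, `Whh`, `Why`, the toy data) are illustrative, not from any library; the point is to show the unrolled forward pass and how the gradients for the shared weights accumulate across time steps.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 3, 5, 8

# Shared weights, reused at every time step
Wxh = rng.normal(scale=0.1, size=(hidden_size, input_size))
Whh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
Why = rng.normal(scale=0.1, size=(1, hidden_size))

xs = rng.normal(size=(seq_len, input_size))   # toy input sequence
target = np.array([1.0])                      # toy target for the last output

# ---- Forward pass: unroll over time, keeping every hidden state ----
hs = [np.zeros(hidden_size)]
for t in range(seq_len):
    hs.append(np.tanh(Wxh @ xs[t] + Whh @ hs[-1]))
y = Why @ hs[-1]
loss = 0.5 * np.sum((y - target) ** 2)

# ---- Backward pass: walk back through time, summing gradients ----
dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
dy = y - target
dWhy += np.outer(dy, hs[-1])
dh = Why.T @ dy                               # gradient flowing into the last hidden state
for t in reversed(range(seq_len)):
    draw = (1.0 - hs[t + 1] ** 2) * dh        # backprop through tanh
    dWxh += np.outer(draw, xs[t])             # same weights -> gradients accumulate
    dWhh += np.outer(draw, hs[t])
    dh = Whh.T @ draw                         # propagate to the previous time step

# ---- Update the shared weights with the cumulative gradients ----
lr = 0.1
Wxh -= lr * dWxh
Whh -= lr * dWhh
Why -= lr * dWhy
print(f"loss before update: {loss:.4f}")
```

Note how `dh` is repeatedly multiplied by `Whh.T` and the tanh derivative as we move back in time: for long sequences this product is exactly what makes the gradients vanish or explode.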
So, at the end:

- In scenarios where the task does not require retaining long-term context, RNNs are often preferred due to their simplicity and computational efficiency.
- If the task needs long-term context, **Long Short-Term Memory networks** (LSTMs) are the superior option.

## LSTM

**Long Short-Term Memory (LSTM) Networks** are specifically designed to address the vanishing gradient problem. LSTMs achieve this through a specialized architecture that includes memory cells and gates, allowing them to maintain a **balance between long-term memory and short-term** inputs.

The fundamental building block of an LSTM network is the memory cell. In an LSTM network, each unit in the hidden layer is replaced by an LSTM unit or cell. These cells are designed to store information for long periods. Each cell can add or remove information, controlled through structures called gates.

There are three types of **gates in an LSTM**:

- **Input Gate**: Determines how much of the new information should be added to the cell state.
- **Forget Gate**: Decides what information should be discarded from the cell state.
- **Output Gate**: Determines the next hidden state, which is a filtered version of the cell state and will be used for predictions and for transferring information to the next time step.

To summarize:

- The gates are not physical gates but mathematical functions that regulate the flow of information in and out of the cell state of each LSTM unit.
- The idea is to have a gate mechanism through which it is possible to "write", "keep" or "read" information in the memory cells whenever the gate is turned on.
- These gates operate as filters, selectively allowing information to pass through or be modified, which makes LSTMs capable of handling complex tasks like language translation, where understanding context and remembering relevant information over time is crucial.

A slightly more dramatic variation on the LSTM is the Gated Recurrent Unit, or GRU, introduced by [Cho, et al. (2014)](http://arxiv.org/pdf/1406.1078v3.pdf). It combines the forget and input gates into a single "update gate". It also merges the cell state and hidden state, and makes some other changes. The resulting model is simpler than standard LSTM models and has been growing increasingly popular.

### BiLSTM

Bidirectional Long Short-Term Memory (BiLSTM) Neural Network

![](images/486a74f54a20582e0ff8d4ddcb5edf24.png)

BiLSTM, or Bidirectional Long Short-Term Memory, is an extension of the traditional LSTM with **bidirectional processing**: unlike standard LSTMs, which process data in the forward direction (from past to future), BiLSTMs process data in both the forward and backward directions. This means that for any given point in the input sequence, the BiLSTM has contextual information from both the past and the future relative to that point.

The BiLSTM does this simply by using two LSTM layers that are run in parallel:

- one processes the data from start to end (forward pass),
- and the other processes the data from end to start (backward pass).

The outputs of these two layers are then combined at each time step, as in the sketch below.
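The following is a minimal sketch of a single LSTM cell step with the standard gate formulation (sigmoid gates, tanh candidate), followed by a toy bidirectional run. The weight names and shapes are illustrative, and for brevity the two directions reuse the same weights, whereas a real BiLSTM has a separate set per direction.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One time step of an LSTM cell.

    W: dict of weight matrices, each of shape (hidden, input + hidden)
    b: dict of bias vectors, each of shape (hidden,)
    """
    z = np.concatenate([x, h_prev])       # current input + previous hidden state
    f = sigmoid(W["f"] @ z + b["f"])      # forget gate: what to discard from the cell state
    i = sigmoid(W["i"] @ z + b["i"])      # input gate: how much new information to add
    g = np.tanh(W["g"] @ z + b["g"])      # candidate cell content
    o = sigmoid(W["o"] @ z + b["o"])      # output gate: what to expose as the hidden state
    c = f * c_prev + i * g                # updated cell state (long-term memory)
    h = o * np.tanh(c)                    # updated hidden state (short-term output)
    return h, c

# Toy usage: run the cell over a sequence in both directions, BiLSTM-style,
# and concatenate the two hidden states at each time step.
rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 4, 6, 5
W = {k: rng.normal(scale=0.1, size=(hidden_size, input_size + hidden_size)) for k in "figo"}
b = {k: np.zeros(hidden_size) for k in "figo"}
xs = rng.normal(size=(seq_len, input_size))

def run(seq):
    h, c, outs = np.zeros(hidden_size), np.zeros(hidden_size), []
    for x in seq:
        h, c = lstm_step(x, h, c, W, b)
        outs.append(h)
    return outs

forward = run(xs)                   # past -> future
backward = run(xs[::-1])[::-1]      # future -> past, re-aligned to time order
bi_outputs = [np.concatenate([f_t, b_t]) for f_t, b_t in zip(forward, backward)]
print(bi_outputs[0].shape)          # (2 * hidden_size,) per time step
```

The cell state update `c = f * c_prev + i * g` is the key to fighting vanishing gradients: when the forget gate stays close to 1, information (and its gradient) can flow across many time steps largely unchanged.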