Long Short-Term Memory Networks (LSTM)


Long Short-Term Memory networks (LSTMs) address the vanishing/exploding gradient problem that plagues recurrent neural networks by using memory cells and gating mechanisms. By controlling the flow of information, LSTMs can capture long-term dependencies, making them effective for sequential data tasks. Transforming a traditional RNN into an LSTM involves adding input, output, and forget gates, as well as a memory cell state. These architectural innovations enable far more effective learning and retention of sequential patterns.

  • LSTM
  • Deep Learning
  • Neural Networks
  • Sequential Data
  • Gating Mechanisms




Presentation Transcript


  1. Deep Learning Long Short-Term Memory Networks (LSTM)

  2. LSTM Motivation: Remember how we update an RNN? [Diagram: an RNN unrolled over the input "The cat sat": inputs x1-x3 feed hidden states h1-h3 through weights wx and wh (starting from h0), and h3 feeds a softmax output y3 via wy, which is scored by the cost.] [slides from Catherine Finegan-Dollak]

  3. The Vanishing Gradient Problem: Deep neural networks use backpropagation. Backpropagation uses the chain rule. The chain rule multiplies derivatives. Often these derivatives are between 0 and 1. As the chain gets longer, the products get smaller until they disappear. [Plot: the derivative of the sigmoid function (Wolfram|Alpha).]
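
To make the chain-rule argument concrete, here is a minimal numerical sketch (my own illustration, not from the slides; the 50-step chain and the random pre-activations are made up). The sigmoid derivative never exceeds 0.25, so a product of many such factors collapses toward zero:

    # Illustrative only: the sigmoid derivative is at most 0.25, so a
    # chain-rule product of many such factors shrinks toward zero.
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_derivative(z):
        s = sigmoid(z)
        return s * (1.0 - s)

    rng = np.random.default_rng(0)
    pre_activations = rng.normal(size=50)          # hypothetical pre-activations along a 50-step chain
    factors = sigmoid_derivative(pre_activations)  # each factor lies in (0, 0.25]

    grad = 1.0
    for t, d in enumerate(factors, start=1):
        grad *= d
        if t in (5, 10, 25, 50):
            print(f"after {t:2d} steps: gradient factor ~ {grad:.3e}")
    # The printed values collapse toward zero: the vanishing gradient.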

  4. Or do they explode? With derivatives larger than 1, you encounter the opposite problem: the products become larger and larger as the chain becomes longer and longer, causing overlarge updates to the parameters. This is the exploding gradient problem.
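
A companion sketch of the opposite case (again purely illustrative): when the per-step factors exceed 1, the same chain-rule product grows geometrically instead of shrinking:

    # Illustrative only: per-step factors of magnitude 1.5 over a 50-step chain.
    import numpy as np

    factors = np.full(50, 1.5)        # hypothetical per-step derivatives larger than 1
    products = np.cumprod(factors)
    for t in (5, 10, 25, 50):
        print(f"after {t:2d} steps: gradient factor ~ {products[t - 1]:.3e}")
    # The factor grows geometrically (1.5**50 is roughly 6e8): the exploding gradient.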

  5. Vanishing/Exploding Gradients Are Bad. If we cannot backpropagate very far through the network, the network cannot learn long-term dependencies. My dog [chase/chases] squirrels. vs. My dog, whom I adopted in 2009, [chase/chases] squirrels.

  6. LSTM Solution: Use a memory cell to store information at each time step, and use gates to control the flow of information through the network. Input gate: protects the current step from irrelevant inputs. Output gate: prevents the current step from passing irrelevant outputs to later steps. Forget gate: limits the information passed from one cell state to the next.

  7. Transforming RNN to LSTM: the plain RNN update u_t = σ(W_h h_{t-1} + W_x x_t) [Diagram: a single step computing u1 from h0 (via wh) and x1 (via wx).]
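
As a reference point, a minimal sketch of this plain RNN step (the layer sizes, random weights, and omitted bias are my own simplifications, not from the slides):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    hidden_size, input_size = 4, 3
    rng = np.random.default_rng(0)
    W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # recurrent weights (wh)
    W_x = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input weights (wx)

    def rnn_step(h_prev, x_t):
        """One plain RNN step: the new state depends only on h_{t-1} and x_t.
        (tanh is an equally common choice of nonlinearity here.)"""
        return sigmoid(W_h @ h_prev + W_x @ x_t)

    h0 = np.zeros(hidden_size)
    x1 = rng.normal(size=input_size)
    u1 = rnn_step(h0, x1)
    print(u1.round(3))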

  8. Transforming RNN to LSTM [Diagram: the same step with a separate memory cell state c0 added alongside h0.]

  9. Transforming RNN to LSTM: c_t = f_t ⊙ c_{t-1} + i_t ⊙ u_t, where ⊙ is element-wise multiplication [Diagram: the forget gate f1 scales the previous cell state c0, the input gate i1 scales the candidate u1, and their sum is the new cell state c1.]

  10. Transforming RNN to LSTM (the same cell-state update; the diagram is built up step by step)

  11. Transforming RNN to LSTM (diagram build-up continued)
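
The cell-state update on its own, with hand-picked gate values to show how the gates act as soft element-wise switches (the numbers are purely illustrative):

    import numpy as np

    c_prev = np.array([ 0.8, -0.5,  0.3])   # previous cell state c_{t-1}
    u_t    = np.array([ 0.9,  0.2, -0.7])   # candidate update u_t
    f_t    = np.array([ 1.0,  0.0,  0.5])   # forget gate: keep, erase, halve
    i_t    = np.array([ 0.0,  1.0,  0.5])   # input gate: block, admit, admit half

    c_t = f_t * c_prev + i_t * u_t          # c_t = f_t (*) c_{t-1} + i_t (*) u_t
    print(c_t)                              # -> [ 0.8  0.2 -0.2]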

  12. Transforming RNN to LSTM: f_t = σ(W_hf h_{t-1} + W_xf x_t) [Diagram: the forget gate f1 is computed from h0 (via whf) and x1 (via wxf).]

  13. Transforming RNN to LSTM: i_t = σ(W_hi h_{t-1} + W_xi x_t) [Diagram: the input gate i1 is computed from h0 (via whi) and x1 (via wxi).]

  14. Transforming RNN to LSTM: h_t = o_t ⊙ tanh(c_t) [Diagram: an output gate o1 is added; the new hidden state h1 is o1 ⊙ tanh(c1).]
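
Putting the pieces from slides 7-14 together, here is a self-contained sketch of one full LSTM step (the weight layout, sizes, and added bias terms are assumptions of this sketch, not taken from the slides):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    hidden_size, input_size = 4, 3
    rng = np.random.default_rng(0)

    def init(rows, cols):
        return rng.normal(scale=0.1, size=(rows, cols))

    # Separate recurrent (W_h*) and input (W_x*) weights per gate, as in the slides.
    W_hf, W_xf, b_f = init(hidden_size, hidden_size), init(hidden_size, input_size), np.zeros(hidden_size)
    W_hi, W_xi, b_i = init(hidden_size, hidden_size), init(hidden_size, input_size), np.zeros(hidden_size)
    W_ho, W_xo, b_o = init(hidden_size, hidden_size), init(hidden_size, input_size), np.zeros(hidden_size)
    W_hu, W_xu, b_u = init(hidden_size, hidden_size), init(hidden_size, input_size), np.zeros(hidden_size)

    def lstm_step(h_prev, c_prev, x_t):
        f_t = sigmoid(W_hf @ h_prev + W_xf @ x_t + b_f)   # forget gate
        i_t = sigmoid(W_hi @ h_prev + W_xi @ x_t + b_i)   # input gate
        o_t = sigmoid(W_ho @ h_prev + W_xo @ x_t + b_o)   # output gate
        u_t = np.tanh(W_hu @ h_prev + W_xu @ x_t + b_u)   # candidate update
        c_t = f_t * c_prev + i_t * u_t                    # new cell state
        h_t = o_t * np.tanh(c_t)                          # new hidden state
        return h_t, c_t

    h0, c0 = np.zeros(hidden_size), np.zeros(hidden_size)
    x1 = rng.normal(size=input_size)
    h1, c1 = lstm_step(h0, c0, x1)
    print(h1.round(3), c1.round(3))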

  15. LSTM for Sequences [Diagram: the LSTM cell unrolled over the input "The cat sat": each step computes its gates f, i, o and candidate u, updates the cell state c0 → c1 → c2, and emits hidden states h1 and h2.]
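
In practice one rarely writes the unrolled loop by hand. Here is a sketch of running an LSTM over a short sequence with PyTorch's nn.LSTM (the toy shapes and the choice of PyTorch are mine, not the slides'):

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    embedding_dim, hidden_size, seq_len = 8, 16, 3   # e.g. "The cat sat" -> 3 steps

    lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_size, batch_first=True)
    x = torch.randn(1, seq_len, embedding_dim)       # a batch of one sequence of 3 token vectors

    outputs, (h_n, c_n) = lstm(x)                    # outputs: the hidden state at every step
    print(outputs.shape)                             # torch.Size([1, 3, 16])
    print(h_n.shape, c_n.shape)                      # final hidden and cell states: [1, 1, 16] each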

  16. LSTM Applications
     • Language identification (Gonzalez-Dominguez et al., 2014)
     • Paraphrase detection (Cheng & Kartsaklis, 2015)
     • Speech recognition (Graves, Mohamed, & Hinton, 2013)
     • Handwriting recognition (Graves & Schmidhuber, 2009; demo at http://www.cs.toronto.edu/~graves/handwriting.html)
     • Music composition (Eck & Schmidhuber, 2002) and lyric generation (Potash, Romanov, & Rumshisky, 2015)
     • Robot control (Mayer et al., 2008)
     • Natural language generation (Wen et al., 2015; best paper at EMNLP)
     • Named entity recognition (Hammerton, 2003)

  17. Related Architectures: GRU [Diagram: a GRU cell with update gate z1 and reset gate r1; the new state h1 interpolates between h0 and a candidate state, weighted by z1 and 1 - z1.] Chung et al. (2014) report performance comparable to the LSTM.
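
A hedged sketch of one GRU step as described by Chung et al. (2014), with made-up sizes and biases omitted: the GRU merges the LSTM's cell and hidden state and uses update (z) and reset (r) gates:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    hidden_size, input_size = 4, 3
    rng = np.random.default_rng(0)

    def init(rows, cols):
        return rng.normal(scale=0.1, size=(rows, cols))

    W_hz, W_xz = init(hidden_size, hidden_size), init(hidden_size, input_size)
    W_hr, W_xr = init(hidden_size, hidden_size), init(hidden_size, input_size)
    W_hu, W_xu = init(hidden_size, hidden_size), init(hidden_size, input_size)

    def gru_step(h_prev, x_t):
        z_t = sigmoid(W_hz @ h_prev + W_xz @ x_t)            # update gate
        r_t = sigmoid(W_hr @ h_prev + W_xr @ x_t)            # reset gate
        u_t = np.tanh(W_hu @ (r_t * h_prev) + W_xu @ x_t)    # candidate state
        return (1.0 - z_t) * h_prev + z_t * u_t              # interpolate old and new state

    h1 = gru_step(np.zeros(hidden_size), rng.normal(size=input_size))
    print(h1.round(3))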

  18. Related Architectures: Tree LSTMs [Diagram: a Tree-LSTM node combines the states of its children (c0, h0) and (c1, h1), each through its own forget gate f, with an input gate i, output gate o, and a candidate u2 computed from the input x2, to produce the node's state (c2, h2).] Tai, Socher, & Manning (2015)

  19. External Links: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
