Contextual Embeddings in Natural Language Processing

Introduction to Transformers

Dive into the evolution of NLP with transformers, attention mechanisms, and contextual embeddings. Learn how contextual embeddings address the limitations of static word embeddings, providing a nuanced understanding of word meaning in varying contexts.

  • NLP
  • Transformers
  • Embeddings
  • Attention
  • Language Modeling




Presentation Transcript


  1. Introduction to Transformers

  2. LLMs are built out of transformers. Transformer: a specific kind of network architecture, like a fancier feedforward network, but based on attention.

  3. A very approximate timeline: 1990 Static Word Embeddings; 2003 Neural Language Model; 2008 Multi-Task Learning; 2015 Attention; 2017 Transformer; 2018 Contextual Word Embeddings and Pretraining; 2019 Prompting.

  4. Attention

  5. Instead of starting with the big picture, let's consider the embeddings for an individual word from a particular layer. [Figure: a transformer language model. Input tokens "So long and thanks for" pass through input encoding (token embeddings E plus position embeddings 1-5) to give x1 ... x5, then through stacked transformer blocks, then a language modeling head (matrix U producing logits) that predicts the next tokens "long and thanks for all".]

  6. Problem with static embeddings (word2vec): they are static! The embedding for a word doesn't reflect how its meaning changes in context. "The chicken didn't cross the road because it was too tired." What is the meaning represented in the static embedding for "it"?

  7. Contextual Embeddings. Intuition: a representation of the meaning of a word should be different in different contexts! Contextual embedding: each word has a different vector that expresses different meanings depending on the surrounding words. How do we compute contextual embeddings? Attention.

  8. Contextual Embeddings. "The chicken didn't cross the road because it ..." What should be the properties of "it"? The chicken didn't cross the road because it was too tired. The chicken didn't cross the road because it was too wide. At this point in the sentence, "it" is probably referring to either the chicken or the road.

  9. Intuition of attention. Build up the contextual embedding of a word by selectively integrating information from all the neighboring words. We say that a word "attends to" some neighboring words more than others.

  10. Intuition of attention: test

  11. Attention definition A mechanism for helping compute the embedding for a token by selectively attending to and integrating information from surrounding tokens (at the previous layer). More formally: a method for doing a weighted sum of vectors.

  12. Attention is left-to-right. [Figure: a causal self-attention layer; inputs x1 ... x5 produce outputs a1 ... a5, where each ai attends only to x1 through xi.]

  13. Self-attention more formally (excerpt from "10.1 The Transformer: A Self-Attention Network"). Figure 10.2 shows information flow in a causal (or masked) self-attention model: in processing each element of the sequence, the model attends to all the inputs up to, and including, the current one. Unlike RNNs, the computations at each time step are independent of all the other steps and can therefore be performed in parallel.

  We've given the intuition of self-attention (as a way to compute representations of a word at a given layer by integrating information from words at the previous layer), and we've defined context as all the prior words in the input. Let's now introduce the self-attention computation itself.

  The core intuition of attention is the idea of comparing an item of interest to a collection of other items in a way that reveals their relevance in the current context. In the case of self-attention for language, the set of comparisons are to other words (or tokens) within a given sequence. The result of these comparisons is then used to compute an output sequence for the current input sequence. For example, returning to Fig. 10.2, the computation of a3 is based on a set of comparisons between the input x3 and its preceding elements x1 and x2, and to x3 itself.

  Simplified version of attention: a sum of prior words weighted by their similarity with the current word. Given a sequence of token embeddings x1, x2, ..., xi, produce ai, a weighted sum of x1 through xi, weighted by their similarity to xi.

  How shall we compare words to other words? Since our representations for words are vectors, we'll make use of our old friend the dot product that we used for computing word similarity in Chapter 6, and which also played a role in attention in Chapter 9. Let's refer to the result of this comparison between words i and j as a score (we'll be updating this equation to add attention to the computation of this score):

  Version 1:  score(x_i, x_j) = x_i · x_j    (10.4)

  The result of a dot product is a scalar value ranging from −∞ to ∞; the larger the value, the more similar the vectors being compared. Continuing with our example, the first step in computing a3 would be to compute three scores: x3 · x1, x3 · x2, and x3 · x3. Then, to make effective use of these scores, we normalize them with a softmax to create a vector of weights α_ij that indicates the proportional relevance of each input to the input element i that is the current focus of attention:

  α_ij = softmax(score(x_i, x_j))  ∀ j ≤ i
       = exp(score(x_i, x_j)) / Σ_{k=1}^{i} exp(score(x_i, x_k))    (10.5)

  Of course, the softmax weight will likely be highest for the current focus element i, since x_i is very similar to itself, resulting in a high dot product. But other context words may also be similar to i, and the softmax will also assign some weight to those words. Given the proportional scores in α, we generate an output value a_i by summing the inputs weighted by those scores:

  a_i = Σ_{j ≤ i} α_ij x_j    (10.6)
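A minimal NumPy sketch of this simplified attention (equations 10.4-10.6); the token embeddings and sizes below are illustrative stand-ins, not values from the slides:

```python
import numpy as np

def simple_causal_attention(X):
    """Simplified attention (eqs. 10.4-10.6): for each position i, a_i is a sum of
    x_1..x_i weighted by their dot-product similarity to x_i."""
    N, d = X.shape
    A = np.zeros_like(X)
    for i in range(N):
        scores = X[: i + 1] @ X[i]                # score(x_i, x_j) = x_i . x_j for j <= i
        weights = np.exp(scores - scores.max())   # softmax over the prefix (eq. 10.5)
        weights /= weights.sum()
        A[i] = weights @ X[: i + 1]               # a_i = sum_j alpha_ij x_j (eq. 10.6)
    return A

X = np.random.randn(5, 8)                         # 5 tokens, embedding dimension d = 8
print(simple_causal_attention(X).shape)           # (5, 8)
```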

  14. Intuition of attention: test. [Figure: a sequence of token embeddings x1 ... x7 followed by the current token xi.]

  15. An Actual Attention Head: slightly more complicated. High-level idea: instead of using vectors (like xi and x4) directly, we'll represent 3 separate roles each vector xi plays. Query: as the current element being compared to the preceding inputs. Key: as a preceding input that is being compared to the current element to determine a similarity. Value: as a value of a preceding element that gets weighted and summed.

  16. Attention intuition. [Figure: the current token xi poses a query; the preceding tokens x1 ... x7 provide the values that get weighted and summed.]

  17. Intuition of attention. [Figure: xi's query is compared against a key (k) for each of x1 ... x7; each of those tokens also provides a value (v) that gets weighted and summed.]

  18. An Actual Attention Head: slightly more complicated. We'll use matrices to project each vector xi into a representation of each of its roles as query, key, and value: query: WQ; key: WK; value: WV.

  19. An Actual Attention Head: slightly more complicated. Given these 3 representations of xi: to compute the similarity of the current element xi with some prior element xj, we'll use the dot product between qi and kj. And instead of summing up the xj, we'll sum up the vj.

  20. Final equations for one attention head

  21. Calculating the value of a3. [Figure, read bottom to top: 1. Generate key, query, and value vectors for x1, x2, x3 using Wk, Wq, Wv. 2. Compare x3's query with the keys for x1, x2, and x3. 3. Divide each score by √dk. 4. Turn the scaled scores into weights α3,1, α3,2, α3,3 via softmax. 5. Weigh each value vector by its weight. 6. Sum the weighted value vectors to produce a3, the output of self-attention.]
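A sketch of the same computation in NumPy, with randomly initialized projection matrices standing in for learned Wq, Wk, Wv (the function and variable names are ours, not from the slides):

```python
import numpy as np

def attention_head(X, Wq, Wk, Wv):
    """One causal attention head: project each x_i into query/key/value roles,
    score q_i against the keys k_1..k_i, scale by sqrt(d_k), softmax, and take
    the weighted sum of the value vectors v_1..v_i."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv               # shapes [N, d_k], [N, d_k], [N, d_v]
    d_k = K.shape[-1]
    out = np.zeros((X.shape[0], V.shape[-1]))
    for i in range(X.shape[0]):
        scores = K[: i + 1] @ Q[i] / np.sqrt(d_k)  # steps 1-3 on the slide
        alpha = np.exp(scores - scores.max())      # step 4: softmax weights alpha_{i,j}
        alpha /= alpha.sum()
        out[i] = alpha @ V[: i + 1]                # steps 5-6: weigh and sum the values
    return out

d, d_k = 8, 4
X = np.random.randn(3, d)                          # x1, x2, x3
Wq, Wk, Wv = [np.random.randn(d, d_k) for _ in range(3)]
a3 = attention_head(X, Wq, Wk, Wv)[2]              # the value of a3 from the slide
```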

  22. Actual Attention: slightly more complicated. Instead of one attention head, we'll have lots of them! Intuition: each head might be attending to the context for different purposes: different linguistic relationships or patterns in the context.

  23. Multi-head attention
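A rough sketch of multi-head attention, reusing the attention_head function from the earlier sketch; the head count, head size, and output projection Wo here are illustrative assumptions:

```python
import numpy as np

def multi_head_attention(X, heads, Wo):
    """Run several attention heads over the same input; each head has its own
    (Wq, Wk, Wv). Concatenate the per-head outputs and project back to d with Wo."""
    head_outs = [attention_head(X, Wq, Wk, Wv) for (Wq, Wk, Wv) in heads]
    return np.concatenate(head_outs, axis=-1) @ Wo   # [N, h*d_v] @ [h*d_v, d] -> [N, d]

d, d_k, h = 8, 4, 2
heads = [tuple(np.random.randn(d, d_k) for _ in range(3)) for _ in range(h)]
Wo = np.random.randn(h * d_k, d)
X = np.random.randn(5, d)
print(multi_head_attention(X, heads, Wo).shape)      # (5, 8)
```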

  24. Summary. Attention is a method for enriching the representation of a token by incorporating contextual information. The result: the embedding for each word will be different in different contexts! Contextual embeddings: a representation of word meaning in its context. We'll see in the next lecture that attention can also be viewed as a way to move information from one token to another.

  25. Attention

  26. The Transformer Block

  27. Reminder: transformer language model. [Figure: input tokens "So long and thanks for" pass through input encoding (token embeddings E plus position embeddings 1-5) to give x1 ... x5, then through stacked transformer blocks, then a language modeling head (U, logits) that predicts the next tokens "long and thanks for all".]

  28. The residual stream: each token gets passed up and modified

  29. We'll need nonlinearities, so a feedforward layer

  30. Layer norm: the vector xi is normalized twice

  31. Layer Norm. Layer norm is a variation of the z-score from statistics, applied to a single vector in a hidden layer.
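A minimal sketch of layer norm as that z-score, with learnable gain (gamma) and offset (beta) parameters (names and sizes are illustrative):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer norm as a z-score over one hidden vector: subtract the mean, divide
    by the standard deviation, then rescale and shift with learned gamma and beta."""
    return gamma * (x - x.mean()) / (x.std() + eps) + beta

x = np.random.randn(8)                             # one token's hidden vector
print(layer_norm(x, gamma=np.ones(8), beta=np.zeros(8)))
```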

  32. Putting together a single transformer block
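The slide assembles the block as a figure; below is a minimal single-head, pre-norm sketch (real blocks use multi-head attention, and details such as pre- vs. post-norm and bias terms vary), reusing layer_norm and attention_head from the sketches above:

```python
import numpy as np

def feedforward(x, W1, W2):
    """Position-wise feedforward layer: linear, ReLU nonlinearity, linear."""
    return np.maximum(0, x @ W1) @ W2

def transformer_block(X, Wq, Wk, Wv, W1, W2):
    """One block over the residual stream X (shape [N, d]): an attention sublayer
    and a feedforward sublayer, each wrapped in layer norm and added back to the
    stream through a residual connection (pre-norm arrangement)."""
    Xn = np.array([layer_norm(x, 1.0, 0.0) for x in X])   # layer_norm from the sketch above
    X = X + attention_head(Xn, Wq, Wk, Wv)                # attention_head from earlier; Wv maps to d
    Xn = np.array([layer_norm(x, 1.0, 0.0) for x in X])
    X = X + np.array([feedforward(x, W1, W2) for x in Xn])
    return X                                              # same shape [N, d], so blocks can be stacked

d, d_ff = 8, 32
X = np.random.randn(5, d)
Wq, Wk, Wv = [np.random.randn(d, d) for _ in range(3)]
W1, W2 = np.random.randn(d, d_ff), np.random.randn(d_ff, d)
print(transformer_block(X, Wq, Wk, Wv, W1, W2).shape)     # (5, 8)
```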

  33. A transformer is a stack of these blocks, so all the vectors are of the same dimensionality d. [Figure: Block 1 feeding into Block 2.]

  34. Residual streams and attention. Notice that all parts of the transformer block apply to one residual stream (one token), except attention, which takes information from other tokens. Elhage et al. (2021) show that we can view attention heads as literally "moving information from the residual stream of a neighboring token into the current stream".

  35. The Transformer Block

  36. Parallelizing Attention Computation

  37. Parallelizing computation using X. For the attention/transformer block we've been computing a single output at a single time step i in a single residual stream. But we can pack the N tokens of the input sequence into a single matrix X of size [N × d]. Each row of X is the embedding of one token of the input. X can have 1K-32K rows, each of the dimensionality of the embedding d (the model dimension).

  38. QK^T: now we can do a single matrix multiply to combine Q and K^T.

  39. Parallelizing attention. Scale the scores, take the softmax, and then multiply the result by V, resulting in a matrix of shape [N × d]: an attention vector for each input token.

  40. Masking out the future. What is this mask function? QK^T has a score for each query dotted with every key, including keys that follow the query. Guessing the next word is pretty simple if you already know it!

  41. Masking out the future. Add −∞ to the cells in the upper triangle; the softmax will turn them into 0.
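A sketch of the parallel, masked computation softmax(mask(QK^T / sqrt(d_k))) V in NumPy; the matrix sizes and random weights are illustrative:

```python
import numpy as np

def parallel_causal_attention(X, Wq, Wk, Wv):
    """All positions at once: QK^T is an [N, N] score matrix; put -inf in the upper
    triangle so the softmax gives zero weight to future positions; multiply by V
    to get one attention output per input token."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])                   # [N, N]
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)    # True strictly above the diagonal
    scores = np.where(mask, -np.inf, scores)                  # mask out the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)            # row-wise softmax
    return weights @ V                                        # [N, d_v]

N, d = 5, 8
X = np.random.randn(N, d)
Wq, Wk, Wv = [np.random.randn(d, d) for _ in range(3)]
print(parallel_causal_attention(X, Wq, Wk, Wv).shape)         # (5, 8)
```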

  42. Another point: Attention is quadratic in length

  43. Attention again

  44. Parallelizing Multi-head Attention

  45. Parallelizing Multi-head Attention (continued)

  46. Parallelizing Attention Computation

  47. Input and output: Position embeddings and the Language Model Head

  48. Token and Position Embeddings. The matrix X (of shape [N × d]) has an embedding for each word in the context. This embedding is created by adding two distinct embeddings for each input: a token embedding and a positional embedding.

  49. Token Embeddings. The embedding matrix E has shape [|V| × d]: one row for each of the |V| tokens in the vocabulary, and each row is a vector of d dimensions. Given the string "Thanks for all the": 1. Tokenize with BPE and convert into vocab indices w = [5, 4000, 10532, 2224]. 2. Select the corresponding rows from E, each row an embedding (row 5, row 4000, row 10532, row 2224).

  50. Position Embeddings. There are many methods, but we'll just describe the simplest: absolute position. Goal: learn a position embedding matrix Epos of shape [N × d], one d-dimensional embedding for each position up to some maximum length N. Start with randomly initialized embeddings, one for each integer position up to that maximum length; i.e., just as we have an embedding for the token "fish", we'll have an embedding for position 3 and position 17. As with word embeddings, these position embeddings are learned along with the other parameters during training.
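A minimal sketch of building the input matrix X from token and position embeddings; the sizes are toy values and the randomly initialized tables stand in for learned parameters:

```python
import numpy as np

# Toy sizes and random (untrained) tables; the vocab indices are the ones from
# the "Thanks for all the" example above.
vocab_size, d, max_len = 50000, 8, 1024
E = np.random.randn(vocab_size, d)        # token embedding matrix, shape [|V|, d]
E_pos = np.random.randn(max_len, d)       # learned absolute position embeddings, one per position

w = [5, 4000, 10532, 2224]                # BPE token indices for "Thanks for all the"
X = E[w] + E_pos[: len(w)]                # token embedding + position embedding for each input
print(X.shape)                            # (4, 8) -- one row per token, dimension d
```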
