
Understanding Transformer Architecture for Students
Explore the workings of the Transformer model, including Encoder, Decoder, Multi-Head Attention, and more. Dive into examples and visual representations to grasp key concepts easily. Perfect for students eager to learn about advanced neural network structures.
Presentation Transcript
Transformer overview
(Figure: the Transformer is made of two main blocks, an Encoder and a Decoder.)
How does the transformer work?
(Figure: a sentence such as "I am a student" is processed by an encoder and a decoder.)
(Figure: the encoder and decoder are actually stacks of several encoders and decoders; the sentence "I am a student" flows through the encoder stack and then the decoder stack.)
Encoder
(Figure: each encoder has two sub-layers: Multi-Head Attention followed by Feed Forward.)
Multi-Head Attention
(Figure: the input X goes through h attention heads in parallel, producing Z1, Z2, ..., Zh.)
Scaled Dot-Product Attention
(Figure: three Linear layers project the input X to Q, K and V; attention is then MatMul(Q, K) → Scale → SoftMax → MatMul with V, producing Z.)
Example
Input X (one 4-dimensional embedding per row):
  X1 = [1 0 1 0], X2 = [0 2 0 2], X3 = [1 1 1 1]
Weight matrices (4 x 3):
  WQ = [[1 0 1], [1 0 0], [0 0 1], [0 1 1]]
  WK = [[0 0 1], [1 1 0], [0 1 0], [1 1 0]]
  WV = [[0 2 0], [0 3 0], [1 0 3], [1 1 0]]
Linear projections:
  Q = X·WQ = [[1 0 2], [2 2 2], [2 1 3]]
  K = X·WK = [[0 1 1], [4 4 0], [2 3 1]]
  V = X·WV = [[1 2 3], [2 8 0], [2 6 3]]
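The projections above can be checked in a few lines of NumPy; this is just a verification sketch, and the variable names are mine rather than from the slides.

```python
import numpy as np

# Input: three 4-dimensional token embeddings, one per row (X1, X2, X3).
X = np.array([[1, 0, 1, 0],
              [0, 2, 0, 2],
              [1, 1, 1, 1]])

# Projection matrices (4 x 3) from the example.
W_Q = np.array([[1, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 1]])
W_K = np.array([[0, 0, 1], [1, 1, 0], [0, 1, 0], [1, 1, 0]])
W_V = np.array([[0, 2, 0], [0, 3, 0], [1, 0, 3], [1, 1, 0]])

Q = X @ W_Q   # [[1, 0, 2], [2, 2, 2], [2, 1, 3]]
K = X @ W_K   # [[0, 1, 1], [4, 4, 0], [2, 3, 1]]
V = X @ W_V   # [[1, 2, 3], [2, 8, 0], [2, 6, 3]]
```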
Scaling and SoftMax
  QKᵀ = [[2 4 4], [4 16 12], [4 12 10]]
  Scale by √dk with dk = 3:
    QKᵀ/√dk ≈ [[1.15 2.30 2.30], [2.30 9.23 6.92], [2.30 6.92 5.77]]
  Row-wise SoftMax:
    softmax(QKᵀ/√dk) ≈ [[0.14 0.43 0.43], [0.01 0.90 0.09], [0.01 0.75 0.24]]
  MatMul with V:
    Z = softmax(QKᵀ/√dk)·V ≈ [[1.86 6.32 1.71], [2.00 7.81 0.27], [1.99 7.48 0.74]]
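The full scaled dot-product attention step can be written the same way. The sketch below is a minimal NumPy version (function and variable names are mine) that reproduces the Z values above.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable row-wise softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]                    # key dimension (3 in this example)
    scores = Q @ K.T / np.sqrt(d_k)      # MatMul + Scale
    weights = softmax(scores, axis=-1)   # SoftMax: each row sums to 1
    return weights @ V                   # MatMul with V gives Z

# Q, K, V from the example above.
Q = np.array([[1, 0, 2], [2, 2, 2], [2, 1, 3]])
K = np.array([[0, 1, 1], [4, 4, 0], [2, 3, 1]])
V = np.array([[1, 2, 3], [2, 8, 0], [2, 6, 3]])

Z = scaled_dot_product_attention(Q, K, V)
# Z ≈ [[1.86, 6.32, 1.71],
#      [2.00, 7.81, 0.27],
#      [1.99, 7.48, 0.74]]
```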
Multi-Head Attention
(Figure: the h head outputs are concatenated (concat) and passed through a final Linear layer.)
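Each head has its own WQ, WK and WV, so different heads can attend to different things; their outputs are concatenated and mixed by one final linear layer. A rough sketch under those assumptions, reusing scaled_dot_product_attention from the previous block (all parameter names are hypothetical):

```python
import numpy as np

def multi_head_attention(X, W_Q_heads, W_K_heads, W_V_heads, W_O):
    # W_Q_heads / W_K_heads / W_V_heads: one projection matrix per head.
    # W_O: final output projection applied to the concatenated heads.
    head_outputs = []
    for W_Q, W_K, W_V in zip(W_Q_heads, W_K_heads, W_V_heads):
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        head_outputs.append(scaled_dot_product_attention(Q, K, V))  # Z_i
    Z = np.concatenate(head_outputs, axis=-1)   # concat Z_1 ... Z_h
    return Z @ W_O                              # final Linear layer
```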
Encoder
(Figure: back to the encoder's two sub-layers: Multi-Head Attention and Feed Forward.)
Feed Forward
(Figure: the feed-forward sub-layer is two Linear layers; the inner Linear layer is 4x wider than the model dimension.)
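A minimal sketch of this position-wise feed-forward sub-layer, assuming the usual Linear → ReLU → Linear form in which the inner layer is four times as wide as the model dimension (weight names are mine):

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # W1 maps d_model -> 4*d_model, W2 maps 4*d_model -> d_model (the "4x" expansion).
    hidden = np.maximum(0, x @ W1 + b1)   # first Linear layer + ReLU
    return hidden @ W2 + b2               # second Linear layer back to d_model
```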
Encoder architecture
(Figure: the encoder in full: a Multi-Head Attention sub-layer followed by the Feed-Forward sub-layer, i.e. a 4x-wide Linear layer and a Linear layer back down to the model dimension.)
Norm & Residual connections
(Figure: each sub-layer is wrapped in a residual connection and layer normalization, so its output is Norm(x + Sublayer(x)); this is applied after both the Multi-Head Attention and the Feed Forward sub-layers.)
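One encoder layer can then be sketched as the two sub-layers, each wrapped in a residual connection followed by layer normalization (the learnable LayerNorm gain and bias are omitted; attn and ffn stand for the sub-layers sketched above):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_layer(x, attn, ffn):
    # "Add & Norm" after every sub-layer: Norm(x + Sublayer(x)).
    x = layer_norm(x + attn(x))   # multi-head self-attention + residual + norm
    x = layer_norm(x + ffn(x))    # feed-forward + residual + norm
    return x
```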
Several layers of Encoders
(Figure: N encoders are stacked on top of each other; each encoder's output is the next encoder's input.)
Whole in one image
(Figure: the complete stack of N encoders shown in a single diagram.)
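Stacking N encoders is then just repeated application of that layer, as in this tiny sketch:

```python
def encoder_stack(x, layers):
    # The output of encoder i is the input of encoder i+1.
    for layer in layers:
        x = layer(x)
    return x
```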
Decoder
(Figure: each decoder has three sub-layers, each followed by Add & Normalize: a Masked Multi-Head Attention layer over the decoder input, a Multi-Head Attention layer whose queries Q come from the decoder while its keys K and values V come from the encoder output, and a Feed Forward layer.)
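A sketch of one decoder layer under these assumptions; the three sub-layers are passed in as callables, and cross_attn stands for the hypothetical encoder-decoder attention that takes its queries from the decoder and its keys/values from the encoder output (layer_norm is reused from the encoder sketch):

```python
def decoder_layer(y, enc_out, masked_self_attn, cross_attn, ffn):
    # 1) Masked multi-head self-attention over the decoder input.
    y = layer_norm(y + masked_self_attn(y))
    # 2) Encoder-decoder attention: Q from the decoder, K and V from the encoder.
    y = layer_norm(y + cross_attn(q_input=y, kv_input=enc_out))
    # 3) Position-wise feed-forward sub-layer.
    y = layer_norm(y + ffn(y))
    return y
```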
(Figure: the decoder side is likewise a stack of N decoders, from decoder #1 up to decoder #N.)
(Figure: translation example: the source sentence is encoded, and the decoder produces "Hello", "My", "Friend" one word at a time, starting from <sos> and consuming the words it has already generated.)
(Figure: the encoder output feeds the decoder; given the tokens generated so far, <sos> "Hello" "my", the decoder plus a final Linear layer predicts the next word, "friend".)
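At inference time this generation is autoregressive: start from <sos>, predict one token, append it to the decoder input, and repeat. A minimal greedy-decoding sketch, where decode_step is a hypothetical function wrapping the decoder stack plus the final Linear layer:

```python
def greedy_decode(encoder_out, decode_step, sos_id, eos_id, max_len=50):
    tokens = [sos_id]                               # decoder input starts with <sos>
    for _ in range(max_len):
        logits = decode_step(encoder_out, tokens)   # scores over the vocabulary
        next_id = int(logits[-1].argmax())          # greedily pick the most likely word
        tokens.append(next_id)                      # feed it back in at the next step
        if next_id == eos_id:                       # stop at end-of-sentence
            break
    return tokens
```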
Masked Multi-Head Attention
(Figure: while generating "hello my friend", each position may attend only to itself and to earlier positions.)
  Z = softmax((QKᵀ + Mask)/√dk)·V
  Mask = [[0 -∞ -∞], [0 0 -∞], [0 0 0]]
(Masked attention uses the same example as before: the input X, the weights WQ, WK, WV and the resulting Q, K, V are identical to those in the encoder's Scaled Dot-Product Attention example above.)
Scaling, masking and SoftMax
  QKᵀ = [[2 4 4], [4 16 12], [4 12 10]]
  QKᵀ/√dk ≈ [[1.15 2.30 2.30], [2.30 9.23 6.92], [2.30 6.92 5.77]]   (dk = 3)
  Add the mask:
    QKᵀ/√dk + Mask ≈ [[1.15 -∞ -∞], [2.30 9.23 -∞], [2.30 6.92 5.77]]
  Row-wise SoftMax (the -∞ entries become 0):
    softmax(QKᵀ/√dk + Mask) ≈ [[1.00 0.00 0.00], [0.00 0.99 0.00], [0.01 0.75 0.24]]
  MatMul with V:
    Z = softmax(QKᵀ/√dk + Mask)·V ≈ [[1.00 2.00 3.00], [2.00 7.99 0.00], [1.99 7.48 0.74]]
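The same numbers fall out of the earlier attention sketch once the causal mask is added; this version reuses softmax and the Q, K, V matrices defined above:

```python
import numpy as np

def masked_attention(Q, K, V):
    n, d_k = Q.shape[0], K.shape[-1]
    # Causal mask: 0 on and below the diagonal, -inf above it,
    # so position t cannot attend to positions after t.
    mask = np.triu(np.full((n, n), -np.inf), k=1)
    scores = Q @ K.T / np.sqrt(d_k) + mask
    return softmax(scores, axis=-1) @ V

Z = masked_attention(Q, K, V)
# Z ≈ [[1.00, 2.00, 3.00],
#      [2.00, 7.99, 0.00],
#      [1.99, 7.48, 0.74]]
```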
Positional Encoding
(Figure: a positional encoding vector is added (+) to every token embedding before it enters the encoder or the decoder.)
  p(t, i) = sin(ωk·t)   if i = 2k
  p(t, i) = cos(ωk·t)   if i = 2k + 1
  where ωk = 1 / 10000^(2k/d), t is the token position, i indexes the embedding dimension and d is the model dimension.
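A short sketch that builds the whole positional-encoding matrix from this formula (it assumes an even model dimension d):

```python
import numpy as np

def positional_encoding(max_len, d):
    # pe[t, 2k]   = sin(t / 10000**(2k/d))
    # pe[t, 2k+1] = cos(t / 10000**(2k/d))
    pe = np.zeros((max_len, d))
    t = np.arange(max_len)[:, None]          # token positions
    two_k = np.arange(0, d, 2)[None, :]      # even dimension indices 2k
    omega = 1.0 / 10000 ** (two_k / d)       # frequencies omega_k
    pe[:, 0::2] = np.sin(t * omega)
    pe[:, 1::2] = np.cos(t * omega)
    return pe

# The encoding is simply added to the token embeddings: X + positional_encoding(len(X), d)
```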
ViT
(Figure: the Vision Transformer: the image is cut into patches; a Linear Projection of Flattened Patches produces the patch embeddings 1-9, a learnable class token 0* is prepended, position embeddings are added, the sequence goes through a Transformer Encoder (Norm → Multi-Head Attention → Norm → MLP), and an MLP Head on the class token outputs the predicted class.)
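A rough sketch of the ViT input pipeline the figure describes: cut the image into patches, flatten and linearly project them, prepend the class token, and add position embeddings (every parameter here is a hypothetical placeholder, not a trained weight):

```python
import numpy as np

def vit_embed(image, patch_size, W_proj, cls_token, pos_embed):
    # image: (H, W, C); W_proj: (patch_size*patch_size*C, d);
    # cls_token: (1, d); pos_embed: (num_patches + 1, d).
    H, W, C = image.shape
    p = patch_size
    # Split into non-overlapping p x p patches and flatten each one.
    patches = image.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(-1, p * p * C)
    tokens = patches @ W_proj                 # Linear Projection of Flattened Patches
    tokens = np.vstack([cls_token, tokens])   # prepend the learnable [class] token (0*)
    return tokens + pos_embed                 # add position embeddings
```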