Training Perceptrons for Multiclass Problems in Neural Networks


Understanding the training process of perceptrons for multiclass problems in neural networks: notation for training sets, error functions, and the application of gradient descent in learning, along with strategies for training a single perceptron using the sigmoid function and for one-versus-all classification.





Presentation Transcript


  1. Neural Networks Part 2: Training Perceptrons, Handling Multiclass Problems
     CSE 4309 Machine Learning
     Vassilis Athitsos
     Computer Science and Engineering Department, University of Texas at Arlington

  2. Training a Neural Network
     In linear regression, for the sum-of-squares error, we could find the best weights using a closed-form formula. In logistic regression, for the cross-entropy error, we could find the best weights using an iterative method. In neural networks, we cannot find the best weights (unless we have an astronomical amount of luck). We only have optimization methods that find local minima of the error function. Still, in recent years such methods have produced spectacular results in real-world applications.

  3. Notation for Training Set
     We define w to be the vector of all weights in the neural network. We have a set X of N training examples:
     X = {x_1, x_2, …, x_N}
     Each x_n is a (D+1)-dimensional column vector. Dimension 0 is the bias input, always set to 1:
     x_n = (1, x_{n,1}, x_{n,2}, …, x_{n,D})ᵀ
     We also have a set T of N target outputs:
     T = {t_1, t_2, …, t_N}
     t_n is the target output for training example x_n. Each t_n is a K-dimensional column vector:
     t_n = (t_{n,1}, t_{n,2}, …, t_{n,K})ᵀ
     Note: K typically is not equal to D.
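The notation above can be sketched in Python with NumPy; the array names and the numbers below are illustrative, not from the slides:

```python
import numpy as np

# Hypothetical raw training inputs: N = 2 examples, D = 3 dimensions each.
X_raw = np.array([[0.5, -1.2, 3.0],
                  [2.0,  0.7, -0.3]])

# Prepend the bias input (dimension 0, always set to 1),
# so each x_n becomes a (D+1)-dimensional vector.
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

# K-dimensional target vectors t_n (here K = 2; note K need not equal D).
T = np.array([[1.0, 0.0],
              [0.0, 1.0]])

print(X.shape)   # (2, 4): N rows, D+1 columns
```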

  4. Perceptron Learning
     Before we discuss how to train an entire neural network, we start with a single perceptron. Remember: given input x_n, a perceptron computes its output y using this formula:
     y(x) = h(wᵀx)
     We use sum-of-squares as our error function. E_n(w) is the contribution of training example x_n:
     E_n(w) = ½ (y(x_n) − t_n)²
     The overall error E is defined as:
     E(w) = Σ_{n=1}^{N} E_n(w)
     Important: a single perceptron has a single output. Therefore, for perceptrons (but NOT for neural networks in general), we assume that t_n is one-dimensional.

  5. Perceptron Learning
     Suppose that a perceptron is using the step function as its activation function h:
     y(x) = h(wᵀx),  where  h(a) = 0 if a < 0,  and  h(a) = 1 if a ≥ 0
     Equivalently:  y(x) = 0 if wᵀx < 0,  and  y(x) = 1 if wᵀx ≥ 0
     Can we apply gradient descent in that case? No, because y(x) is not differentiable. Small changes of w usually lead to no changes in h(wᵀx). The only exception is when the change in w causes wᵀx to switch signs (from positive to negative, or from negative to positive).

  6. Perceptron Learning
     A better option is setting h to the sigmoid function:
     y(x) = σ(wᵀx) = 1 / (1 + e^(−wᵀx))
     Then, measured just on a single training object x_n, the error E_n(w) is defined as:
     E_n(w) = ½ (y(x_n) − t_n)² = ½ (1 / (1 + e^(−wᵀx_n)) − t_n)²
     Note: here we use the sum-of-squares error, and not the cross-entropy error that we used for logistic regression. Also note: if our neural network is a single perceptron, then the target output t_n is one-dimensional.
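As a quick sanity check of the formulas above, here is a minimal Python sketch (the function names are mine, not the course's):

```python
import numpy as np

def sigmoid(a):
    # sigma(a) = 1 / (1 + e^(-a))
    return 1.0 / (1.0 + np.exp(-a))

def example_error(w, x_n, t_n):
    # E_n(w) = 1/2 * (y(x_n) - t_n)^2, where y(x) = sigma(w^T x)
    y = sigmoid(w @ x_n)
    return 0.5 * (y - t_n) ** 2

w = np.array([0.0, 0.0])   # with w = 0, y = sigma(0) = 0.5 for any input
x = np.array([1.0, 2.0])   # dimension 0 is the bias input
print(example_error(w, x, 1.0))   # 0.5 * (0.5 - 1)^2 = 0.125
```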

  7. Computing the Gradient
     E_n(w) = ½ (y(x_n) − t_n)² = ½ (1 / (1 + e^(−wᵀx_n)) − t_n)²
     In this form, E_n(w) is differentiable. If we do the calculations, the gradient turns out to be:
     ∂E_n/∂w = (y(x_n) − t_n) y(x_n) (1 − y(x_n)) x_n
     Note that ∂E_n/∂w is a (D+1)-dimensional vector. It is a scalar, (y(x_n) − t_n) y(x_n) (1 − y(x_n)), multiplied by the vector x_n.
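One way to convince yourself that the gradient formula above is right is to compare it against a finite-difference approximation; a quick sketch, with made-up numbers:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def En(w, x_n, t_n):
    # E_n(w) = 1/2 * (sigma(w^T x_n) - t_n)^2
    return 0.5 * (sigmoid(w @ x_n) - t_n) ** 2

def grad_En(w, x_n, t_n):
    # Analytic gradient: a scalar, (y - t) * y * (1 - y), times the vector x_n.
    y = sigmoid(w @ x_n)
    return (y - t_n) * y * (1.0 - y) * x_n

w = np.array([0.3, -0.7, 0.1])   # arbitrary weights (bias weight first)
x = np.array([1.0, 0.5, -2.0])   # arbitrary input, bias dimension set to 1
t = 1.0
eps = 1e-6
# Central differences along each coordinate direction.
numeric = np.array([(En(w + eps * e, x, t) - En(w - eps * e, x, t)) / (2 * eps)
                    for e in np.eye(len(w))])
print(np.max(np.abs(numeric - grad_En(w, x, t))))   # close to zero
```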

  8. Weight Update
     ∂E_n/∂w = (y(x_n) − t_n) y(x_n) (1 − y(x_n)) x_n
     So, we update the weight vector w as follows:
     w = w − η (y(x_n) − t_n) y(x_n) (1 − y(x_n)) x_n
     As before, η is the learning rate parameter. It is a positive real number that should be chosen carefully, so as not to be too big or too small. In terms of individual weights w_d, the update rule is:
     w_d = w_d − η (y(x_n) − t_n) y(x_n) (1 − y(x_n)) x_{n,d}

  9. Perceptron Learning - Summary
     Input: training inputs x_1, …, x_N, target outputs t_1, …, t_N.
     1. Extend each x_n to a (D+1)-dimensional vector, by adding 1 (the bias input) as the value for dimension 0.
     2. Initialize weights w_d to small random numbers. For example, set each w_d between −0.1 and 0.1.
     3. For n = 1 to N:
        1. Compute y(x_n).
        2. For d = 0 to D:  w_d = w_d − η (y(x_n) − t_n) y(x_n) (1 − y(x_n)) x_{n,d}
     4. If some stopping criterion has been met, exit.
     5. Else, go to step 3.
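The five steps above can be sketched as a Python function; the function name, the learning rate, and the toy data are my own choices, not the slides':

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_perceptron(X, t, eta=1.0, tol=1e-5, max_epochs=10000):
    # X: N x (D+1) inputs with the bias column of 1s already added (step 1).
    # t: N targets in {0, 1}.
    rng = np.random.default_rng(0)
    w = rng.uniform(-0.1, 0.1, X.shape[1])        # step 2: small random weights
    prev_E = np.inf
    for _ in range(max_epochs):
        for x_n, t_n in zip(X, t):                # step 3: one update per example
            y = sigmoid(w @ x_n)
            w = w - eta * (y - t_n) * y * (1.0 - y) * x_n
        E = 0.5 * np.sum((sigmoid(X @ w) - t) ** 2)
        if abs(prev_E - E) < tol:                 # steps 4-5: stop or repeat
            break
        prev_E = E
    return w

# Toy linearly separable problem: a bias column plus one feature.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
t = np.array([0.0, 0.0, 1.0, 1.0])
w = train_perceptron(X, t)
print((sigmoid(X @ w) > 0.5).astype(int))   # thresholded outputs per example
```

Here the stopping criterion is a small epoch-to-epoch change in the cumulative error, matching the criterion the next slide describes.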

  10. Stopping Criterion
     At step 4 of the perceptron learning algorithm, we need to decide whether to stop or not. One thing we can do is compute the cumulative squared error E(w) of the perceptron at that point:
     E(w) = Σ_{n=1}^{N} E_n(w) = ½ Σ_{n=1}^{N} (y(x_n) − t_n)²
     Compare the current value of E(w) with the value of E(w) computed at the previous iteration. If the difference is too small (e.g., smaller than 0.00001), we stop.

  11. Using Perceptrons for Multiclass Problems
     Multiclass means that we have more than two classes. A perceptron outputs a number between 0 and 1. This is sufficient only for binary classification problems. For more than two classes, there are many different options. We will follow a general approach called one-versus-all classification.

  12. A Multiclass Example
     Suppose we have this training set:
     x_1 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_1 = 3
     x_2 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_2 = 3
     x_3 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_3 = 2
     x_4 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_4 = 1
     x_5 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_5 = 2
     x_6 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_6 = 1
     In this training set: we have three classes, and each training input is a five-dimensional vector.

  13. A Multiclass Example
     Suppose we have this training set:
     x_1 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_1 = 3
     x_2 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_2 = 3
     x_3 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_3 = 2
     x_4 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_4 = 1
     x_5 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_5 = 2
     x_6 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_6 = 1
     In this training set, classes are numbered sequentially starting from 1. Thus, in our example, the class labels are 1, 2, 3. If your dataset uses different labels (like "red", "green", "blue"), you should systematically change the labels to follow this convention.

  14. Converting to One-Versus-All
     Suppose we have this training set:
     x_1 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_1 = 3
     x_2 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_2 = 3
     x_3 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_3 = 2
     x_4 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_4 = 1
     x_5 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_5 = 2
     x_6 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_6 = 1
     Step 1: Convert each target output to a binary vector, having as many dimensions as the number of classes. In our example we have three classes, so each t_n will become a three-dimensional binary vector.

  15. Converting to One-Versus-All
     Suppose we have this training set:
     x_1 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_1 = 3
     x_2 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_2 = 3
     x_3 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_3 = 2
     x_4 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_4 = 1
     x_5 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_5 = 2
     x_6 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_6 = 1
     Step 1: Convert each target output to a binary vector, having as many dimensions as the number of classes. In our example we have three classes, so each t_n will become a three-dimensional binary vector:
     t_1 = (?, ?, ?)ᵀ
     t_2 = (?, ?, ?)ᵀ
     t_3 = (?, ?, ?)ᵀ
     t_4 = (?, ?, ?)ᵀ
     t_5 = (?, ?, ?)ᵀ
     t_6 = (?, ?, ?)ᵀ

  16. Converting to One-Versus-All
     Suppose we have this training set:
     x_1 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_1 = 3
     x_2 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_2 = 3
     x_3 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_3 = 2
     x_4 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_4 = 1
     x_5 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_5 = 2
     x_6 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_6 = 1
     Step 1: Convert each target output to a binary vector, having as many dimensions as the number of classes. For each t_n, the i-th dimension is set as follows: if x_n belongs to class i, then set the i-th dimension of t_n to 1; otherwise, set the i-th dimension of t_n to 0.
     t_1 = (?, ?, ?)ᵀ
     t_2 = (?, ?, ?)ᵀ
     t_3 = (?, ?, ?)ᵀ
     t_4 = (?, ?, ?)ᵀ
     t_5 = (?, ?, ?)ᵀ
     t_6 = (?, ?, ?)ᵀ

  17. Converting to One-Versus-All
     Suppose we have this training set:
     x_1 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_1 = 3
     x_2 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_2 = 3
     x_3 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_3 = 2
     x_4 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_4 = 1
     x_5 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_5 = 2
     x_6 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_6 = 1
     Step 1: Convert each target output to a binary vector, having as many dimensions as the number of classes. For each t_n, the i-th dimension is set as follows: if x_n belongs to class i, then set the i-th dimension of t_n to 1; otherwise, set the i-th dimension of t_n to 0.
     t_1 = (0, 0, 1)ᵀ
     t_2 = (0, 0, 1)ᵀ
     t_3 = (0, 1, 0)ᵀ
     t_4 = (1, 0, 0)ᵀ
     t_5 = (0, 1, 0)ᵀ
     t_6 = (1, 0, 0)ᵀ
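Step 1 above can be sketched in Python; the function name is illustrative:

```python
import numpy as np

def to_one_versus_all(labels, num_classes):
    # Convert class labels 1..K into K-dimensional binary target vectors:
    # dimension i of t_n is 1 iff x_n belongs to class i, and 0 otherwise.
    T = np.zeros((len(labels), num_classes))
    for n, c in enumerate(labels):
        T[n, c - 1] = 1.0   # classes are numbered from 1, columns from 0
    return T

labels = [3, 3, 2, 1, 2, 1]   # the class labels from the example above
T = to_one_versus_all(labels, 3)
print(T[0])   # [0. 0. 1.]
```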

  18. Converting to One-Versus-All
     Training set:
     x_1 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_1 = 3
     x_2 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_2 = 3
     x_3 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_3 = 2
     x_4 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_4 = 1
     x_5 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_5 = 2
     x_6 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_6 = 1
     Binary target vectors: t_1 = (0, 0, 1)ᵀ, t_2 = (0, 0, 1)ᵀ, t_3 = (0, 1, 0)ᵀ, t_4 = (1, 0, 0)ᵀ, t_5 = (0, 1, 0)ᵀ, t_6 = (1, 0, 0)ᵀ
     Step 2: Train three separate perceptrons (as many as the number of classes). For training the first perceptron, use the first dimension of each t_n as target output for x_n.

  19. Training Set for the First Perceptron
     Here is what the training set looks like for the first perceptron:
     x_1 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_1 = 0
     x_2 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_2 = 0
     x_3 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_3 = 0
     x_4 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_4 = 1
     x_5 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_5 = 0
     x_6 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_6 = 1
     Essentially, the first perceptron is trained to output 1 when (according to the original class labels) the input belongs to class 1.

  20. Converting to One-Versus-All
     Training set:
     x_1 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_1 = 3
     x_2 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_2 = 3
     x_3 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_3 = 2
     x_4 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_4 = 1
     x_5 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_5 = 2
     x_6 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_6 = 1
     Binary target vectors: t_1 = (0, 0, 1)ᵀ, t_2 = (0, 0, 1)ᵀ, t_3 = (0, 1, 0)ᵀ, t_4 = (1, 0, 0)ᵀ, t_5 = (0, 1, 0)ᵀ, t_6 = (1, 0, 0)ᵀ
     Step 2: Train three separate perceptrons (as many as the number of classes). For training the second perceptron, use the second dimension of each t_n as target output for x_n.

  21. Training Set for the Second Perceptron
     Here is what the training set looks like for the second perceptron:
     x_1 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_1 = 0
     x_2 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_2 = 0
     x_3 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_3 = 1
     x_4 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_4 = 0
     x_5 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_5 = 1
     x_6 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_6 = 0
     Essentially, the second perceptron is trained to output 1 when (according to the original class labels) the input belongs to class 2.

  22. Converting to One-Versus-All
     Suppose we have this training set:
     x_1 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_1 = 3
     x_2 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_2 = 3
     x_3 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_3 = 2
     x_4 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_4 = 1
     x_5 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_5 = 2
     x_6 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_6 = 1
     Binary target vectors: t_1 = (0, 0, 1)ᵀ, t_2 = (0, 0, 1)ᵀ, t_3 = (0, 1, 0)ᵀ, t_4 = (1, 0, 0)ᵀ, t_5 = (0, 1, 0)ᵀ, t_6 = (1, 0, 0)ᵀ
     Step 2: Train three separate perceptrons (as many as the number of classes). For training the third perceptron, use the third dimension of each t_n as target output for x_n.

  23. Training Set for the Third Perceptron
     Here is what the training set looks like for the third perceptron:
     x_1 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_1 = 1
     x_2 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_2 = 1
     x_3 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_3 = 0
     x_4 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_4 = 0
     x_5 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_5 = 0
     x_6 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_6 = 0
     Essentially, the third perceptron is trained to output 1 when (according to the original class labels) the input belongs to class 3.

  24. One-Versus-All Perceptrons
     Suppose we have K classes C_1, …, C_K, where K > 2. We have training inputs x_1, …, x_N, and target values t_1, …, t_N. Each target value t_n is a K-dimensional vector:
     t_n = (t_{n,1}, t_{n,2}, …, t_{n,K})ᵀ
     t_{n,k} = 0 if the class of x_n is not C_k.
     t_{n,k} = 1 if the class of x_n is C_k.
     For each class C_k, train a perceptron p_k by using t_{n,k} as the target value for x_n. So, perceptron p_k is trained to recognize if an object belongs to class C_k or not. In total, we train K perceptrons, one for each class.

  25. One-Versus-All Perceptrons
     To classify a test pattern x: compute the responses p_k(x) for all K perceptrons. Find the perceptron p_{k*} such that the value p_{k*}(x) is higher than all other responses. Output that the class of x is C_{k*}. In summary: we assign x to the class whose perceptron produced the highest output value for x.
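The classification rule above amounts to an argmax over the K perceptron responses; a minimal sketch, with hypothetical weights:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def classify_one_versus_all(W, x):
    # W is K x (D+1); row k holds the weights of perceptron p_k.
    # Assign x to the class whose perceptron gives the highest response.
    responses = sigmoid(W @ x)
    return int(np.argmax(responses)) + 1   # classes are numbered from 1

# Hypothetical weights for K = 3 perceptrons, D = 2 features plus bias.
W = np.array([[ 0.5,  2.0, -1.0],
              [-0.2, -1.0,  2.0],
              [ 0.0, -1.0, -1.0]])
x = np.array([1.0, 3.0, 0.0])   # bias input 1, then the two features
print(classify_one_versus_all(W, x))   # perceptron 1 responds most strongly
```

Because the sigmoid is monotonic, taking the argmax of σ(w_kᵀx) gives the same answer as taking the argmax of w_kᵀx directly.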

  26. Multiclass Neural Networks
     For perceptrons, we saw that we can perform multiclass classification (i.e., for more than two classes) by training one perceptron for each class. For neural networks, we will train a SINGLE neural network, with MULTIPLE output units. The number of output units will be equal to the number of classes.

  27. A Multiclass Example
     Suppose we have this training set (same as before):
     x_1 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_1 = (0, 0, 1)ᵀ
     x_2 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_2 = (0, 0, 1)ᵀ
     x_3 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_3 = (0, 1, 0)ᵀ
     x_4 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_4 = (1, 0, 0)ᵀ
     x_5 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_5 = (0, 1, 0)ᵀ
     x_6 = (0.52, 0.46, 0.34, 0.20, 0.50)ᵀ,  t_6 = (1, 0, 0)ᵀ
     In this training set: we have three classes, and each training input is a five-dimensional vector.

  28. A Network for Our Example
     [Figure: a network with an input layer of five units (units 1-5), two hidden layers of four units each (units 6-9 and 10-13), and an output layer of three units (units 14-16).]

  29. Note: there is ALWAYS a bias input unit, unit 0, that we don't show. The output of unit 0 is always 1, and it is an input to ALL non-input units (in this example, units 6, 7, …, 16).

  30. Input layer: in our example, it must have five units, because each training input is five-dimensional.

  31. Hidden layers: this network has two hidden layers, with four units per layer. The number of hidden layers and the number of units per layer are hyperparameters; they can take different values.

  32. Output layer: in our example, it must have three units, because we want to recognize three different classes.

  33. Network connectivity: in this neural network, every non-input unit receives as input the output of ALL units in the previous layer. This is also a hyperparameter; it doesn't have to be like that.

  34. Next: Training
     The next set of slides will describe how to train such a network. Training a neural network is done using gradient descent. The specific method is called backpropagation, but it really is just a straightforward application of gradient descent for neural networks.
