Advanced Methods in Classification: Bayesian Techniques and Theorems


Explore advanced classification methods like Bayes classifiers and Bayesian belief networks. Learn about Bayesian classification, Bayes theorem basics, prediction based on Bayes theorem, and more. Understand how these techniques help in probabilistic prediction and optimal decision making in data mining. Leveraging the power of Bayes theorem, you can effectively determine posterior probabilities and make informed predictions in various scenarios.

  • Classification
  • Bayesian Techniques
  • Bayes Classifiers
  • Data Mining
  • Probabilistic Prediction




Presentation Transcript


  1. Knowledge Data Discovery TOPIC 9 - Classification: Advanced Methods (1) Antoni Wibowo

  2. COURSE OUTLINE 1. BAYES CLASSIFIERS 2. NAIVE BAYES CLASSIFIERS 3. BAYESIAN BELIEF NETWORKS 4. ARTIFICIAL NEURAL NETWORKS (ANN)

  3. Note: These slides are based on the additional material provided with the textbooks that we use: J. Han, M. Kamber, and J. Pei, "Data Mining: Concepts and Techniques", and P. Tan, M. Steinbach, and V. Kumar, "Introduction to Data Mining".

  4. Bayes Classifiers

  5. Bayesian Classification: Why? A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities. Foundation: based on Bayes theorem. Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable to decision tree and selected neural network classifiers. Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct, and prior knowledge can be combined with observed data. Standard: even when Bayesian methods are computationally intractable, they provide a standard of optimal decision making against which other methods can be measured.

  6. Bayes Theorem: Basics. Let X be a data sample ("evidence") whose class label is unknown. Let H be the hypothesis that X belongs to class C. Classification is to determine P(H|X), the posterior probability: the probability that the hypothesis holds given the observed data sample X. P(H) is the prior probability: the initial probability, e.g., that X will buy a computer regardless of age, income, etc. P(X) is the probability that the sample data is observed. P(X|H) is the likelihood: the probability of observing the sample X given that the hypothesis holds, e.g., given that X will buy a computer, the probability that X is aged 31..40 with medium income. Total probability theorem: P(B) = Σ_i P(B|A_i) P(A_i). Bayes theorem: P(H|X) = P(X|H) P(H) / P(X).
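
Bayes theorem can be evaluated directly once the three quantities on the right-hand side are available. Below is a minimal sketch; the helper name and the numeric values are hypothetical, chosen only to illustrate the arithmetic.

```python
# Minimal illustration of Bayes theorem: P(H|X) = P(X|H) * P(H) / P(X).
# All numbers below are hypothetical.

def posterior(p_x_given_h, p_h, p_x):
    """Return P(H|X) from the likelihood, the prior, and the evidence."""
    return p_x_given_h * p_h / p_x

p_h = 0.6          # prior P(H), e.g., that a customer buys a computer
p_x_given_h = 0.3  # likelihood P(X|H) of observing this customer's profile among buyers
p_x = 0.25         # evidence P(X), the overall probability of observing that profile

print(posterior(p_x_given_h, p_h, p_x))  # 0.72
```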

  7. Prediction Based on Bayes Theorem. Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes theorem: P(H|X) = P(X|H) P(H) / P(X). Informally, this can be viewed as posterior = likelihood × prior / evidence. Predict that X belongs to Ci iff the probability P(Ci|X) is the highest among all the P(Ck|X) for the k classes. Practical difficulty: it requires initial knowledge of many probabilities, involving significant computational cost.

  8. Classification Is to Derive the Maximum Posteriori. Let D be a training set of tuples and their associated class labels, where each tuple is represented by an n-dimensional attribute vector X = (x1, x2, ..., xn). Suppose there are m classes C1, C2, ..., Cm. Classification is to derive the maximum posteriori, i.e., the maximal P(Ci|X). This can be derived from Bayes theorem: P(Ci|X) = P(X|Ci) P(Ci) / P(X). Since P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized.
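
Because P(X) is the same for every class, the decision rule reduces to picking the class with the largest product P(X|Ci) P(Ci). A small sketch of that rule follows; the numbers reuse the priors and likelihoods from the buys_computer example later in these slides.

```python
# Sketch of the maximum a posteriori decision rule: choose the class Ci that
# maximizes P(X|Ci) * P(Ci); the evidence P(X) is constant and can be dropped.

def map_class(likelihoods, priors):
    """Return argmax over classes of P(X|Ci) * P(Ci)."""
    return max(priors, key=lambda c: likelihoods[c] * priors[c])

priors = {"yes": 9 / 14, "no": 5 / 14}       # P(Ci) from the buys_computer data
likelihoods = {"yes": 0.044, "no": 0.019}    # P(X|Ci) for one tuple X

print(map_class(likelihoods, priors))  # 'yes'
```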

  9. Naive Bayes Classifiers

  10. Naïve Bayes Classifier. A simplifying assumption: attributes are conditionally independent given the class (i.e., there is no dependence relation between attributes), so P(X|Ci) = Π_{k=1..n} P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × ... × P(xn|Ci). This greatly reduces the computation cost: only the class distribution needs to be counted. If Ak is categorical, P(xk|Ci) is the number of tuples in Ci having value xk for Ak divided by |Ci,D| (the number of tuples of Ci in D). If Ak is continuous-valued, P(xk|Ci) is usually computed from a Gaussian distribution with mean μ and standard deviation σ, g(x, μ, σ) = (1 / (√(2π) σ)) e^(−(x−μ)² / (2σ²)), so that P(xk|Ci) = g(xk, μ_Ci, σ_Ci).
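
The two likelihood estimates described above can be written directly as small helper functions. The sketch below is illustrative only: the function names and the sample ages are assumptions, not part of the slides.

```python
import math

def categorical_likelihood(value, class_values):
    """P(xk|Ci) for a categorical attribute: relative frequency within class Ci."""
    return class_values.count(value) / len(class_values)

def gaussian_likelihood(x, class_values):
    """P(xk|Ci) for a continuous attribute: Gaussian fitted to class Ci's values."""
    mu = sum(class_values) / len(class_values)
    var = sum((v - mu) ** 2 for v in class_values) / len(class_values)
    sigma = math.sqrt(var)
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Hypothetical ages of customers in one class:
ages_in_class = [25, 35, 38, 42, 45, 30, 36, 33, 29]
print(gaussian_likelihood(31, ages_in_class))
```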

  11. Naïve Bayes Classifier: Training Dataset. Classes: C1: buys_computer = yes; C2: buys_computer = no. Data to be classified: X = (age <= 30, income = medium, student = yes, credit_rating = fair).

      age      income   student   credit_rating   buys_computer
      <=30     high     no        fair            no
      <=30     high     no        excellent       no
      31..40   high     no        fair            yes
      >40      medium   no        fair            yes
      >40      low      yes       fair            yes
      >40      low      yes       excellent       no
      31..40   low      yes       excellent       yes
      <=30     medium   no        fair            no
      <=30     low      yes       fair            yes
      >40      medium   yes       fair            yes
      <=30     medium   yes       excellent       yes
      31..40   medium   no        excellent       yes
      31..40   high     yes       fair            yes
      >40      medium   no        excellent       no

  12. Naïve Bayes Classifier: An Example. Using the training data above, classify X = (age <= 30, income = medium, student = yes, credit_rating = fair). P(Ci): P(buys_computer = yes) = 9/14 = 0.643; P(buys_computer = no) = 5/14 = 0.357. Compute P(X|Ci) for each class: P(age <= 30 | buys_computer = yes) = 2/9 = 0.222; P(age <= 30 | buys_computer = no) = 3/5 = 0.6; P(income = medium | buys_computer = yes) = 4/9 = 0.444; P(income = medium | buys_computer = no) = 2/5 = 0.4; P(student = yes | buys_computer = yes) = 6/9 = 0.667; P(student = yes | buys_computer = no) = 1/5 = 0.2; P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667; P(credit_rating = fair | buys_computer = no) = 2/5 = 0.4. P(X|Ci): P(X | buys_computer = yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044; P(X | buys_computer = no) = 0.6 × 0.4 × 0.2 × 0.4 = 0.019. P(X|Ci) × P(Ci): P(X | buys_computer = yes) × P(buys_computer = yes) = 0.028; P(X | buys_computer = no) × P(buys_computer = no) = 0.007. Therefore, X belongs to class buys_computer = yes.
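
The hand computation above can be checked with a few lines of code. The sketch below simply multiplies the conditional probabilities and priors copied from this slide.

```python
# Reproduce the slide's computation for
# X = (age <= 30, income = medium, student = yes, credit_rating = fair).

priors = {"yes": 9 / 14, "no": 5 / 14}

conditionals = {
    "yes": {"age<=30": 2 / 9, "income=medium": 4 / 9, "student=yes": 6 / 9, "credit=fair": 6 / 9},
    "no":  {"age<=30": 3 / 5, "income=medium": 2 / 5, "student=yes": 1 / 5, "credit=fair": 2 / 5},
}

for c in ("yes", "no"):
    likelihood = 1.0
    for p in conditionals[c].values():
        likelihood *= p                       # P(X|Ci)
    print(c, round(likelihood, 3), round(likelihood * priors[c], 3))
# yes 0.044 0.028
# no 0.019 0.007
```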

  13. Avoiding the Zero-Probability Problem. Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise, the predicted probability P(X|Ci) = Π_{k=1..n} P(xk|Ci) will be zero. Example: suppose a dataset with 1000 tuples in which income = low occurs 0 times, income = medium 990 times, and income = high 10 times. Use the Laplacian correction (Laplacian estimator): adding 1 to each case gives Prob(income = low) = 1/1003, Prob(income = medium) = 991/1003, and Prob(income = high) = 11/1003. The corrected probability estimates are close to their uncorrected counterparts.
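
The correction adds 1 to each value's count and adds the number of distinct values to the denominator. A minimal sketch using the income counts from this slide:

```python
# Laplacian correction: add 1 to each count; the denominator grows by the
# number of distinct values, so no estimate is exactly zero.

def laplace_estimates(counts):
    total = sum(counts.values()) + len(counts)   # 1000 + 3 = 1003
    return {value: (n + 1) / total for value, n in counts.items()}

income_counts = {"low": 0, "medium": 990, "high": 10}
print(laplace_estimates(income_counts))
# low: 1/1003, medium: 991/1003, high: 11/1003
```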

  14. Naïve Bayes (Summary): Advantages. Easy to implement. Good results obtained in most cases. Robust to isolated noise points. Handles missing values by ignoring the instance during probability estimate calculations. Robust to irrelevant attributes. However, the independence assumption may not hold for some attributes; in that case, use other techniques such as Bayesian Belief Networks (BBN).

  15. Naïve Bayes (Summary): Disadvantages. The assumption of class-conditional independence causes a loss of accuracy, because in practice dependencies exist among variables. E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.). Dependencies among these cannot be modeled by a naïve Bayes classifier.

  16. Bayesian Belief Networks

  17. Bayesian Belief Networks. A Bayesian belief network (also known as a Bayesian network or probabilistic network) allows class-conditional independencies between subsets of variables. It has two components: (1) a directed acyclic graph (called the structure) and (2) a set of conditional probability tables (CPTs). It is a (directed acyclic) graphical model of causal influence relationships: it represents dependency among the variables and gives a specification of the joint probability distribution. Nodes are random variables; links represent dependency. In the example graph, X and Y are the parents of Z, and Y is the parent of P; there is no dependency between Z and P; the graph has no loops or cycles.

  18. A Bayesian Network and Some of Its CPTs. The example network has nodes Tampering (T), Fire (F), Alarm (A), Smoke (S), Leaving (L), and Report (R). CPT: conditional probability table; it shows the conditional probability of a node for each possible combination of values of its parents. For example, P(Smoke = true | Fire = true) = 0.90 and P(Smoke = true | Fire = false) = 0.01; P(Alarm = true | Fire = true, Tampering = true) = 0.5, P(Alarm = true | Fire = true, Tampering = false) = 0.99, P(Alarm = true | Fire = false, Tampering = true) = 0.85, and P(Alarm = true | Fire = false, Tampering = false) = 0.0001. Derivation of the probability of a particular combination of values from the CPTs: P(x1, ..., xn) = Π_{i=1..n} P(xi | Parents(xi)).
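
The chain-rule product above can be computed by walking the network and multiplying one CPT entry per node. The sketch below uses the Smoke and Alarm CPTs from this slide; the priors for Fire and Tampering are not given on the slide, so the values used here are assumptions for illustration only.

```python
# Joint probability from CPTs: P(x1,...,xn) = prod_i P(xi | Parents(xi)).

p_fire = {True: 0.01, False: 0.99}        # assumed prior, not from the slide
p_tampering = {True: 0.02, False: 0.98}   # assumed prior, not from the slide

p_smoke_true_given_fire = {True: 0.90, False: 0.01}   # P(Smoke=true | Fire)
p_alarm_true_given = {                                 # P(Alarm=true | Fire, Tampering)
    (True, True): 0.5, (True, False): 0.99,
    (False, True): 0.85, (False, False): 0.0001,
}

def joint(fire, tampering, smoke, alarm):
    p = p_fire[fire] * p_tampering[tampering]
    p_s = p_smoke_true_given_fire[fire]
    p *= p_s if smoke else 1 - p_s
    p_a = p_alarm_true_given[(fire, tampering)]
    p *= p_a if alarm else 1 - p_a
    return p

print(joint(fire=True, tampering=False, smoke=True, alarm=True))
```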

  19. How Are Bayesian Networks Constructed? Subjective construction: identification of (direct) causal structure. People are quite good at identifying direct causes from a given set of variables and at judging whether the set contains all relevant direct causes. Markovian assumption: each variable becomes independent of its non-effects once its direct causes are known; e.g., in S ← F → A ← T, the path between S and A is blocked once we know F. HMM (Hidden Markov Model): often used to model dynamic systems whose states are not observable, yet their outputs are. Synthesis from other specifications: e.g., from a formal system design with block diagrams and information flow. Learning from data: e.g., from medical records or student admission records; learn the parameters given the structure, or learn both structure and parameters. Maximum likelihood principle: favors Bayesian networks that maximize the probability of observing the given data set.

  20. Training Bayesian Networks: Several Scenarios. Scenario 1: given both the network structure and all variables observable, compute only the CPT entries. Scenario 2: network structure known, some variables hidden: use a gradient descent (greedy hill-climbing) method, i.e., search for a solution along the steepest descent of a criterion function. Weights are initialized to random probability values; at each iteration, the method moves towards what appears to be the best solution at the moment, without backtracking; weights are updated at each iteration and converge to a local optimum. Scenario 3: network structure unknown, all variables observable: search through the model space to reconstruct the network topology. Scenario 4: unknown structure, all hidden variables: no good algorithms are known for this purpose. See D. Heckerman, A Tutorial on Learning with Bayesian Networks, in Learning in Graphical Models, M. Jordan, ed., MIT Press, 1999.

  21. Artificial Neural Networks (ANN)

  22. Classification by Backpropagation. Backpropagation: a neural network learning algorithm. It was started by psychologists and neurobiologists seeking to develop and test computational analogues of neurons. A neural network is a set of connected input/output units in which each connection has a weight associated with it. During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the input tuples. Also referred to as connectionist learning due to the connections between units.

  23. Neuron: A Hidden/Output Layer Unit. An n-dimensional input vector x is mapped onto the variable y by means of a scalar product and a nonlinear function mapping, for example y = sign(Σ_{i=0..n} w_i x_i + μ_k), where x is the input vector, w the weight vector, and μ_k the bias of unit k. The inputs to the unit are outputs from the previous layer. They are multiplied by their corresponding weights to form a weighted sum, which is added to the bias associated with the unit. Then a nonlinear activation function f is applied to it to produce the output y.
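
A single unit of this kind is only a weighted sum plus a bias passed through an activation function. The sketch below uses a sign activation; the weights, bias, and input values are hypothetical.

```python
# One hidden/output layer unit: weighted sum of inputs plus bias, then a
# nonlinear activation (sign here). All numbers are hypothetical.

def unit_output(x, w, bias):
    weighted_sum = sum(wi * xi for wi, xi in zip(w, x)) + bias
    return 1 if weighted_sum >= 0 else -1   # sign activation

print(unit_output(x=[0.5, -1.0, 2.0], w=[0.4, 0.1, 0.3], bias=-0.2))  # 1
```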

  24. Artificial Neural Networks (ANN). A neural network can be viewed as a black box that maps inputs X1, X2, X3 to an output Y. In this example, the output Y is 1 if at least two of the three inputs are equal to 1:

      X1   X2   X3   Y
      1    0    0    0
      1    0    1    1
      1    1    0    1
      1    1    1    1
      0    0    1    0
      0    1    0    0
      0    1    1    1
      0    0    0    0

  25. Artificial Neural Networks (ANN). The same mapping can be produced by a single output node with weight 0.3 on each input and threshold t = 0.4: Y = I(0.3 X1 + 0.3 X2 + 0.3 X3 − 0.4 > 0), where I(z) = 1 if z is true and 0 otherwise.
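
A few lines of code confirm that this node reproduces the truth table from the previous slide. The helper name below is illustrative.

```python
# Output node from this slide: Y = I(0.3*X1 + 0.3*X2 + 0.3*X3 - 0.4 > 0),
# checked against "Y = 1 if at least two of the three inputs are 1".

def output(x1, x2, x3, t=0.4):
    z = 0.3 * x1 + 0.3 * x2 + 0.3 * x3 - t
    return 1 if z > 0 else 0

for x1, x2, x3 in [(1,0,0), (1,0,1), (1,1,0), (1,1,1), (0,0,1), (0,1,0), (0,1,1), (0,0,0)]:
    assert output(x1, x2, x3) == (1 if x1 + x2 + x3 >= 2 else 0)
print("the node reproduces the truth table")
```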

  26. Artificial Neural Networks (ANN). The perceptron model is an assembly of inter-connected nodes and weighted links. The output node sums up each of its input values according to the weights of its links and compares the result against some threshold t: Y = I(Σ_i w_i X_i − t > 0), or equivalently Y = sign(Σ_i w_i X_i − t).

  27. How a Multi-Layer Neural Network Works. The inputs to the network correspond to the attributes measured for each training tuple. Inputs are fed simultaneously into the units making up the input layer. They are then weighted and fed simultaneously to a hidden layer. The number of hidden layers is arbitrary, although usually only one is used. The weighted outputs of the last hidden layer are input to the units making up the output layer, which emits the network's prediction. The network is feed-forward: none of the weights cycles back to an input unit or to an output unit of a previous layer. From a statistical point of view, networks perform nonlinear regression: given enough hidden units and enough training samples, they can closely approximate any function.

  28. A Multi-Layer Feed-Forward Neural Network. The input vector X enters the input layer, is weighted by the connections w_ij into the hidden layer, and the hidden layer's outputs feed the output layer, which produces the output vector. The weights are adjusted iteratively, e.g. w_j^(k+1) = w_j^(k) + λ (y_i − ŷ_i^(k)) x_ij, where λ is the learning rate, y_i is the target value, and ŷ_i^(k) is the prediction at step k.
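
The weight update above moves each weight by the learning rate times the prediction error times the corresponding input. A minimal sketch, assuming a learning rate of 0.1 and made-up example numbers:

```python
# Per-tuple weight update: w_j(k+1) = w_j(k) + lam * (y_true - y_pred) * x_j.
# The learning rate and the example values are assumptions.

def update_weights(w, x, y_true, y_pred, lam=0.1):
    return [wj + lam * (y_true - y_pred) * xj for wj, xj in zip(w, x)]

w = [0.3, 0.3, 0.3]
print(update_weights(w, x=[1, 0, 1], y_true=1, y_pred=0))  # [0.4, 0.3, 0.4]
```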

  29. Defining a Network Topology. Decide the network topology: specify the number of units in the input layer, the number of hidden layers (if more than one), the number of units in each hidden layer, and the number of units in the output layer. Normalize the input values for each attribute measured in the training tuples to [0.0, 1.0]. Use one input unit per domain value, each initialized to 0. For the output, if classifying with more than two classes, use one output unit per class. Once a network has been trained, if its accuracy is unacceptable, repeat the training process with a different network topology or a different set of initial weights.

  30. Learning Algorithm: Backpropagation. Iteratively process a set of training tuples and compare the network's prediction with the actual known target value. For each training tuple, the weights are modified to minimize the mean squared error between the network's prediction and the actual target value. Modifications are made in the backwards direction: from the output layer, through each hidden layer, down to the first hidden layer, hence "backpropagation". Steps: initialize weights (and associated biases) to small random numbers; propagate the inputs forward (by applying the activation function); backpropagate the error (by updating weights and biases); stop when the terminating condition is met (e.g., when the error is very small). A code sketch of these steps follows.
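
The sketch below walks the listed steps for a tiny network with one hidden layer of sigmoid units, trained on the "at least two of three inputs" data from the ANN slides. The layer sizes, learning rate, and epoch count are assumptions made for illustration.

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

random.seed(0)
n_in, n_hid = 3, 2
# Step 1: initialize weights to small random numbers (bias stored last).
w_hid = [[random.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hid)]
w_out = [random.uniform(-0.5, 0.5) for _ in range(n_hid + 1)]
lr = 0.5

data = [((1,0,0),0), ((1,0,1),1), ((1,1,0),1), ((1,1,1),1),
        ((0,0,1),0), ((0,1,0),0), ((0,1,1),1), ((0,0,0),0)]

def forward(x):
    h = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + ws[-1]) for ws in w_hid]
    o = sigmoid(sum(w * hi for w, hi in zip(w_out, h)) + w_out[-1])
    return h, o

for epoch in range(5000):
    for x, y in data:
        # Step 2: propagate the inputs forward.
        h, o = forward(x)
        # Step 3: backpropagate the error (output unit first, then hidden units).
        err_o = o * (1 - o) * (y - o)
        err_h = [hj * (1 - hj) * w_out[j] * err_o for j, hj in enumerate(h)]
        # Step 4: update weights and biases.
        for j in range(n_hid):
            w_out[j] += lr * err_o * h[j]
        w_out[-1] += lr * err_o
        for j in range(n_hid):
            for i in range(n_in):
                w_hid[j][i] += lr * err_h[j] * x[i]
            w_hid[j][-1] += lr * err_h[j]

for x, y in data:
    print(x, "target", y, "prediction", round(forward(x)[1], 2))
```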

  31. Learning Algorithm: Backpropagation. Initialize the weights (w0, w1, ..., wk). Adjust the weights in such a way that the output of the ANN is consistent with the class labels of the training examples. Objective function: E = Σ_i [Y_i − f(w, X_i)]². Find the weights w_i that minimize this objective function, e.g., with the backpropagation algorithm.

  32. Efficiency and Interpretability. Efficiency of backpropagation: each epoch (one iteration through the training set) takes O(|D| × w) time, with |D| tuples and w weights, but the number of epochs can be exponential in n, the number of inputs, in the worst case. For easier comprehension: rule extraction by network pruning. Simplify the network structure by removing weighted links that have the least effect on the trained network; then perform link, unit, or activation value clustering. The set of input and activation values is studied to derive rules describing the relationship between the input and hidden unit layers. Sensitivity analysis: assess the impact that a given input variable has on a network output; the knowledge gained from this analysis can be represented in rules.

  33. Neural Network as a Classifier. Weaknesses: long training time; requires a number of parameters that are typically best determined empirically, e.g., the network topology or structure; poor interpretability, since it is difficult to interpret the symbolic meaning behind the learned weights and hidden units in the network. Strengths: high tolerance to noisy data; ability to classify untrained patterns; well suited for continuous-valued inputs and outputs; successful on an array of real-world data, e.g., hand-written letters; algorithms are inherently parallel; techniques have been developed for extracting rules from trained neural networks.

  34. Summary Evaluate Bayes classifiers Evaluate Naive Bayes classifiers Evaluate Bayesian Belief networks Artificial Neural Networks (ANN)

  35. References 1. Han, J., Kamber, M., & Pei, J. (2006). Data Mining: Concepts and Techniques. 3rd edition. Morgan Kaufmann, San Francisco. 2. Tan, P.N., Steinbach, M., & Kumar, V. (2006). Introduction to Data Mining. Addison-Wesley, Michigan. 3. Witten, I. H., & Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques. Second edition. Morgan Kaufmann, San Francisco.

  36. Thank You
