K-Means Clustering Techniques and Hypothesis Testing in CSCE 587

An overview of the fundamentals of K-Means Clustering, its applications, and the reasons to choose it or be cautious, alongside Hypothesis Testing, Association Rules, Regression, Naïve Bayes, and Decision Trees in CSCE 587, covering analytics theory and basic data analysis using R.

  • K-Means Clustering
  • Hypothesis Testing
  • Analytics Theory
  • Data Analysis
  • CSCE 587


Presentation Transcript


  1. CSCE 587 Midterm Review

  2. K-means Clustering

  3. K-Means Clustering - What is it?
  • Used for clustering numerical data, usually a set of measurements about objects of interest.
  • Input: numerical data; there must be a distance metric defined over the variable space (typically Euclidean distance).
  • Output: the centers (centroids) of each discovered cluster, and the assignment of each input datum to a cluster.
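A minimal R sketch of fitting k-means; the iris measurements are used purely as stand-in numeric data, and k = 3 is an arbitrary choice for illustration.

```r
# Minimal k-means sketch (iris measurements used as stand-in numeric data)
data(iris)
X <- scale(iris[, 1:4])               # numeric input only; scale because k-means is not scale-invariant

set.seed(587)                         # results depend on the random initialization
fit <- kmeans(X, centers = 3, nstart = 25)

fit$centers                           # coordinates of the K cluster centroids
head(fit$cluster)                     # cluster assignment for each input datum
```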

  4. Picking K
  • Heuristic: find the "elbow" of the within-sum-of-squares (WSS) plot as a function of K.
  • $\mathrm{WSS} = \sum_{i=1}^{K} \sum_{j=1}^{n_i} \lVert x_{ij} - c_i \rVert^2$, where K is the number of clusters, $n_i$ is the number of points in the i-th cluster, $c_i$ is the centroid of the i-th cluster, and $x_{ij}$ is the j-th point of the i-th cluster.
  • In the example plot on the slide, the "elbows" occur at K = 2, 4, and 6.
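A sketch of the elbow heuristic in R, assuming the same scaled matrix X from the previous sketch; tot.withinss is the WSS quantity defined above.

```r
# Elbow heuristic: plot total within-cluster sum of squares (WSS) against K
wss <- sapply(1:10, function(k) {
  set.seed(587)
  kmeans(X, centers = k, nstart = 25)$tot.withinss
})
plot(1:10, wss, type = "b", xlab = "K (number of clusters)", ylab = "WSS")
# Look for the "elbow": the K after which adding clusters yields little WSS reduction
```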

  5. K-Means Clustering - Reasons to Choose (+) and Cautions (-)
  Reasons to Choose (+):
  • Easy to implement.
  • Easy to assign new data to existing clusters: which is the nearest cluster center?
  • Concise output: the coordinates of the K cluster centers.
  Cautions (-):
  • Doesn't handle categorical variables.
  • Sensitive to initialization (the first guess).
  • Variables should all be measured on similar or compatible scales; not scale-invariant!
  • K (the number of clusters) must be known or decided a priori; a wrong guess can give poor results.
  • Tends to produce "round", equi-sized clusters, which is not always desirable.

  6. Hypothesis Testing CSCE 587

  7. Background Assumptions: samples reflect an underlying distribution. If two distributions are different, the samples should reflect this, and the difference should be testable.

  8. Intuition: Difference of Means. If the two means m1 and m2 are equal, the two sampling distributions largely overlap (the shaded overlap area in the slide's figure is large).

  9. Test Selection
  • One-sample test: compare a sample to the population (actually the sample mean vs. the population mean). Does the sample match the population?
  • Two-sample test: compare two samples, using the sampling distribution of the difference of the means.

  10. One Sample t-Test
  • Test statistic: $t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$, i.e. (sample mean - population mean) / (sample SD / sqrt(# observations)).
  • Used only for tests of the population mean.
  • Null hypothesis: the means are the same.
  • Compare the mean of the sample with the mean of the population. What do we know about the population?
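A minimal R sketch of a one-sample t-test; the sample vector and the hypothesized population mean of 5 are made up for illustration.

```r
# One-sample t-test: does the sample mean differ from the hypothesized population mean?
set.seed(587)
x <- rnorm(30, mean = 5.4, sd = 1)   # made-up sample
t.test(x, mu = 5)                    # H0: population mean equals 5
# The reported t statistic is (mean(x) - 5) / (sd(x) / sqrt(length(x)))
```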

  11. Confidence Intervals
  • Example: Gaussian data $N(\mu, \sigma^2)$. $\bar{x}$ is the estimate of $\mu$ based on n samples.
  • $\mu$ falls in the interval $\bar{x} \pm 2\sigma/\sqrt{n}$ with approx. 95% probability ("95% confidence").
  • If $\bar{x}$ is your estimate of some unknown value $\mu$, the P% confidence interval is the interval around $\bar{x}$ that $\mu$ will fall in, with probability P.
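A sketch of the 95% interval in R, reusing the made-up sample x from the t-test sketch; the plus-or-minus 2 standard errors rule of thumb is shown next to the exact interval reported by t.test().

```r
# Approximate 95% confidence interval for the mean: xbar +/- 2 * s / sqrt(n)
n    <- length(x)
xbar <- mean(x)
se   <- sd(x) / sqrt(n)
c(lower = xbar - 2 * se, upper = xbar + 2 * se)

t.test(x)$conf.int   # exact t-based 95% interval for comparison
```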

  12. Association Rules CSCE 587

  13. Apriori Algorithm - What is it? (Support)
  • Earliest of the association rule algorithms.
  • Frequent itemset: a set of items L that appears together "often enough"; formally, it meets a minimum support criterion.
  • Support: the % of transactions that contain L.
  • Apriori property: any subset of a frequent itemset is also frequent (it has at least the support of its superset).

  14. Apriori Algorithm (Continued) (Confidence)
  • Iteratively grow the frequent itemsets from size 1 to size K (or until we run out of support).
  • The Apriori property tells us how to prune the search space.
  • Frequent itemsets are used to find rules X -> Y with a minimum confidence.
  • Confidence: the % of transactions that contain X which also contain Y.
  • Output: the set of all rules X -> Y with minimum support and confidence.
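A minimal sketch using the arules package (an assumption; the course may use a different tool), mining rules from its bundled Groceries data with illustrative support and confidence thresholds.

```r
# Association rule mining with the apriori algorithm (arules package assumed)
library(arules)
data("Groceries")                     # example transaction data shipped with arules

rules <- apriori(Groceries,
                 parameter = list(support = 0.01, confidence = 0.5))
inspect(head(sort(rules, by = "confidence"), 5))   # top rules X -> Y
```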

  15. Lift and Leverage
  • Lift(X -> Y) = support(X and Y) / (support(X) * support(Y)); lift > 1 suggests X and Y occur together more often than expected if they were independent.
  • Leverage(X -> Y) = support(X and Y) - support(X) * support(Y); leverage > 0 suggests the same, on an additive scale.
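A small sketch computing lift and leverage directly from support values; the numbers are made up for illustration.

```r
# Lift and leverage from support values (illustrative numbers)
supp_X  <- 0.30    # fraction of transactions containing X
supp_Y  <- 0.20    # fraction containing Y
supp_XY <- 0.12    # fraction containing both X and Y

lift     <- supp_XY / (supp_X * supp_Y)     # 2.0: X and Y co-occur twice as often as expected
leverage <- supp_XY - supp_X * supp_Y       # 0.06: excess co-occurrence on an additive scale
c(lift = lift, leverage = leverage)
```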

  16. Apriori - Reasons to Choose (+) and Cautions (-)
  Reasons to Choose (+):
  • Easy to implement.
  • Uses a clever observation (the Apriori property) to prune the search space.
  • Easy to parallelize.
  Cautions (-):
  • Requires many database scans.
  • Exponential time complexity.
  • Can mistakenly find spurious (or coincidental) relationships; addressed with the Lift and Leverage measures.

  17. Linear Regression CSCE 587

  18. Linear Regression - What is it?
  • Used to estimate a continuous value as a linear (additive) function of other variables, for example:
  • Income as a function of years of education, age, gender.
  • House price as a function of the median home price in the neighborhood, square footage, number of bedrooms/bathrooms.
  • Neighborhood house sales in the past year as a function of unemployment, stock prices, etc.
  • Input variables can be continuous or discrete.
  • Output: a set of coefficients that indicate the relative impact of each driver, and a linear expression for predicting the outcome as a function of the drivers.

  19. Technical Description
  • Model: $y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_p x_p$; solve for the $b_i$ using Ordinary Least Squares.
  • Storage is quadratic in the number of variables, and a matrix must be inverted.
  • Categorical variables are expanded to a set of indicator variables, one for each possible value.
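A minimal R sketch of ordinary least squares with lm(); the housing variables echo the earlier example, but the data frame and its columns are made up for illustration.

```r
# Ordinary least squares with lm(); made-up housing data for illustration
set.seed(587)
homes <- data.frame(
  sqft     = runif(100, 800, 3000),
  bedrooms = sample(2:5, 100, replace = TRUE),
  hood     = factor(sample(c("A", "B", "C"), 100, replace = TRUE))
)
homes$price <- 50000 + 120 * homes$sqft + 8000 * homes$bedrooms + rnorm(100, sd = 20000)

fit <- lm(price ~ sqft + bedrooms + hood, data = homes)
coef(fit)     # the b_i: relative impact of each driver
# The factor 'hood' is automatically expanded into indicator variables (hoodB, hoodC)
```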

  20. Linear Regression - Reasons to Choose (+) and Cautions (-)
  Reasons to Choose (+):
  • Concise representation (the coefficients).
  • Robust to redundant or correlated variables (you lose some explanatory value).
  • Explanatory value: the relative impact of each variable on the outcome.
  • Easy to score data.
  Cautions (-):
  • Does not handle missing values well.
  • Assumes that each variable affects the outcome linearly and additively; variable transformations and modeling variable interactions can alleviate this (a good idea is to take the log of monetary amounts or of any variable with a wide dynamic range).
  • Can't handle variables that affect the outcome in a discontinuous way (step functions).
  • Doesn't work well with discrete drivers that have a lot of distinct values (for example, ZIP code).

  21. Logistic Regression CSCE 587

  22. The Logistic Curve
  • Probability: $p = \frac{\exp(z)}{1 + \exp(z)}$
  • Log odds: $\mathrm{LOGIT}(p) = \ln\!\left(\frac{p}{1-p}\right) = z$
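A small R sketch of the two functions above; plogis() and qlogis() are base R's logistic CDF and its inverse (the logit).

```r
# The logistic curve and its inverse (the logit)
z <- seq(-6, 6, by = 0.1)
p <- exp(z) / (1 + exp(z))            # probability as a function of z
all.equal(p, plogis(z))               # same as base R's logistic CDF
plot(z, p, type = "l", ylab = "p (probability)")

qlogis(0.75)                          # LOGIT(p) = ln(p / (1 - p)); log odds for p = 0.75
log(0.75 / 0.25)                      # identical by definition
```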

  23. The Logistic Regression Model
  • $\ln\!\left(\frac{P(Y)}{1 - P(Y)}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_K X_K$
  • $X_1, \dots, X_K$ are the predictor variables; $Y$ is a dichotomous outcome.
  • $\ln\!\left(\frac{P(Y)}{1 - P(Y)}\right)$ is the log(odds) of the outcome.
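A minimal sketch of fitting this model in R with glm(); the data frame and its binary outcome are made up for illustration.

```r
# Logistic regression with glm(); made-up binary-outcome data
set.seed(587)
d <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
d$y <- rbinom(200, 1, plogis(-0.5 + 1.2 * d$x1 - 0.8 * d$x2))

fit <- glm(y ~ x1 + x2, data = d, family = binomial)
coef(fit)                                  # the beta_i, on the log-odds scale
head(predict(fit, type = "response"))      # predicted probabilities P(Y = 1)
```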

  24. Relationship between Odds & Probability
  • $\mathrm{Odds}(\mathrm{event}) = \frac{\mathrm{Probability}(\mathrm{event})}{1 - \mathrm{Probability}(\mathrm{event})}$
  • $\mathrm{Probability}(\mathrm{event}) = \frac{\mathrm{Odds}(\mathrm{event})}{1 + \mathrm{Odds}(\mathrm{event})}$
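A three-line R sketch of the conversion in both directions, using p = 0.8 as an arbitrary example.

```r
p    <- 0.8
odds <- p / (1 - p)        # 4: odds from probability
odds / (1 + odds)          # 0.8: probability recovered from odds
```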

  25. Logistic Regression - Reasons to Choose (+) and Cautions (-)
  Reasons to Choose (+):
  • Explanatory value: the relative impact of each variable on the outcome (though in a more complicated way than linear regression).
  • Robust to redundant or correlated variables (you lose some explanatory value).
  • Concise representation (the coefficients).
  • Easy to score data.
  • Returns good probability estimates of an event.
  • Preserves the summary statistics of the training data ("the probabilities equal the counts").
  Cautions (-):
  • Does not handle missing values well.
  • Assumes that each variable affects the log-odds of the outcome linearly and additively; variable transformations and modeling variable interactions can alleviate this (a good idea is to take the log of monetary amounts or of any variable with a wide dynamic range).
  • Cannot handle variables that affect the outcome in a discontinuous way (step functions).
  • Doesn't work well with discrete drivers that have a lot of distinct values (for example, ZIP code).

  26. Naïve Bayes CSCE 587

  27. Bayesian Classification
  • Problem statement: given features X1, X2, ..., Xn, predict a label Y.

  28. The Bayes Classifier
  • Use Bayes' Rule: $P(Y \mid X_1, \dots, X_n) = \frac{P(X_1, \dots, X_n \mid Y)\, P(Y)}{P(X_1, \dots, X_n)}$, i.e. likelihood times prior over a normalization constant.
  • Why did this help? We think that we might be able to specify how features are generated by the class label.

  29. Model Parameters
  • The problem with explicitly modeling P(X1, ..., Xn | Y) is that there are usually way too many parameters:
  • We'll run out of space.
  • We'll run out of time.
  • And we'll need tons of training data (which is usually not available).

  30. The Naïve Bayes Model
  • The Naïve Bayes assumption: assume that all features are independent given the class label Y.
  • Equationally speaking: $P(X_1, \dots, X_n \mid Y) = \prod_{i=1}^{n} P(X_i \mid Y)$.
  • (We will discuss the validity of this assumption later.)
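A minimal sketch using naiveBayes() from the e1071 package (an assumption; any naïve Bayes implementation works the same way), fitting class-conditional distributions on the built-in iris data. Note that e1071 models numeric inputs with per-class Gaussians rather than requiring discretization.

```r
# Naive Bayes classifier (e1071 package assumed); iris used as stand-in data
library(e1071)
data(iris)

fit <- naiveBayes(Species ~ ., data = iris)   # estimates P(Y) and P(X_i | Y)
fit$apriori                                    # class counts behind the priors P(Y)
predict(fit, head(iris))                       # predicted class labels
predict(fit, head(iris), type = "raw")         # per-class probabilities (use with caution)
```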

  31. Why is this useful?
  • Number of parameters for modeling P(X1, ..., Xn | Y) directly (for binary features and a binary label): 2(2^n - 1).
  • Number of parameters for modeling P(X1 | Y), ..., P(Xn | Y) separately: 2n.

  32. Naïve Bayesian Classifier - Reasons to Choose (+) and Cautions (-)
  Reasons to Choose (+):
  • Handles missing values quite well.
  • Robust to irrelevant variables.
  • Easy to implement.
  • Easy to score data.
  • Resistant to over-fitting.
  • Computationally efficient.
  • Handles very high dimensional problems.
  • Handles categorical variables with a lot of levels.
  Cautions (-):
  • Numeric variables have to be discrete (categorized into intervals).
  • Sensitive to correlated variables ("double-counting").
  • Not good for estimating probabilities; stick to the class label or a yes/no decision.

  33. Decision Trees CSCE 587

  34. Decision Tree - Example of Visual Structure
  • Internal node: a decision on a variable (in the slide's example tree, the root splits on Gender; below it, Income splits at <=45,000 / >45,000 and Age splits at <=40 / >40).
  • Branch: an outcome of the test (e.g., Female / Male).
  • Leaf node: a class label (Yes / No).

  35. General Algorithm
  To construct tree T from training set S:
  • If all examples in S belong to some class in C, or S is sufficiently "pure", then make a leaf labeled C.
  • Otherwise: select the most informative attribute A, partition S according to A's values, and recursively construct sub-trees T1, T2, ... for the subsets of S.
  • The details vary according to the specific algorithm (CART, ID3, C4.5), but the general idea is the same.
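A minimal sketch of building a tree in R with the rpart package (an assumption; a CART-style implementation), again using iris as stand-in data.

```r
# Decision tree with rpart (a CART-style implementation); iris as stand-in data
library(rpart)
data(iris)

tree <- rpart(Species ~ ., data = iris, method = "class")
print(tree)                                  # the learned decision rules, one line per node
predict(tree, head(iris), type = "class")    # score new data by walking the tree
```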

  36. Step 1: Pick the "Most Informative" Attribute
  • Entropy-based methods are one common way: $H = -\sum_{c} p(c) \log_2 p(c)$.
  • H = 0 if p(c) = 0 or 1 for any class, so for binary classification H = 0 is a "pure" node.
  • H is maximum when all classes are equally probable; for binary classification, H = 1 when the classes are 50/50.
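A small R sketch of the entropy calculation written from the standard definition above; the class-proportion vectors are illustrative.

```r
# Shannon entropy of a vector of class proportions
entropy <- function(p) {
  p <- p[p > 0]                  # 0 * log2(0) is taken as 0
  -sum(p * log2(p))
}

entropy(c(1, 0))                 # 0: a "pure" node
entropy(c(0.5, 0.5))             # 1: binary classes split 50/50
entropy(c(0.9, 0.1))             # ~0.469: mostly one class
```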

  37. Step 1: Pick the "Most Informative" Attribute - Information Gain
  • Information gain is the information you gain about the class by knowing the value of an attribute: the entropy of the parent node minus the weighted average entropy of the child nodes produced by splitting on that attribute.
  • So the "most informative" attribute is the attribute with the highest InfoGain.
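A sketch of information gain built on the entropy() helper above; the split into two children with made-up class counts is purely illustrative.

```r
# Information gain = entropy(parent) - weighted average entropy of the children
info_gain <- function(parent_counts, child_counts_list) {
  n <- sum(parent_counts)
  child_term <- sum(sapply(child_counts_list, function(cc) {
    sum(cc) / n * entropy(cc / sum(cc))
  }))
  entropy(parent_counts / n) - child_term
}

# Parent node: 10 Yes / 10 No; a candidate split yields (8 Yes, 2 No) and (2 Yes, 8 No)
info_gain(c(10, 10), list(c(8, 2), c(2, 8)))   # ~0.28 bits gained by this split
```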

  38. Decision Tree Classifier - Reasons to Choose (+) & Cautions (-)
  Reasons to Choose (+):
  • Takes any input type (numeric, categorical); in principle, can handle categorical variables with many distinct values (ZIP code).
  • Robust to redundant or correlated variables.
  • Naturally handles variable interaction.
  • Handles variables that have a non-linear effect on the outcome.
  • Computationally efficient to build.
  • Easy to score data.
  • Many algorithms can return a measure of variable importance.
  • In principle, decision rules are easy to understand.
  Cautions (-):
  • Decision surfaces can only be axis-aligned.
  • Tree structure is sensitive to small changes in the training data.
  • A "deep" tree is probably over-fit, because each split reduces the training data for subsequent splits.
  • Not good for outcomes that depend on many variables (related to the over-fit problem above).
  • Doesn't naturally handle missing values; however, most implementations include a method for dealing with this.
  • In practice, decision rules can be fairly complex.

  39. Conclusion CSCE 587

  40. Typical Questions: Which Classifier Should I Try?
  • Do I want class probabilities, rather than just class labels? -> Logistic Regression, Decision Tree
  • Do I want insight into how the variables affect the model? -> Logistic Regression, Decision Tree
  • Is the problem high-dimensional? -> Naïve Bayes
  • Do I suspect some of the inputs are correlated? -> Decision Tree, Logistic Regression
  • Do I suspect some of the inputs are irrelevant? -> Decision Tree, Naïve Bayes
  • Are there categorical variables with a large number of levels? -> Naïve Bayes, Decision Tree
  • Are there mixed variable types? -> Decision Tree, Logistic Regression
  • Is there non-linear data or discontinuities in the inputs that will affect the outputs? -> Decision Tree

  41. Diagnostics
  • Hold-out data: how well does the model classify new instances?
  • Cross-validation.
  • ROC curve / AUC.
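A minimal hold-out evaluation sketch in R; the pROC package for the ROC curve and AUC is an assumption, and the binary-outcome data frame is made up so the example is self-contained.

```r
# Hold-out diagnostics: train on one split, evaluate on unseen data (pROC assumed for ROC/AUC)
library(pROC)
set.seed(587)
d <- data.frame(x1 = rnorm(400), x2 = rnorm(400))
d$y <- rbinom(400, 1, plogis(1.5 * d$x1 - d$x2))

train <- d[1:300, ]                          # simple hold-out split
test  <- d[301:400, ]

fit  <- glm(y ~ x1 + x2, data = train, family = binomial)
prob <- predict(fit, newdata = test, type = "response")

table(predicted = prob > 0.5, actual = test$y)   # confusion matrix on new instances
auc(roc(test$y, prob))                           # area under the ROC curve
```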
