Data Mining Process and Tools Overview

In the field of data mining, understanding the process of discovering valuable information in vast data repositories is crucial. This involves preparing and analyzing data, making predictions, and using various tools like correlation matrices and scatter plots for evaluation. The seven objectives by Lenox and Cuff provide a structured approach to data mining, including selecting appropriate algorithms and constructing models. Tools like Microsoft Business Intelligence offer a range of analysis methods such as Association Analysis and Cluster Analysis. Understanding data distributions, attribute ranges, and possible relationships between attributes is fundamental in data mining. Through the use of statistical analysis and visualization tools, data miners can uncover insights that are valid, previously unknown, and actionable.

Uploaded by osullivan_c on Mar 21, 2025

Presentation Transcript

Dr. Russell Anderson and Dr. Musa Jafar, West Texas A&M University

Data mining: the process of discovering useful information in large data repositories (Tan, P-N., Steinbach, M., and Kumar, V., Introduction to Data Mining, Addison-Wesley, 2006).

Discovered information should be:
- Valid
- Previously unknown
- Actionable

Seven objectives of Lenox and Cuff in 2002 (based on ACM 2001 Ironman Report):
- Prepare and warehouse data
- Process data based on a set of DM algorithms
- Analyze results
- Make predictions
- Select proper algorithm
- Make application
- Motivated to continue graduate studies in DM

We have added:
- Get to know the data using statistical analysis tools
- Use visualization tools for analysis and review

1. Get to know the data.
2. Select an appropriate data mining algorithm based on the data and the mining objective.
3. Construct a model using the selected algorithm.
4. Analyze the results.
5. Make application.

- How is it structured?
  - Single table/flat file
  - Multi-table relationships
- Number of observations
- Number of dimensions (attributes)
- Compute summary statistics using a tool such as MS-Excel
- Visually evaluate characteristics of the data
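The slide names MS-Excel for summary statistics; the same first look at a flat table can be sketched in Python. This is a minimal sketch using only the standard library. The attribute names and values are hypothetical and chosen only for illustration.

```python
import statistics

# Hypothetical flat-file data: each row is one observation,
# each key is one dimension (attribute).
rows = [
    {"cap_diameter": 5.0, "stalk_height": 7.0},
    {"cap_diameter": 6.0, "stalk_height": 9.0},
    {"cap_diameter": 7.0, "stalk_height": 8.0},
    {"cap_diameter": 10.0, "stalk_height": 12.0},
]

def summarize(rows):
    """Return the table shape and per-attribute summary statistics."""
    attributes = list(rows[0].keys())
    summary = {"observations": len(rows), "dimensions": len(attributes)}
    for name in attributes:
        values = [row[name] for row in rows]
        summary[name] = {
            "min": min(values),
            "max": max(values),
            "mean": statistics.mean(values),
            "median": statistics.median(values),
            "stdev": statistics.stdev(values),  # sample standard deviation
        }
    return summary

summary = summarize(rows)
print(summary["observations"], "observations,", summary["dimensions"], "dimensions")
for name in rows[0]:
    print(name, summary[name])
```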

Tools developed:
- Correlation Matrix
- Scatter Plot
- Parallel Coordinate Plot
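To make the first of these tools concrete, here is a minimal sketch of the numbers behind a correlation matrix: the Pearson coefficient for every pair of numeric attributes. It is not the authors' tool; the column names and values are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Map every pair of attribute names to its Pearson coefficient."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical numeric attributes.
columns = {
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],  # rises linearly with x
    "z": [8.0, 6.0, 4.0, 2.0],  # falls linearly with x
}

matrix = correlation_matrix(columns)
for a in columns:
    print(a, [round(matrix[a][b], 2) for b in columns])
```

Values near +1 or -1 flag attribute pairs worth inspecting in a scatter plot; values near 0 suggest no linear relationship, though a nonlinear one may still exist.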

- Distributions of data
  - Data ranges of numeric attributes
  - Cardinality of discrete attributes
  - Shape of distribution
    - Skewed
    - Multi-modal
  - Location of outliers
- Identification of possible relationships between attributes
- Identification of subpopulations within the data
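Several of these characteristics can also be checked numerically before plotting. The sketch below is an illustration under assumptions, not the presenters' method: it uses population skewness for shape, the common 1.5 x IQR rule for outliers, and a simple value count for cardinality. The data values are hypothetical.

```python
import statistics
from collections import Counter

def skewness(values):
    """Population skewness; a positive value means a longer right tail."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

def iqr_outliers(values, k=1.5):
    """Values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

def cardinality(labels):
    """Count of observations per distinct value of a discrete attribute."""
    return Counter(labels)

# Hypothetical attributes.
weights = [1, 2, 2, 3, 3, 3, 4, 4, 5, 30]
colors = ["brown", "white", "brown", "red", "brown"]

print("range:", min(weights), "to", max(weights))
print("skewness:", round(skewness(weights), 2))
print("outliers:", iqr_outliers(weights))
print("cardinality:", len(cardinality(colors)), dict(cardinality(colors)))
```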

Microsoft Business Intelligence Tools
- Association Analysis (aka market basket analysis)
- Classification
  - Decision Trees
  - Artificial Neural Network
  - Bayesian Analysis
- Regression
- Cluster Analysis

Custom Tools with Embedded Visual Presentation
- Artificial neural network for both classification and regression
- Self-Organizing Map (SOM) for cluster analysis
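As an illustration of what association (market basket) analysis measures, the sketch below computes the two standard rule measures, support and confidence, over a handful of hypothetical baskets. It does not reproduce how the Microsoft tools implement the analysis.

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(1 for basket in baskets if items <= basket) / len(baskets)

def confidence(baskets, antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(baskets, both) / support(baskets, antecedent)

# Hypothetical transactions.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print("support(bread, milk):", support(baskets, {"bread", "milk"}))
print("confidence(bread -> milk):", round(confidence(baskets, {"bread"}, {"milk"}), 2))
```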

- Purpose of each methodology
- Steps of underlying algorithm
- Data types supported
- Issues in construction and application
- Parameter settings
- Results interpretation

Does the model fit the training data too well? The available data needs to be separated into training and validation subsets. A visual view of training progress is valuable.
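A minimal sketch of both ideas follows: a holdout split, and a check of where validation error stops improving while training error keeps falling. The 30% validation fraction, the seed, and the error curves are hypothetical values chosen for illustration.

```python
import random

def train_validation_split(records, validation_fraction=0.3, seed=0):
    """Shuffle a copy of the records and hold out a fraction for validation."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]

def best_epoch(validation_errors):
    """Index of the epoch with the lowest validation error."""
    return min(range(len(validation_errors)), key=validation_errors.__getitem__)

records = list(range(10))
train, validation = train_validation_split(records)
print(len(train), "training records,", len(validation), "validation records")

# Hypothetical error curves: training error keeps dropping, while
# validation error turns upward once the model starts to overfit.
training_errors = [0.40, 0.25, 0.15, 0.08, 0.04, 0.02]
validation_errors = [0.42, 0.30, 0.22, 0.20, 0.24, 0.29]

stop_at = best_epoch(validation_errors)
print("validation error is lowest at epoch", stop_at)
```

Plotting the two curves against the epoch number gives the visual view of training progress: the point where they diverge is where the model begins fitting the training data too well.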

Mushroom edibility classifiers

Classifier A
                     Predicted Edible   Predicted Poisonous
Actual Edible              38%                  8%
Actual Poisonous            0%                 54%

Classifier B
                     Predicted Edible   Predicted Poisonous
Actual Edible              44%                  2%
Actual Poisonous            1%                 53%
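The arithmetic behind comparing the two classifiers can be sketched as follows. The percentages are those on the slide. The sketch assumes that rows are actual classes and columns are predicted classes, which is the reading under which both classifiers see the same 46% edible and 54% poisonous split. The function names are hypothetical.

```python
# Percentages from the slide, keyed as (actual, predicted).
# Assumption: rows are actual classes, columns are predicted classes.
classifier_a = {
    ("edible", "edible"): 38,
    ("edible", "poisonous"): 8,
    ("poisonous", "edible"): 0,
    ("poisonous", "poisonous"): 54,
}
classifier_b = {
    ("edible", "edible"): 44,
    ("edible", "poisonous"): 2,
    ("poisonous", "edible"): 1,
    ("poisonous", "poisonous"): 53,
}

def accuracy(matrix):
    """Share of all cases where the predicted class matches the actual class."""
    correct = sum(share for (actual, predicted), share in matrix.items()
                  if actual == predicted)
    return correct / sum(matrix.values())

def poisonous_called_edible(matrix):
    """Share of all cases where a poisonous mushroom is predicted edible."""
    return matrix[("poisonous", "edible")] / sum(matrix.values())

for name, matrix in (("A", classifier_a), ("B", classifier_b)):
    print(name, "accuracy:", accuracy(matrix),
          "poisonous called edible:", poisonous_called_edible(matrix))
```

Under that assumption, Classifier B is more accurate overall, but only Classifier A never labels a poisonous mushroom as edible, so overall accuracy alone does not settle which model to apply.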

Black Box: models built using sophisticated methodologies (ANNs, for example) perform very well, but gaining an understanding of the model itself is difficult.
- Contribution of individual input attributes
- Nature of contribution (shape of curve)
- Interaction between input attributes
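One common way to probe a black-box model is to vary a single input while holding the others at a baseline and record the output. The sketch below illustrates that idea only. `black_box` is a hypothetical stand-in for a trained model, not the presenters' ANN, and the baseline and probe values are made up.

```python
def black_box(inputs):
    """Hypothetical stand-in for a trained model such as an ANN."""
    return 2.0 * inputs["a"] + inputs["b"] ** 2

def response_curve(model, baseline, attribute, values):
    """Model output as one attribute varies and the rest stay at baseline."""
    curve = []
    for value in values:
        probe = dict(baseline)  # copy so the baseline is left unchanged
        probe[attribute] = value
        curve.append((value, model(probe)))
    return curve

baseline = {"a": 1.0, "b": 0.0}
curve_a = response_curve(black_box, baseline, "a", [0.0, 1.0, 2.0])
curve_b = response_curve(black_box, baseline, "b", [0.0, 1.0, 2.0])

print("a:", curve_a)  # output rises in equal steps: a linear contribution
print("b:", curve_b)  # output rises in growing steps: a curved contribution
```

Varying one attribute at a time shows the size and shape of each contribution but not interactions between attributes; those would need two or more inputs varied together.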

For a detailed presentation of the mechanics of the software deployed, attend our workshop tomorrow morning.

Saturday, 8-10 AM, Kachina A
- Microsoft SQL Server Business Intelligence Studio
- Visualization Tools
